
Metrics With Greater Utility: The Community Manager Use Case

1 Introduction

Community managers take a variety of perspectives, depending on where their communities are in the lifecycle of growth, maturity and decline. This is an evolving report of what we are learning from community managers, some of whom we are working with on live experiments with Augur (http://www.github.com/CHAOSS/augur), a prototype metrics tool from the CHAOSS project. At this point we are focusing particularly on how community managers consume metrics, and on how the presentation of open source software health and sustainability metrics can make them more, and in some cases less, useful for doing their jobs.

Right now, based on Augur prototypes and follow-up discussions so far, we have the following observations, which will inform our work both in the "Growth, Maturity and Decline" working group and in Augur development. These features in Augur are particularly valued by the community managers we prototyped with:

  1. Allowing comparisons with projects within a defined universe is essential
  2. Allowing community managers to periodically add and remove the repositories they monitor.
  3. Downloadable graphics
  4. Downloadable data (.csv or .json)
  5. Availability of a "metrics API", limiting the amount of software infrastructure the community manager needs to maintain. This is more valued right now by program managers overseeing larger portfolios, but we think its appeal will grow as the relative light weight of this approach becomes more apparent. By apparent, we really mean "easy to use and understand"; right now it is that for a programmer, but less so for a community manager without that background or current interest.

2 Date Summarized Comparison Metrics

With these advantages in mind, making the most of this opportunity to help community managers with useful metrics will include the availability of date summarized comparison metrics. These types of metrics take two "filters" or "parameters", which are more abstractly defined in the Growth, Maturity and Decline metrics of the CHAOSS project:

  1.  Given a pool of repositories of interest to a community manager, rank them in ascending or descending order by a metric, either:
    1.  over a specified time period, or
    2.  over a specified periodicity (e.g., month) for a length of time (e.g., year).

For example, one open source program office we talked with is interested in the following set of date summarized comparison metrics. Given a pool of repositories of interest to the program office (dozens to hundreds of repositories):

  1.  What ten repositories have the most commits this year (straight commits, and lines of code)?
  2.  How many new projects were launched this year?
  3.  What are the top ten new repositories in terms of commits this year (straight commits, and lines of code)?
  4.  How many commits and lines of code were contributed by outside contributors this calendar year? Organizationally sponsored contributors?
  5.  What organizations are the top five external contributors of commits, comments and merges?
  6.  What are the total number of repository watchers we have across all of our projects?
  7.  Which repositories have the most stars? Of the ones new this year? Of all the projects? Which projects have the most new stars this year?
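To make this concrete, here is a minimal sketch of the program office's first question, assuming a metrics API in the style Augur exposes. The base URL, route, and JSON field names are illustrative assumptions, not Augur's confirmed interface.

```python
# Minimal sketch: rank a pool of repositories by commits this year via a
# metrics API. Base URL, route, and field names are assumptions for
# illustration, not Augur's confirmed interface.
import requests

API = "http://twitter.augurlabs.io/api/unstable"  # hypothetical base URL

def commits_this_year(owner, repo, year="2018"):
    """Sum weekly commit counts for one year from an assumed timeseries route."""
    resp = requests.get(f"{API}/{owner}/{repo}/timeseries/commits")
    resp.raise_for_status()
    return sum(row["commits"] for row in resp.json()
               if row["date"].startswith(year))  # assumed row shape

pool = [("twitter", "finagle"), ("twitter", "scalding"), ("twitter", "heron")]
ranked = sorted(pool, key=lambda r: commits_this_year(*r), reverse=True)
for owner, repo in ranked[:10]:  # "top ten repositories by commits this year"
    print(owner, repo)
```

The same pattern would extend to lines of code, stars, or new repositories by swapping the metric route and the pool.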

3 Open Ended Community Manager Questions to Support with Metrics

There are other, more open ended questions that may be useful to open source community managers:

  1.  Is a repository active?
    1.  Visual differentiation that examines issue and commit data
    2.  Activity in the past 30 days
    3.  Across all repositories, present the 50th percentile as a baseline and show repositories above and below that line.
  2.  Should we archive this repository?
    1.  Enable an input from the manager after reviewing statistics.
    2.  Activity level, inactivity level and dependencies
    3.  Mean/Median/Mode histogram for commits/repo
  3.  Should we feature this repository in our top 10? (Probably a subjective decision based on some kind of composite scoring system that is likely specific to the needs of every community manager or program office.)
  4. Who are our top authors? (Some kind of aggregated contribution ranking by time period [year, month, week, day?]. Nominally, I have a concern about these kinds of metrics being "gameable", but if they are not visible to contributors themselves, there is less "gaming" opportunity.)
  5.  What are our top repositories? (Probably a subjective decision based on some kind of composite scoring system that is likely specific to the needs of every community manager or program office.)
  6.  Most active repositories by time period [Week? Month? Year?]. Activity to be revealed through a mix of Retention and Maintainer activity primarily focusing on the latter. Number of issues and commits. Also the frequency of pull requests and the number of closed issues.
  7.  Least active repositories by time period [Week? Month? Year?]. Bottom of scores calculated, as above.
  8.  Who is our most active contributor? (Same aggregated contribution ranking, and the same "gaming" concern, as the top-authors question above.)
  9.  What new contributors submitted their first patches or issues this week? (Visualization note: new contributors can be colored distinctly in visualizations, and a separate graph can chart the number of new contributors over time.)
  10.  Which contributors became inactive? (Will need a mechanism for setting "inactive" thresholds; a sketch covering this and the previous question follows this list.)
  11.  Baseline level for the ”average” repository in an organization and for each, individual organization repository.
  12.  What projects outside of a community manager’s general view (GitHub organization or other boundary) do my repositories depend on, or do my contributors also significantly contribute to?
  13.  Build a summary report in 140 characters or less. For example, "Your total commits in this time period [week? month?] across the organization increased 12% over the last period. Your most active repositories remained the same. You have 8 new contributors, which is 1 below your mean for the past year. For more information, click here."
  14.  Once a metrics baseline is established, what can be done to move it?
  15.  Are there optimal measures for some metrics?
    1.  Pull request size?
    2.  Ratio of maintainers to contributors?
    3.  New contributor to consistent contributor ratio?
    4.  New contributor to maintainer ratio?
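For questions 9 and 10 above, here is a small sketch of the bookkeeping involved, computed from a flat list of (author, date) commit records. The data and the 90-day inactivity threshold are made-up example values.

```python
# Sketch: flag first-time contributors this week and newly inactive
# contributors, from a flat list of (author, date) commit records.
# The 90-day inactivity threshold is an arbitrary example value.
from datetime import date, timedelta

commits = [
    ("alice", date(2018, 5, 7)),
    ("bob",   date(2018, 1, 15)),
    ("carol", date(2018, 5, 9)),  # carol's first commit
]

today = date(2018, 5, 11)
week_start = today - timedelta(days=7)
inactive_after = timedelta(days=90)  # assumed "inactive" threshold

first_seen, last_seen = {}, {}
for author, day in commits:
    first_seen[author] = min(day, first_seen.get(author, day))
    last_seen[author] = max(day, last_seen.get(author, day))

new_this_week = [a for a, d in first_seen.items() if d >= week_start]
newly_inactive = [a for a, d in last_seen.items() if today - d > inactive_after]
print("new contributors:", new_this_week)
print("inactive contributors:", newly_inactive)
```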

4 Augur Specific Design Change Recommendations

Next is a list of Augur specific design changes suggested thus far, based on conversations with community managers.

  1.  Showing all of the projects in a GitHub organization in a dashboard by default is generally useful.
  2.  Make the lines in the charts clearer, especially when multiple lines are being compared.
  3.  How to zoom in and out is not intuitive. In Google Finance, for example, a default subset period is displayed in the "below the line mirrored line" interface this design is modeled after. That older model makes it fairly clear that the box below the line is for adjusting the range of dates. Alternately, Google's more recent way of representing time, giving users choices, and showing comparisons may be even more useful and engaging. In general, it is important that the time zooming be clearer.
Figure 1: In one view, Google lets you see a one-year window of a stock's performance.
Figure 2: In another view, you can choose a three-month period. Comparing the two time periods also draws out the trend with red or green colors, depending on whether the index, in this case a stock's price, has increased or decreased overall during the selected time period.
Figure 3: Comparisons are similarly interesting in Google's finance interface. You can simply add a number of stocks, in much the same way our users want to add a number of different repositories.
  4.  For the projects a community manager chooses to follow, give them comparison checkboxes at the top of the page. From a design point of view, we should limit comparisons, as discussed, to 7 or 8, simply due to the limits of human visual perception.
  5.  The ability to adjust the viewing windows to a month summary level is desired.
  6.  Right now Augur does not make it clear that metrics are, by default, aggregated by week.
  7.  New contributor response time. When a new contributor joins a project, what is the response time for their contribution?
  8.  A graph comparing commits and commit comments on the x and y axes between projects is desired; the same for issues and issue comments.
  9.  In general, the most recent two years of data get the most use. We should focus our default display on this range.

5 Data Source Trust Issues

  1.  Greater transparency of metrics data origins will be helpful for understanding discrepancies between current understanding and what metrics show.
    1.  We should include some detailed notes from Brian Warner about how Facade counts lines of code, and possibly some instrumentation that lets those counts be altered by user-provided parameters (a sketch of what that might look like follows this list).
    2.  Outside contributor organization data. One community manager reported that their lines-of-code-by-organization data looks wrong. I explained that these are mapped from a list of companies and emails we put together, and getting this right is something community managers will need some kind of mapping tool for. GitDM is a tool people sometimes use to create these maps, and Augur follows a derivative of that work. Maintaining these affiliation lists probably needs to be made easier for community managers, especially where the set of organizations contributing to a project is diverse. (There is a substantial range among the community managers we spoke with: some manage complex ecosystems involving mostly outside contributors, most are in the middle, and some have contributor lists highly skewed toward their own organization.)
  2.  GHTorrent data, while excellent for prototyping, faces some limitations under the scrutiny of community managers. For example, when using the cloned repositories and then going back to issues, the issues data in GHTorrent does not "look right". The GitHub GraphQL API might offer a way for us to store issue statistics we pull directly from GitHub and update periodically, as an alternative to GHTorrent.
  3.  When issues are moved from an older system, like Gerrit, into GitHub issues, in general the statistics for the converted issues are dodgy, even through the GitHub API. We are likely to encounter this, and at some point may want to include Gerrit data in a common data structure with issues from GitHub and other sources.
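As a rough illustration of what user-adjustable line counting could mean, here is a minimal sketch. The filters shown (path excludes, whitespace-only lines) are parameters I am assuming for illustration; they are not Facade options.

```python
# Sketch: user-tunable line counting, in the spirit of making Facade-style
# counts adjustable. The filters shown are example parameters, not Facade's.
def count_line(path, line, exclude_dirs=("vendor/", "third_party/"),
               ignore_whitespace_only=True):
    """Return True if this added/removed line should count toward LOC totals."""
    if any(path.startswith(d) for d in exclude_dirs):
        return False  # skip vendored or third-party code
    if ignore_whitespace_only and not line.strip():
        return False  # skip whitespace-only changes
    return True

diff = [("vendor/lib.js", "var x = 1;"),
        ("src/app.py", "print('hi')"),
        ("src/app.py", "   ")]
print(sum(count_line(p, l) for p, l in diff))  # counts only the real src/ change
```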

6 New Metrics Suggested

  1.  Add metric ”number of clones”
  2.  ”Unique visitors” to a repository is a data point available from the GitHub API which is interesting.
  3.  Include a metric that is the ratio of new committers to total committers in a time period, or perhaps simply those two metrics displayed together. Seeing the number of new committers in a set of repositories can be a useful indication of momentum in one direction or another, though I hasten to add that this is not canonically the case. (A sketch of this ratio follows this list.)
  4.  Some kind of representation of the ratio between commits and lines of code per commit
  5.  Test coverage within a repository is something to consider measuring for safety critical systems software.
  6.  Identifying the relationship between the DCO and the CLA.
  7.  There is a tension between risk and value that, as our metrics develop in those areas, we are well advised to keep in mind.
  8.  The work that Matt Snell and Matt Germonprez at the University of Nebraska-Omaha are starting related to risk metrics is of great interest. Getting these metrics into Augur is something we should plan for as soon as reasonably possible.
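For the new-committer ratio suggested above, here is a minimal sketch of the bookkeeping, computed from a flat list of (author, date) commit records; the data is made up for illustration.

```python
# Sketch: ratio of new committers to total committers per month, from
# (author, date) commit records. Purely illustrative data.
from collections import defaultdict
from datetime import date

commits = [
    ("alice", date(2018, 3, 2)),
    ("bob",   date(2018, 3, 20)),
    ("alice", date(2018, 4, 5)),
    ("carol", date(2018, 4, 18)),  # carol is new in April
]

by_month = defaultdict(list)
for author, day in sorted(commits, key=lambda c: c[1]):
    by_month[(day.year, day.month)].append(author)

seen = set()
for month in sorted(by_month):
    authors = set(by_month[month])
    new = authors - seen  # committers never seen in an earlier month
    seen |= authors
    print(month, "new/total committers: %d/%d = %.2f"
          % (len(new), len(authors), len(new) / len(authors)))
```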

7 Design Possibilities

7.1 Augur

For Augur, I think the interface changes that enable comparisons, and more self-apparent ways to compress or expand time (as in the Google Finance examples), are at the top of the list of things that will make Augur more useful for Kate and other community managers. Feedback on these notes will be helpful. I think the new-committers-to-total-committers ratio is important, as is enabling comparisons across projects in the bubble graphs. Transparency about data sources and their limitations, for both the API and the front end, which is above average but not complete, is also important.

7.2 Growth Maturity and Decline Working Group

Many of the metrics of interest to community managers fall under the "Growth, Maturity and Decline" working group. From a design perspective it appears that, possibly, the way metrics are expressed and consumed by these stakeholders, in their individual derivatives of the community manager use case, is quite far removed from the detailed definition work occurring around specific metrics. Discussion around an example implementation like Augur is helping draw out some of this more "zoomed out" feedback. The design of system interfaces frequently includes the need to navigate between granular details and the overall user experience (Zemel et al., 2007; Barab et al., 2007). This is less of a focus in the development of software engineering metrics, though recent research is beginning to illustrate the criticality of visual design for interpreting analytic information (González-Torres et al., 2016).

8 Acknowledgements

Many members of the CHAOSS community contributed to this report and analysis. I am happy to share names with permission from the contributors, but I have not requested permission as of the publication date.

References

  • S. Barab, T. Dodge, M. Thomas, C. Jackson and H. Tuzun (2007) Our designs and the social agendas they carry. Journal of the Learning Sciences 16 (2), pp. 263–305. Cited by: §7.2.
  • A. González-Torres, F. J. García-Peñalvo, R. Therón-Sánchez and R. Colomo-Palacios (2016) Knowledge discovery in software teams by means of evolutionary visual software analytics. Science of Computer Programming 121, pp. 55–74. Cited by: §7.2.
  • A. Zemel, T. Koschmann, C. LeBaron and P. Feltovich (2007) What are we missing? Usability's indexical ground. Computer Supported Cooperative Work. Cited by: §7.2.

Phil Agre’s Practical Republic (Because UCLA Finally Took his Pages Down)

Phil Agre wrote thoughtfully and critically about artificial intelligence and the role of technology in the political process (among other things). The takeaways I have from this paper include:

  1. Social skills are essential for anyone seeking influence in the political process
  2. A lot of political theory to date completely misses this essential point
  3. Issue entrepreneurship is a more effective path of influence for most individuals.
  4. Phil Agre was ~20 years ahead of his time

The article is attached here in the interest of the public good.

Agre – 2004 – The practical republic Social skills and the prog


Data Science and Analytics Program Founded by Dr. Goggins Wins Award

The University of Missouri Data Science and Analytics program received the Outstanding Credit Program Award from the University Professional and Continuing Education Association (UPCEA) during the Central Region Conference in St. Louis.
The Data Science and Analytics Masters Program was conceptualized by Dr. Sean P. Goggins and Dr. Chi-Ren Shyu in the spring of 2013, following Dr. Goggins' work on a similar program at Drexel University and Dr. Shyu's long-standing work in data-science-oriented endeavors, including founding the MU Informatics Institute over a decade ago.
Through support from the Mizzou Advantage fund, Grant Scott joined our leadership team in 2015. Later in 2015, core DSA Faculty from across campus signed on to the effort, including:
  1. Yi Shang shangy@missouri.edu : Computer Science, Course Coordinator
  2. Dong Xu xudong@missouri.edu : Computer Science
  3. Trupti Joshi joshitr@missouri.edu : Computer Science
  4. Harsh Taneja tanejah@missouri.edu : Journalism
  5. Esther L. Thorson thorsone@missouri.edu : Strategic Communications, Course Coordinator
  6. David Herzog herzogd@missouri.edu : Journalism, Course Coordinator
  7. Jeffrey Uhlmann uhlmannj@missouri.edu : Computer Science
  8. Twyla G. Gibson gibsontg@missouri.edu : School of Information Science and Learning Technologies
  9. Sanda Erdelez erdelezs@missouri.edu : School of Information Science and Learning Technologies
  10. Chi-Ren Shyu shyuc@missouri.edu : Director, MU Informatics Institute
  11. Joi Moore moorejoi@missouri.edu : School of Information Science and Learning Technologies, Course Coordinator
  12. Ilker Ersoy ersoyi@health.missouri.edu : Biotechnology, Course Coordinator

Helpful and Useful – The Open Source Software Metrics Holy Grail

1 Introduction

My colleague Matt Germonprez recently hit me and around 50 other people at CHAOSSCON North America (2018) with this observation:

A lot of times we get really great answers to the wrong questions.

Matt explained this phenomenon as "type III error", an allusion to the better-known statistical phenomena of type I and type II errors. If you are trying to solve a problem or improve a situation, sometimes great answers to the wrong questions can still be useful, because in all likelihood somebody is looking for the answer to that question! Or maybe it answers another curiosity you were not even thinking about. I think we should call this metrics encountering (Erdelez, 1997). There's an old adage:

Even a blind squirrel finds a nut every once in a while.

For open source professionals, a "blind squirrel" is little more than a potential name for a jazz trio, and probably not the right imagery for explaining to your boss that you're "working on open source metrics". Yet these blind squirrels will encounter nuts a LOT more often if we make more nuts! "Metrics are nuts!" Not a good slogan, but that's my metaphor. Making more metrics is easy for us because we have lots of data and we write software, and it stands to reason that more metrics encountering is going to generate more useful metrics. If you are the blind squirrel, it's useful to find metrics.

Can you imagine all the useful things blind squirrels would find if we let them loose in an Ikea? "I came for the Swedish meatballs, I left with 2 closet organizing systems and a new kitchen!" A lot of things are useful, but in order for something to be helpful it needs to help you meet an important goal. To summarize:

  •  Useful: Of all the different things I find in the Ikea, many of them are useful. Or: there are 75 metrics on this dashboard, and 3 of them are useful!
  •  Helpful: You go into the endeavor with a goal, and leave with 3 metrics that help you achieve that goal. Or: you're a blind squirrel that just ordered nuts online from Ikea.

2 Open Source Software Health Metrics: Let's Go Crazy! Let's Get Nuts!

Great answers to the wrong questions are more commonplace than we would prefer, because open source software work is evolving quickly and we do not yet have a list of the right questions for many specific project situations. Let's refer to questions as "metrics" now. Questions and metrics are nuts! Still a terrible slogan. Sometimes we do not know the question-metric-nut, and foraging through a forest of metrics is, if not helpful, a way to reduce the rising anxiety we feel when we are not sure what data helps support our explanation of what is happening in a project ecosystem. So if, like me and dozens of others working in and around the CHAOSS project, you are trying to achieve a goal for your project, there are two orthogonal, strategic starting points our CHAOSS colleague Jesus M. Gonzalez-Barahona suggests:

  1. Goals: What are metrics going to help you accomplish?
  2. Use Cases: When you go to use metrics, what are the use cases you have? A case can be simple, ill formed and even 'unpretty':
    1. "My manager wants to know if anyone else is working on this project."
    2. "It seems like my community is leveling off? Is it? Or is it just so large now I cannot tell?"

2.1 Taking Action by Sharing Goals and Use Cases

Having a yard full of nuts to sort through can help you work toward the nuts you want. OK, the nut metaphor has gone too far. We are looking to use software, provided as a prototype and an example, to help talk through the details of use cases you name, with you. The use cases that open source developers, foundations, community managers and others use to evaluate open source software health and sustainability metrics are probably a manageable number.

We can give you some metrics to work with quickly using the CHAOSS sponsored metrics prototyping tool Augur.

What are we trying to accomplish with metrics? With Augur? One of our goals is to make it easier for open source stakeholders to "get their bearings" on a project and understand "how things are going". We think that is most easily accomplished when comparisons to your own project over time, and to other projects you are familiar with, are readily available. Augur makes comparisons central.

2.2 Building Helpful Metrics

If you have already shared a list of repositories you are interested in with us, here's what you have:

  1. an Augur site with those repos
  2. the opportunity to look at that site and help the whole CHAOSS community know:
    1. what use cases particular metrics help you address
    2. what goals you have that could be met by something like Augur, but that you cannot meet yet
    3. something to hate. If you've ever been to an NHL game, you know that hating the other team is how we show our team we love them. It's also a good brainstorming device.

So, OK. What do you want?

We want the opportunity to speak with you about your goals, use cases, and the failings of the tools currently at your disposal for "getting there". If you're feeling adventurous, I would like to be able to reference our conversations (anonymously) in research papers, because research papers are kind of the "code of the academic world". That's less important.

2.3 An Augur Experiment


If you do not have a list of repositories you have already shared with us, there are a few examples here: http://www.augurlabs.io/live-examples/.

Design Goals

The version of Augur that's currently deployed has several design goals that seek to provide useful information through comparison within a project (over time) and across projects. The most fundamental metrics people are interested in include the following (a sketch follows the list):

  •  What individuals committed the most lines of code in a time period?
  •  From what companies or other organizations are the individuals who committed the most lines of code in a time period?
  •  Derivative of the first two: Is this changing? Did I lose anyone? Who can this project NOT afford to lose?
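Here is a minimal sketch of the first question, in the spirit of what Facade computes from cloned repositories. It shells out to git and is illustrative only, not Facade's actual implementation; the repository path and date window are placeholders.

```python
# Sketch: lines of code per author over a time window, computed from
# git log --numstat. Illustrative only, not Facade's implementation.
import subprocess
from collections import Counter

def lines_by_author(repo_path, since="2018-01-01", until="2018-12-31"):
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}", f"--until={until}",
         "--numstat", "--format=@%aE"],  # mark each commit with the author email
        capture_output=True, text=True, check=True).stdout
    totals, author = Counter(), None
    for line in out.splitlines():
        if line.startswith("@"):
            author = line[1:]          # new commit: remember its author
        elif line.strip():
            added, removed, _path = line.split("\t")
            if added != "-":           # binary files report "-"
                totals[author] += int(added) + int(removed)
    return totals.most_common()

print(lines_by_author("/path/to/repo")[:2])  # top two authors by lines changed
```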

Projects You Care About

Figure 1 is an example from Twitter, showing an instance of Augur configured for all of the repositories in the Twitter ecosystem. When you go to http://twitter.augurlabs.io you get the list of repositories that you see in Figure 1.


Figure 1: When you follow the URL above, or your own URL, you will see a list of repositories that we have cloned and for which, using the technology behind "Facade" (a tool written by Brian Warner), we have calculated the salient, basic, individual repository information.

Looking at my projects

When I look at the most basic data for one of my repositories, I have enough information to answer the most basic questions about it (see above). Figure 2 and Figure 3 illustrate the Augur pages you will see at the next level of "drill down". Try clicking the months for even more information! Keep in mind this is ONLY the information for the repositories you shared with us, or the repositories that are part of one of our other live examples.

Figure 2: You can see the lines of code from the top two authors, as well as the space-inefficient Augur toolbar. Please contact me if you have tips and tricks for getting developers to be more comfortable with putting aesthetics ahead of utility in web page design. I will buy you a case of beer.

Figure 3 is a second image of the same page, scrolled down just far enough to see that you can look at the top ten contributors as well as the top organizational contributors. We used a list of over 500 top-level domains, as well as tech companies we were able to "guess", to start to resolve even these prototypes to specific companies. We did this because Amye asked us to, and we're really gunning to make Gluster have more lustre. As if that's possible.

Figure 3: A more detailed look at some of the information available on a repository by repository basis in Augur. We also show you the organizational affiliation information.

3 Explore the Rest of Augur

The focused repository pages give the information that many open source folks tell us is their first line of interest when looking at their own projects. Keeping this conversation going is essential for the CHAOSS project, and for Augur's utility in helping us identify which metrics map to which use cases and goals. There's a lot here, and it might give you ideas. Also, as you go through the front end, keep in mind that all of the statistics you see represented as metrics are also available via our RESTful API. You can use our data to explore building your own metrics (a sketch follows the figures below), or get an app developer to do that for you. Figure 4 provides a high level overview of the metrics representations in Augur that are built off the GitHub API, GHTorrent and Facade's technology.

Figure 4: There's a lot here. At the top of the screen you can enter an owner and a repository name to get information about a particular repository. Each of the CHAOSS metric working groups is represented in tabs at the top of the screen (number 1). The repository you just searched for is listed below the metric category (number 2). The metric name is listed in the title (number 3), and that title corresponds with a CHAOSS metric that is linked below the graphic. These are line graphs, though other visualization styles are readily available, and the line over time is shown by (number 4). The gray area around (number 4) is the standard deviation. (Number 5) is a slider like you see on Google Finance, so you can zoom in on one period of time more closely. Finally, (number 6) has a LOT of different configuration and filtering options you can explore.
Figure 5: Here is a WAY zoomed out overview of the Growth, Maturity and Decline metrics you might see on the Augur page. (Number 1) is where you might enter another "owner/repo" combination to compare your repository to. (Number 2) illustrates that sometimes there is no data available from the source we use for a particular metric.

Figure 6: This shows you two repositories compared with each other in Augur. Does this fit any of your use cases or goals? How would you make it different? (Number 1) shows which two repositories are being compared. (Number 2) shows the key for knowing which project is which. (Number 3) points out, again, that you can see the CHAOSS definition for the metric any time you like. To the right, you can also see how .json, .csv and .svg representations of the data can be downloaded for you to make whatever use you would like of them.
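Following the invitation above to build your own metrics from the API, here is a minimal sketch that composes a custom metric (a weekly issue-close ratio) from two assumed timeseries routes. The base URL, routes, and JSON field names are illustrative assumptions, not Augur's confirmed interface.

```python
# Minimal sketch: compose a custom metric (weekly issue-close ratio) from
# two assumed API routes. Routes and field names are illustrative only.
import requests

BASE = "http://twitter.augurlabs.io/api/unstable"  # example live instance

def timeseries(owner, repo, metric):
    """Fetch an assumed timeseries route and key rows by date."""
    rows = requests.get(f"{BASE}/{owner}/{repo}/timeseries/{metric}").json()
    return {row["date"]: row[metric] for row in rows}  # assumed row shape

opened = timeseries("twitter", "finagle", "issues")
closed = timeseries("twitter", "finagle", "issues_closed")
for week in sorted(set(opened) & set(closed)):
    if opened[week]:
        print(week, "close ratio: %.2f" % (closed[week] / opened[week]))
```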

4 Our Ask: Goals and Use Cases

Metrics use cases

What are the questions you have about your project? What metrics will help you to make clearer sense of the answer to that question in a productive way?

Give us your use cases

Walk through trying to solve the use case. Where do you get stuck? How might the use case become generalized? If you are an expert in OpenStack you can contribute … you can just describe the use case. Draw out the use cases that you see. We can ask back: why not use metric x and y? And the conversation will really get going!

References

  • S. Erdelez (1997) Information Encountering: A Conceptual Framework for Accidental Information Discovery. Taylor Graham Publishing, Tampere, Finland. Cited by: §1.


This post originally appeared at http://www.chaoss.community and http://www.augurlabs.io

On the Art of the Bio

Writing a personal bio is difficult because you have to talk about yourself as though you actually think you are all that and a bag of chips. I mean, we all do, right? Still, it's a weird task and I do not enjoy it. And these things are more dynamic than you would think, because what I do, especially as an academic, has to be refined for the language of a particular audience: students, colleagues, funders and family, for example. Here are a couple that I recently put together. Now it's a blog post.

If you are looking for more of a press release flavored bio, here are a few choices:

Bio 1: After a decade as a software engineer, Sean decided his calling was in research. He is presently a social computing researcher and professor of computer science at the University of Missouri. He is also a co-director and founder of their Data Science Masters program. Sean’s publications focus on understanding how social technologies influence organizational, small group and community dynamics, typically including analysis of electronic trace data from systems combined with the perspectives of people whose behavior is traced. Group informatics is a methodology and ontology Sean has articulated with the aim of helping build consensus among researchers and developers for how to ethically and systematically make sense of electronic trace data.  Structural fluidity, a construct Sean developed with his collaborators Peppo Valetto and Kelly Blincoe, aims to make sense of structural dynamics in virtual software organizations, and how those dynamics affect performance. Working with Josh Introne, Bryan Semaan and Ingrid Erickson, Sean is elaborating on mechanisms for identifying structural fluidity and organizational dynamics in electronic trace data using the lens of complex systems theory. His other work includes collaborations with Matt Germonprez on the Open Collaboration Data Exchange and Open Source Health metrics projects. He lives in Columbia, MO with his wife Kate, two step daughters and a dog named Huckleberry.

Bio 2: Sean Goggins is just a guy. He writes stuff. He’s selfish, but not as selfish as he used to be. He’s painfully well organized, which means he has detailed lists of all the tasks he’s behind on. Computer Science. Social Computing. Learning Analytics. Learning Sciences. Small Groups. Published. Teaches. Funded. Does not suffer fools well. Eats control freaks for lunch. Pulled his groin on a bike ride last Sunday. Is generally concerned about the state of the world, and has enough self assuredness to think what he does each day could possibly make a difference. So, he’s naive. But not as naive as he used to be. He likes to ride his bicycle. 2 tattoos. Father. Step Father. Husband. Currently avoiding writing an actual bio.


Software Engineering and Data Science

People get excited about data science. Especially managers. It's instinctive. We are surrounded by data, nearly all of it overwhelming. Like the partner we dated through high school, it seems like there is something there, but it just doesn't ever seem to come together. Data science is the camping trip where we figure each other out in our deluge of data.

When you head down that road, you are initially overwhelmed by three factoids. First, there is SO MUCH DATA. Second, the data is SO DISORGANIZED. Third, THERE ARE SO MANY TOOLS! We go down the rabbit hole.

Data scientists are, therefore, the janitors on the scene of a massive sewage leak, standing in the workshop (tool room). What makes data scientists successful or not: that's what managers want to know. How do I know this person can clean up my sewage leak? There are two paths:

  1. The data scientist knows your business domain, and has figured out which tools work for your mess.
  2. The data scientist has learned about all the tools, and has probably cleaned up other messes in a few assorted domains.

Conceptually, software engineering is about little more than being systematic about how you approach a project and its lifecycle. The discipline can be applied in application development, infrastructure, data science and food preparation (among a host of domains). Yeah, you can do software engineering on food. If you disagree, come over and try out my digital chicken.

I get to say I am a data scientist today because I have a Ph.D., a bunch of papers, and I have been working in “Big Data” since before somebody invented “Big Data”. Some day, somebody please tell me what “Big Data” is, other than an awkward euphemism that is not helping with the gender gap in computing disciplines.

Getting beyond Ph.D.-level credibility requirements requires systematic training and a software engineering discipline around data. That's kind of what I do with my projects, which are spread across a host of GitHub organizations. Many of our repositories remain private because my teams and I continue to publish on them. If you want a peek, drop me a line. Here's a list of GitHub organizations for data science work that I operate:

  1. http://www.github.com/sociallycompute
  2. http://www.github.com/OCDX
  3. http://www.github.com/expert-patients
  4. http://www.github.com/sgoggins
  5. http://sociallycompute.io

Software engineering. Data science. Together. That’s kind of a thing I do. Kind of one of the ways I maintain such a long list of projects.
