The University of Missouri Data Science and Analytics program received the Outstanding Credit Program Award from the University Professional and Continuing Education Association (UPCEA) during the Central Region Conference in St. Louis.
The Data Science and Analytics Masters Program was conceptualized by Dr. Sean P. Goggins and Dr. Chi-Ren Shyu in the spring of 2013, following Dr. Goggins’s work on a similar program at Drexel University and Dr. Shyu’s long-standing work in data science oriented endeavors, including founding the MU Informatics Institute over a decade ago.
Through support from the Mizzou Advantage fund, Grant Scott joined our leadership team in 2015. Later in 2015, core DSA Faculty from across campus signed on to the effort, including:
A lot of times we get really great answers to the wrong questions.
Matt explained this phenomenon as a “type III error”, an allusion to the better-known statistical phenomena of type I and type II errors. If you are trying to solve a problem or improve a situation, sometimes great answers to the wrong questions can still be useful because in all likelihood somebody is looking for the answer to that question! Or maybe it answers another curiosity you were not even thinking about. I think we should call this metric encountering (Erdelez, 1997). There’s an old adage:
Even a blind squirrel finds a nut every once in a while.
For open source professionals a “Blind Squirrel” is little more than the potential name for a jazz trio, and probably not the right imagery for explaining to your boss that you’re “working on open source metrics”. Yet these blind squirrels will encounter nuts a LOT more often if we make more nuts! “Metrics are nuts!” Not a good slogan, but that’s my metaphor. Making more metrics is easy for us because we have lots of data, we write software, and it stands to reason that more metric encountering is going to generate more useful metrics. If you are the blind squirrel, it’s useful to find metrics.
Can you imagine all the useful things blind squirrels would find if we let them loose in an Ikea? “I came for the Swedish meatballs, I left with 2 closet organizing systems and a new kitchen!” A lot of things are useful, but in order for something to be helpful it needs to help you meet an important goal. To summarize:
– Useful: Of all the different things I find in the Ikea, many of them are useful. Or, there are 75 metrics on this dashboard, and 3 of them are useful!
– Helpful: You go into the endeavor with a goal, and leave with 3 metrics that help you achieve that goal. Or, you’re a blind squirrel that just ordered nuts online from Ikea.
2 Open Source Software Health Metrics: Let’s Go Crazy! Let’s Get Nuts!
Great answers to the wrong questions are more commonplace than we prefer because open source software work is evolving quickly and we do not yet have a list of the right questions for many specific project situations. Let’s refer to questions as “metrics” now. Questions and metrics are nuts! Still a terrible slogan. Sometimes we do not know the question-metric-nut, and foraging through a forest of metrics is, if not helpful, a way to reduce the rising anxiety we feel when we are not sure what data helps to support our explanation of what is happening in a project ecosystem. So if, like me and dozens of others working in and around the CHAOSS project, you are trying to achieve a goal for your project, there are two orthogonal, strategic starting points that our colleague in CHAOSS, Jesus M. Gonzalez-Barahona, suggests:
1. Goals: What are metrics going to help you accomplish?
2. Use Cases: When you go to use metrics, what are the use cases you have? A case can be simple, ill-formed and even ‘unpretty’:
(a) “My manager wants to know if anyone else is working on this project?”
(b) “It seems like my community is leveling off? Is it? Or is it just so large now I cannot tell?”
2.1 Taking Action by Sharing Goals and Use Cases
Having a yard full of nuts to sort through can help you work toward the nuts you want. OK. The nut metaphor has gone too far. We are looking to use software, provided as a prototype and an example, to help talk through the details of the use cases you name. With you. The use cases that open source developers, foundations, community managers and others bring to open source software health and sustainability metrics are probably manageable in number.
We can give you some metrics to work with quickly using the CHAOSS-sponsored metrics prototyping tool Augur.
What are we trying to accomplish with metrics? With Augur? One of our goals is to make it easier for open source stakeholders to “get their bearings” on a project and understand “how things are going”. We think that’s most easily accomplished when comparisons to your own project over time, and to other projects you are familiar with, are readily available. Augur makes comparisons central.
2.2 Building Helpful Metrics
If you have already shared a list of repositories you are interested in with us, here’s what you have:
1. An Augur site with those repos
2. The opportunity to look at that site and help the whole CHAOSS community know:
(a) Which use cases particular metrics help you address
(b) What goals you have that could be met by something like Augur, but you cannot meet yet
(c) Something to hate. If you’ve ever been to an NHL game, you know that hating the other team is how we show our team we love them. It’s also a good brainstorming device.
So, OK. What do you want?
We want the opportunity to speak with you about your goals, use cases, and the failings of tools currently at your disposal for “getting there”. If you’re feeling adventurous, I would like to be able to reference our conversations (anonymously) in research papers, because research papers are kind of the “code of the academic world”. That’s less important.
The version of Augur that’s currently deployed has several design goals that seek to provide useful information through comparison within a project (over time) and across projects. The most fundamental metrics people are interested in include:
– What individuals committed the most lines of code in a time period?
– From what companies or other organizations are the individuals who committed the most lines of code in a time period?
– Derivative of the first two: Is this changing? Did I lose anyone? Who can this project NOT afford to lose?
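To make those three questions concrete, here is a minimal sketch in Python. It is not Augur’s code or schema: the `Commit` record, the `lines_by` and `lost_contributors` helpers, and the sample data are all hypothetical, made up only to show the shape of the calculation (sum lines per author or organization inside a time window, then compare two windows to see who dropped out).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    # Hypothetical record shape for illustration; Augur's real schema may differ.
    author: str
    organization: str
    day: date
    lines_changed: int


def lines_by(commits, key, start, end):
    """Sum lines changed per `key` ("author" or "organization") in [start, end)."""
    totals = Counter()
    for c in commits:
        if start <= c.day < end:
            totals[getattr(c, key)] += c.lines_changed
    return totals


def lost_contributors(commits, earlier, later):
    """Authors who committed in the `earlier` window but not in the `later` one."""
    before = lines_by(commits, "author", *earlier)
    after = lines_by(commits, "author", *later)
    return sorted(set(before) - set(after))


# Made-up sample data.
commits = [
    Commit("ana", "ExampleCorp", date(2018, 1, 10), 400),
    Commit("ben", "ExampleCorp", date(2018, 1, 20), 150),
    Commit("cy", "Volunteer", date(2018, 1, 25), 90),
    Commit("ana", "ExampleCorp", date(2018, 2, 5), 120),
    Commit("cy", "Volunteer", date(2018, 2, 14), 300),
]
january = (date(2018, 1, 1), date(2018, 2, 1))
february = (date(2018, 2, 1), date(2018, 3, 1))

top_authors = lines_by(commits, "author", *january).most_common(2)
top_orgs = lines_by(commits, "organization", *january).most_common(1)
lost = lost_contributors(commits, january, february)
```

With this sample data, `top_authors` ranks ana ahead of ben for January, `top_orgs` puts ExampleCorp first, and `lost` contains ben, who committed in January but not in February. Lines of code is a blunt measure, so treat the ranking as a conversation starter rather than a verdict.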
Projects You Care About
Figure 1 is an example from Twitter, which shows an instance of Augur configured for all of the repositories in the Twitter ecosystem. When you go to http://twitter.augurlabs.io you get the list of repositories that you see in Figure 1.
Looking at my projects
When I look at the most basic data for one of my repositories, I have enough information to answer the most basic questions about it (see above). Figure 2 and Figure 3 illustrate the Augur pages you will see at the next level of “drill down”. Try clicking the months for even more information! Keep in mind this is ONLY the information for the repositories you shared with us, or the repositories that are part of one of our other live examples.
Figure 3 is a second image of the same page, but scrolled down just far enough to see that you can look at the top ten contributors as well as the top organizational contributors. We used a list of over 500 top-level domains, as well as tech companies we were able to “guess”, to start to resolve even these prototypes to specific companies. We did this because Amye asked us to, and we’re really gunning to make Gluster have more lustre. As if that’s possible.
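For a feel of what that kind of “guessing” can look like, here is a rough sketch in Python. It is an assumption about the general approach, not the code we run: the two tiny lookup tables and the `guess_organization` function are hypothetical stand-ins for the much longer list mentioned above.

```python
# Hypothetical lookup tables for illustration only; the real list is much
# longer and is not reproduced here.
KNOWN_DOMAINS = {
    "redhat.com": "Red Hat",
    "twitter.com": "Twitter",
}
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com"}


def guess_organization(email):
    """Guess a committer's organization from the domain of their email address."""
    if "@" not in email:
        return "Unknown"
    domain = email.rsplit("@", 1)[1].strip().lower()
    if domain in PERSONAL_DOMAINS:
        # A personal mailbox tells us nothing about an employer.
        return "Unaffiliated"
    labels = domain.split(".")
    # Try the full domain first, then drop leading subdomain labels one by one
    # (eng.redhat.com -> redhat.com).
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_DOMAINS:
            return KNOWN_DOMAINS[candidate]
    return "Unknown"


guesses = [
    guess_organization(address)
    for address in ("dev@eng.redhat.com", "someone@gmail.com", "me@example.org")
]
```

Anything that does not match stays “Unknown” rather than being forced into a company, which keeps the organizational totals honest about how much is still a guess.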
3 Explore the Rest of Augur
The focused repositories give the information that many open source folks tell us is their first line of interest when looking at their own projects. Keeping this conversation going is essential for the CHAOSS project, and for Augur’s usefulness in helping us identify which metrics map to which use cases and goals. There’s a lot here, and it might give you ideas. Also, as you go through the front end, keep in mind that all of the statistics you see represented as metrics are also available via our RESTful API. You can use our data to explore building your own metrics. Or get an app developer to do that for you. Figure 4 provides a high-level overview of the metrics representations on Augur that are built off the GitHub API, GHTorrent and Facade’s technology.
4 Our Ask: Goals and Use Cases
Metrics use cases
What are the questions you have about your project? What metrics will help you to make clearer sense of the answer to that question in a productive way?
Give us your use cases
Walk through trying to solve the use case. Where do you get stuck? How might the use case become generalized? If you are an expert in OpenStack you can contribute … you can just describe the use case. Draw out the use cases that you see. We can ask back, why not use metric x and y? And the conversation will really get going!
S. Erdelez (1997) Information Encountering: A Conceptual Framework for Accidental Information Discovery. Taylor Graham Publishing, Tampere, Finland.
Writing a personal bio is difficult because you have to talk about yourself as though you actually think you are all that and a bag of chips. I mean, we all do, right? Still, it’s a weird task and I do not enjoy it. And these things are more dynamic than you would think because what I do, especially as an academic, has to be refined for the language of a particular audience. Students, colleagues, funders and family, for example. Here are a couple that I recently put together. Now it’s a blog post.
If you are looking for more of a press release flavored bio, here are a few choices:
Bio 1: After a decade as a software engineer, Sean decided his calling was in research. He is presently a social computing researcher and professor of computer science at the University of Missouri. He is also a co-director and founder of their Data Science Masters program. Sean’s publications focus on understanding how social technologies influence organizational, small group and community dynamics, typically including analysis of electronic trace data from systems combined with the perspectives of people whose behavior is traced. Group informatics is a methodology and ontology Sean has articulated with the aim of helping build consensus among researchers and developers for how to ethically and systematically make sense of electronic trace data. Structural fluidity, a construct Sean developed with his collaborators Peppo Valetto and Kelly Blincoe, aims to make sense of structural dynamics in virtual software organizations, and how those dynamics affect performance. Working with Josh Introne, Bryan Semaan and Ingrid Erickson, Sean is elaborating on mechanisms for identifying structural fluidity and organizational dynamics in electronic trace data using the lens of complex systems theory. His other work includes collaborations with Matt Germonprez on the Open Collaboration Data Exchange and Open Source Health metrics projects. He lives in Columbia, MO with his wife Kate, two stepdaughters and a dog named Huckleberry.
Bio 2: Sean Goggins is just a guy. He writes stuff. He’s selfish, but not as selfish as he used to be. He’s painfully well organized, which means he has detailed lists of all the tasks he’s behind on. Computer Science. Social Computing. Learning Analytics. Learning Sciences. Small Groups. Published. Teaches. Funded. Does not suffer fools well. Eats control freaks for lunch. Pulled his groin on a bike ride last Sunday. Is generally concerned about the state of the world, and has enough self-assuredness to think what he does each day could possibly make a difference. So, he’s naive. But not as naive as he used to be. He likes to ride his bicycle. 2 tattoos. Father. Stepfather. Husband. Currently avoiding writing an actual bio.
People get excited about data science. Especially managers. It’s instinctive. We are surrounded by data, nearly all of it overwhelming. Like the partner we dated through high school, it seems like there is something there, but it just doesn’t ever seem to come together. Data science is the camping trip where we figure each other out in our deluge of data.
When you head down that road, you are overwhelmed initially by 3 factoids. First, there is SO MUCH DATA. Second, the data is SO DISORGANIZED. Third, THERE ARE SO MANY TOOLS! We go down the rabbit hole.
Data scientists are, therefore, the janitors on the scene of a massive sewage leak. In the workshop (tool room). What makes data scientists successful or not: that’s what managers want to know. How do I *know* this person can clean up my sewage leak? There are 2 paths:
1. The data scientist knows your business domain, and has figured out which tools work for your mess.
2. The data scientist has learned about all the tools, and has probably cleaned up other messes in a few assorted domains.
Conceptually, software engineering is about little more than being systematic about how you approach a project and its lifecycle. The discipline can be applied in application development, infrastructure, data science and food preparation (among a host of domains). Yeah, you can do software engineering on food. If you disagree, come over and try out my digital chicken.
I get to say I am a data scientist today because I have a Ph.D., a bunch of papers, and I have been working in “Big Data” since before somebody invented “Big Data”. Some day, somebody please tell me what “Big Data” is, other than an awkward euphemism that is not helping with the gender gap in computing disciplines.
Getting beyond Ph.D.-level credibility requirements requires systematic training and a software engineering discipline around data. That’s kind of what I do with my projects, which are spread across a host of GitHub Organizations. Many of our repositories remain private because my teams and I continue to publish on them. If you want a peek, drop me a line. Here’s a list of GitHub Organizations for Data Science work that I operate: