Software Engineering and Data Science

People get excited about data science. Especially managers. It’s instinctive. We are surrounded by data, nearly all of it overwhelming. Like the partner we dated through high school, it seems like there is something there, but it just doesn’t ever seem to come together. Data science is the camping trip where we figure each other out in our deluge of data.

When you head down that road, you are initially overwhelmed by three factoids. First, there is SO MUCH DATA. Second, the data is SO DISORGANIZED. Third, THERE ARE SO MANY TOOLS! We go down the rabbit hole.

Data scientists are, therefore, the janitors on the scene of a massive sewage leak in the workshop (the tool room). What makes a data scientist successful or not is what managers want to know: how do I *know* this person can clean up my sewage leak? There are two paths:

  1. The data scientist knows your business domain and has figured out which tools work for your mess.
  2. The data scientist has learned about all the tools and has probably cleaned up other messes in a few assorted domains.

Conceptually, software engineering is about little more than being systematic about how you approach a project and its lifecycle. The discipline can be applied to application development, infrastructure, data science, and food preparation (among a host of other domains). Yeah, you can do software engineering on food. If you disagree, come over and try out my digital chicken.

I get to say I am a data scientist today because I have a Ph.D., a bunch of papers, and I have been working in “Big Data” since before somebody invented “Big Data”. Some day, somebody please tell me what “Big Data” is, other than an awkward euphemism that is not helping with the gender gap in computing disciplines.

Getting beyond Ph.D.-level credibility requirements takes systematic training and a software engineering discipline around data. That’s kind of what I do with my projects, which are spread across a host of GitHub Organizations. Many of our repositories remain private because my teams and I continue to publish on them. If you want a peek, drop me a line. Here’s a list of GitHub Organizations for Data Science work that I operate:


Software engineering. Data science. Together. That’s kind of a thing I do. Kind of one of the ways I maintain such a long list of projects.

