The roles of the Data Scientist and Data Steward in a world of "Big Data"
An article on SearchDataManagement caught my attention this week. The broad premise of the article is that there is an unhealthy tension between the role of the Data Steward (“rigour good, indiscipline bad”) and Data Scientists (“freedom good, structure bad”), and that data governance gets in the way of delivering analytical value. The authorities quoted in the article all represent technology or services vendors in the “data analytics camp” (Shawn Rogers from Enterprise Management Associates, Jill Dyché from SAS, Jonathan Geiger from Intelligent Solutions, William McKnight from McKnight Consulting).
Now, call me a cynic (“You’re a cynic”), but don’t these vendors all have a vested interest in getting the tools and projects deployed in the fastest time possible, and Devil take the hindmost?!
Well, I’m not buying it.
As far as I’m concerned, the role of the Data Steward doesn't get set aside just because we're operating in the "Big Data" space, and in my opinion the article’s "straw man" perspective on Data Scientists' attitudes ("hands off, let the data speak") is naïve; indeed, in my experience it doesn't actually bear out in the real world.
Sure, the dynamics, speed of delivery and overall level of definitional rigour may change, at least in the short term. But applying good data stewardship principles to “Big Data” projects doesn’t need to mean slowing things down. Indeed, a general understanding of the semantics, interpretation and context of the data set is vital in order to derive any meaning from the data anyway. That cannot be done in isolation. In contrast to the “set aside” attitude offered by Dyché et al., I’d offer that bringing data stewardship to bear early in the process will enable more informed curation of the data set and provide a feedback loop to improve the overall quality of the data set for the longer term (including the ability to think about and adapt the data to meet other business needs).
The key thing is that there is a real opportunity for collaboration and co-operation. The data scientist brings tools, analytical and data processing expertise; the business data steward brings understanding of the value and utility of the data in context - which to my mind is a precursor to any data analytic task anyway. Both are necessary, neither is sufficient.
The only way that the “do it fast, do it with the current data, don’t do any cleansing” approach can work is if you’ve got an individual who can offer both Data Scientist and Data Steward perspectives. And let’s face it, how many “unicorns” are out there who can genuinely offer both perspectives?
In any event, the business also needs to remain responsible for the data set throughout. And if opportunities arise to improve the data so that it is more fit for the purpose(s) that are required of it, so much the better.