The roles of the Data Scientist and Data Steward in a world of "Big Data"
An article on SearchDataManagement caught my attention this week. The broad premise
of the article is that there is an unhealthy tension between the role of the
Data Steward (“rigour good, indiscipline bad”) and Data Scientists (“freedom
good, structure bad”), and that data governance gets in the way of delivering
analytical value. The authorities quoted in the article all represent
technology or services vendors who are in the “data analytics
camp” (Shawn Rogers from Enterprise Management Associates, Jill Dyché from SAS,
Jonathan Geiger from Intelligent Solutions, William McKnight from McKnight
Consulting).
Now, call me a cynic (“You’re a cynic”),
but don’t these vendors all have a vested interest in getting the tools and
projects deployed in the fastest time possible, and Devil take the hindmost?!
Well I’m not buying it.
As far as I’m concerned, the role
of the Data Steward doesn't get set aside just because we're operating in
the "Big Data" space, and in my opinion the article’s "straw
man" perspective on Data Scientists' attitudes ("hands off, let the
data speak") is naïve; indeed, in my experience it doesn't actually bear
out in the real world.
Sure, the dynamics, speed of delivery and overall
level of definitional
rigour may change, at least in the short term. But applying good data
stewardship principles to “Big Data” projects doesn’t need to mean slowing
things down. Indeed, a general
understanding of the semantics, interpretation and context of the data set
is vital in order to derive any meaning from the data anyway. That cannot be
done in isolation. In contrast to the “set aside” attitude offered by Dyché et
al, I’d offer that bringing data stewardship to bear early in the process will
enable more informed curation of the data set and provide a feedback loop to
improve the overall quality of the data set for the longer term (including the
ability to think about and adapt the data to meet other business needs).
The key thing is that there is a real
opportunity for collaboration and co-operation. The data scientist brings
tools and analytical and data processing expertise; the business data steward
brings understanding of the value and utility of the data in context - which to
my mind is a precursor to any data analytic task anyway. Both are necessary,
neither is sufficient.
The only way that the “do it fast, do it
with the current data, don’t do any cleansing” approach can work is if you’ve
got an individual who can offer both Data Scientist and Data Steward
perspectives. And let’s face it, how many “unicorns” are out there who can
genuinely offer both perspectives?
In any event, the business also needs to
remain responsible for the data set throughout. And if opportunities arise to
improve the data so that it is more fit for the purpose(s) that are required of
it, so much the better.
Hi Alan
Great article. I agree with everything you say apart from the role of the Data Steward. The foundation to all data quality, and this is no different for Big Data, is the underlying data architecture.
So, if big data is to be trusted, we need people who can ensure that it is structured in a manner that will not cause it to produce spurious results.
A Data Steward might be able to point out results are spurious or incorrect but would be unable to determine why and how to rectify it. What is needed is a Data Architect who can ensure that the data structures have the integrity to produce the correct results every time based on the data being queried.
Because pragmatic Data Architects are now so thin on the ground compared to 20 years ago, enterprises everywhere are currently very likely to be making critical business decisions based on flawed inferences made by queries on Big Data.
One major data black hole to which the whole industry remains blind is the violation of Fifth Normal Form (5NF), which is almost impossible to avoid when data sets from different sources are joined in queries. Even if the underlying data is 100% correct, violation of 5NF will always produce errors and no technology can detect or rectify these errors.
A pragmatic Data Architect, on the other hand, would be able to ensure that the underlying architecture of the data was such that it would produce no such errors.
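As a rough illustration of the kind of join error being described here, the sketch below uses entirely hypothetical data (the agents, companies and products are made up for the example, and it is not taken from any real system). Three "sources" each hold a correct pairwise view of the same facts, yet joining them back together produces a row that was never true. It shows only that such spurious rows can arise, not how often they do in practice.

```python
# Hypothetical ground truth: (agent, company, product) facts.
facts = {
    ("Smith", "Ford", "car"),
    ("Smith", "GM", "truck"),
    ("Jones", "Ford", "truck"),
}

# Three separate "sources", each a correct pairwise view of the facts.
agent_company = {(a, c) for a, c, p in facts}
agent_product = {(a, p) for a, c, p in facts}
company_product = {(c, p) for a, c, p in facts}

# Natural join of the three views: keep every (agent, company, product)
# combination that is consistent with all three sources.
joined = {
    (a, c, p)
    for a, c in agent_company
    for a2, p in agent_product
    if a == a2 and (c, p) in company_product
}

# Rows the join produced that are not in the original facts.
spurious = joined - facts
print(sorted(spurious))  # [('Smith', 'Ford', 'truck')]
```

Every individual source row is accurate, but the joined result claims that Smith sells Ford trucks, which the original facts never said. Nothing in the three sources alone reveals that this row is wrong; it takes knowledge of how the data was structured in the first place.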
The current lack of Data Architecture knowledge and skills is going to ensure that Big Data will produce some really Big Errors. The only question is, how big a catastrophe must occur before people are willing to admit that something is seriously wrong?
Again, great article.
Regards
John
Hi Alan
I agree.
The link between successful governance and successful big data analytics was, largely, borne out by research run (independently) in the US and Europe - check out this post for links to the research http://dataqualitymatters.wordpress.com/2013/09/16/big-data-quality-matters/
Gary
Hi John - totally agree WRT the involvement of the Data Architect (or someone with the skills to play that role). In an ideal world, this would be so.
It's an additional role that goes into the mix - and one that I deliberately chose to leave out of the blog, as it would "muddy the water" in the argument of responding to the "leave the raw data alone" mantra noted in the originating SearchDataManagement article.
Note that I am differentiating between *roles* and *individuals* - it's possible to encapsulate more than one role within the same person. You'd like to think (hope?!) that any Data Scientist worth their salt would understand the process of data design and be able to deal with any data structuring considerations as part of any data analytics project. I've worked with some very good statisticians who do work this way and build their analytical models "from the data up". I've also worked with some stinkers who just don't get it and are totally tools-bound.
To find an individual that can play all three roles? That's not just a unicorn. That's a horse of a different colour...! ( http://oz.wikia.com/wiki/Horse_of_a_Different_Color )
Thanks for the link Gary - there's definitely a strong body of opinion and plenty of research evidence linking analytic success to DQ effort (and failure when DQ is ignored)
The Information Week BI & Analytics survey highlights DQ as the biggest barrier to successful Analytics: http://reports.informationweek.com/abstract/81/11715/Business-Intelligence-and-Information-Management/Research:-2014-Analytics,-BI,-and-Information-Management-Survey.html
Additional comments from the associated LinkedIn thread:
Thanks Emma. At best, I think any approach that does no pre-planning about the content quality is naive (at worst, could be considered negligent).
Clearly, the BI/Analytic tool vendors have a vested interest in getting their products deployed quick-smart (that's when their license revenues hit, after all!), so no wonder they're pushing a message that says "to hell with data modeling & quality, we haven't got time for that!!"
But most statistical analysts don't actually work that way anyway.
I find that the opposite is often true - statisticians will often spend inordinate amounts of time formatting, structuring and manipulating data, correcting anomalies and eliminating outliers in order to "make sure the model is right". They typically spend 80% of their time on data prep.
More power to them, says I!
Except that the knowledge and understanding of the underlying data and all the inferred rules that have been applied hardly ever get documented, shared or replicated back into the source data, so all the good data diagnosis work is then lost. There's also the problem that while the Statto almost always believes that they fully understand the business problem, they may not have the same perspectives as the business users themselves. (Leading to confusion, mis-communication and disappointment).
This is one of the reasons why I'm so excited to be advising QFire Software on their distributed data quality approach. QFire supports analysts who need to get going and just do some work on the data, while at the same time providing an environment where any DQ rules and the library of data sets can be captured, propagated and shared with colleagues. (They're also doing some interesting things in the area of data preparation & DQ for "Big Data".)
A more optimistic perspective presented by Shlomo Argamon on Information-Management.com: http://www.information-management.com/news/the-myth-of-the-mythical-unicorn-10025694-1.html
Alan, thanks for referencing my article. I agree wholeheartedly with your analysis that we need a rigorous approach to data stewardship and contextualization; the growth of more and more automated analytics solutions makes this problem even greater, since more analysis can be done without deep understanding of what the data truly represent. I think that the notion of the "data scientist" as a kind of "shoot-from-the-hip" analyst (to exaggerate somewhat) isn't the right conception - the data scientist must also be capable of understanding the business and environmental context of the data and the questions to be answered. Certainly, different team members will have specific expertise in different parts of the problem, but all need to be able to communicate about and understand the larger context of the data science problem. Otherwise, the data scientist is no better than a (presumably) well-designed black-box bit of machine learning software.
Thanks Shlomo - I think we've got a real challenge in our industry to move the practitioners beyond being technically oriented and towards having the skills to convey context and narrative based on the evidence. Sadly, I think we're a long way from that being pervasive - most BI and Analytic people I encounter are still largely acting as conduits and nothing more.
This article from Ted Cuzzillo really resonated with me, but I don't reckon on there being too many story-tellers around (at least, not yet!): http://www.information-management.com/news/cue-the-data-storytellers-the-data-industrys-next-big-stars-10025687-1.html
SAS Data Quality Steward