
Monday, 24 February 2014

Hunting for unicorns

The roles of the Data Scientist and Data Steward in a world of "Big Data"

An article on SearchDataManagement caught my attention this week. The broad premise of the article is that there is an unhealthy tension between the role of the Data Steward (“rigour good, indiscipline bad”) and Data Scientists (“freedom good, structure bad”), and that data governance gets in the way of delivering analytical value. The authorities quoted in the article all represent technology or services vendors in the “data analytics camp” (Shawn Rogers from Enterprise Management Associates, Jill Dyché from SAS, Jonathan Geiger from Intelligent Solutions, William McKnight from McKnight Consulting).

Now, call me a cynic (“You’re a cynic”), but don’t all these vendors have a vested interest in getting the tools and projects deployed in the fastest time possible, and Devil take the hindmost?!

Well I’m not buying it.

As far as I’m concerned, the role of the Data Steward doesn't get set aside just because we're operating in the "Big Data" space, and in my opinion the article’s "straw man" perspective on Data Scientists' attitudes ("hands off, let the data speak") is naïve; indeed, in my experience it doesn't actually bear out in the real world.

Sure, the dynamics, speed of delivery and overall level of definitional rigour may change, at least in the short term. But applying good data stewardship principles to “Big Data” projects doesn’t need to mean slowing things down. Indeed, a general understanding of the semantics, interpretation and context of the data set is vital in order to derive any meaning from the data anyway. That cannot be done in isolation. In contrast to the “set aside” attitude offered by Dyché et al., I’d offer that bringing data stewardship to bear early in the process will enable more informed curation of the data set and provide a feedback loop to improve the overall quality of the data set for the longer term (including the ability to think about and adapt the data to meet other business needs).

The key thing is that there is a real opportunity for collaboration and co-operation. The data scientist brings tools and analytical and data processing expertise; the business data steward brings understanding of the value and utility of the data in context, which to my mind is a precursor to any data analytic task anyway. Both are necessary, neither is sufficient.

The only way that the “do it fast, do it with the current data, don’t do any cleansing” approach can work is if you’ve got an individual who can offer both Data Scientist and Data Steward perspectives. And let’s face it, how many “unicorns” are out there who can genuinely offer both perspectives?

In any event, the business also needs to remain responsible for the data set throughout. And if opportunities arise to improve the data so that it is more fit for the purpose(s) that are required of it, so much the better.


  1. Hi Alan

    Great article. I agree with everything you say apart from the role of the Data Steward. The foundation to all data quality, and this is no different for Big Data, is the underlying data architecture.

    So, if big data is to be trusted, we need people who can ensure that it is structured in a manner that will not cause it to produce spurious results.

    A Data Steward might be able to point out that results are spurious or incorrect but would be unable to determine why, or how to rectify them. What is needed is a Data Architect who can ensure that the data structures have the integrity to produce the correct results every time based on the data being queried.

    Because pragmatic Data Architects are now so thin on the ground compared to 20 years ago, enterprises everywhere are currently very likely to be making critical business decisions based on flawed inferences made by queries on Big Data.

    One major data black hole to which the whole industry remains blind is the violation of Fifth Normal Form (5NF), which is almost impossible to avoid when data sets from different sources are joined in queries. Even if the underlying data is 100% correct, violation of 5NF will always produce errors and no technology can detect or rectify these errors.

    A pragmatic Data Architect, on the other hand, would be able to ensure that the underlying architecture of the data was such that it would produce no such errors.

    The current lack of Data Architecture knowledge and skills is going to ensure that Big Data will produce some really Big Errors. The only question is, how big a catastrophe must occur before people are willing to admit that something is seriously wrong?

    Again, great article.
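
To make the join problem raised in the comment above concrete, here is a minimal sketch of how joining two individually correct data sets can produce rows that were never true. The agents, companies and products are made-up illustrative data, and `natural_join` is a hypothetical helper written for this example rather than a function from any library. It shows one simple case of a lossy join, which is the kind of problem join dependencies and 5NF are concerned with; it is not taken from the commenter's own work.

```python
# Hypothetical ground truth: which agent sells which product for which company.
facts = {
    ("Smith", "Ford", "car"),
    ("Smith", "GM", "truck"),
    ("Jones", "Ford", "truck"),
}

# Two separate "sources", each holding only a pairwise view of the facts.
# Every row in each source is individually correct.
agent_company = {(agent, company) for agent, company, _ in facts}
company_product = {(company, product) for _, company, product in facts}


def natural_join(left, right):
    """Join (agent, company) rows with (company, product) rows on company."""
    return {
        (agent, company, product)
        for agent, company in left
        for other_company, product in right
        if company == other_company
    }


joined = natural_join(agent_company, company_product)

# Rows that the join asserts but that were never in the original facts.
spurious = joined - facts

print(sorted(joined))
print(sorted(spurious))
```

Both sources are fully accurate, yet the join claims that Smith sells trucks for Ford and that Jones sells cars for Ford. In this toy case the spurious rows can be found only because the original facts are still available for comparison, which is rarely true when the sources were captured independently.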


  2. Hi Alan

    I agree.

    The link between successful governance and successful big data analytics was, largely, borne out by research run (independently) in the US and Europe - check out this post for links to the research http://dataqualitymatters.wordpress.com/2013/09/16/big-data-quality-matters/


  3. Hi John - totally agree WRT the involvement of the Data Architect (or someone with the skills to play that role). In an ideal world, this would be so.

    It's an additional role that goes into the mix - and one that I deliberately chose to leave out of the blog, as it would "muddy the water" in the argument of responding to the "leave the raw data alone" mantra noted in the originating SearchDataManagement article.

    Note that I am differentiating between *roles* and *individuals* - it's possible to encapsulate more than one role within the same person. You'd like to think (hope?!) that any Data Scientist worth their salt would understand the process of data design and be able to deal with any data structuring considerations as part of any data analytics project. I've worked with some very good statisticians who do work this way and build their analytical models "from the data up". I've also worked with some stinkers who just don't get it and are totally tools-bound.

    To find an individual that can play all three roles? That's not just a unicorn. That's a horse of a different colour...! ( http://oz.wikia.com/wiki/Horse_of_a_Different_Color )

  4. Thanks for the link Gary - there's definitely a strong body of opinion and plenty of research evidence linking analytic success to DQ effort (and failure when DQ is ignored).

    The Information Week BI & Analytics survey highlights DQ as the biggest barrier to successful Analytics: http://reports.informationweek.com/abstract/81/11715/Business-Intelligence-and-Information-Management/Research:-2014-Analytics,-BI,-and-Information-Management-Survey.html

  5. Adding additional comments from the associated LinkedIn thread:

    Thanks Emma. At best, I think any approach that does no pre-planning about the content quality is naive (at worst, could be considered negligent).

    Clearly, the BI/Analytic tool vendors have a vested interest in getting their products deployed quick-smart (that's when their license revenues hit, after all!), so no wonder they're pushing a message that says "to hell with data modeling & quality, we haven't got time for that!!"

    But most statistical analysts don't actually work that way anyway.

    I find that the opposite is often true - statisticians will often spend inordinate amounts of time formatting, structuring and manipulating data, correcting anomalies and eliminating outliers in order to "make sure the model is right". They typically spend 80% of their time on data prep.

    More power to them, says I!

    Except that the knowledge and understanding of the underlying data and all the inferred rules that have been applied hardly ever get documented, shared or replicated back into the source data, so all the good data diagnosis work is then lost. There's also the problem that while the Statto almost always believes that they fully understand the business problem, they may not have the same perspectives as the business users themselves (leading to confusion, miscommunication and disappointment).

    This is one of the reasons why I'm so excited to be advising QFire Software on their distributed data quality approach. QFire supports analysts who need to get going and just do some work on the data, while at the same time providing an environment where any DQ rules and the library of data sets can be captured, propagated and shared with colleagues. (They're also doing some interesting things in the area of data preparation & DQ for "Big Data".)

  6. A more optimistic perspective, presented by Shlomo Argamon at Information-Management.com: http://www.information-management.com/news/the-myth-of-the-mythical-unicorn-10025694-1.html

  7. Alan, thanks for referencing my article. I agree wholeheartedly with your analysis that we need a rigorous approach to data stewardship and contextualization; the growth of more and more automated analytics solutions makes this problem even greater, since more analysis can be done without deep understanding of what the data truly represent. I think that the notion of the "data scientist" as a kind of "shoot-from-the-hip" analyst (to exaggerate somewhat) isn't the right conception - the data scientist must also be capable of understanding the business and environmental context of the data and the questions to be answered. Certainly, different team members will have specific expertise in different parts of the problem, but all need to be able to communicate about and understand the larger context of the data science problem. Otherwise, the data scientist is no better than a (presumably) well-designed black-box bit of machine learning software.

    1. Thanks Shlomo - I think we've got a real challenge in our industry to move the practitioners beyond being technically oriented and towards having the skills to convey context and narrative based on the evidence. Sadly, I think we're a long way from that being pervasive - most BI and Analytic people I encounter are still largely acting as conduits and nothing more.

      This article from Ted Cuzzillo really resonated with me, but I don't reckon on there being too many story-tellers around (at least, not yet!): http://www.information-management.com/news/cue-the-data-storytellers-the-data-industrys-next-big-stars-10025687-1.html
