Do you need to choose between 3NF and Dimensional techniques?
In reply to a recent post on LinkedIn, I commented on the question of using 3rd Normal Form (3NF) vs Dimensional modelling as the underpinning design technique for the core data warehouse (often characterised as “Inmon vs Kimball”).
If I had a dollar for every time I’ve had a discussion with a data modeller or architect on this, then – well, let’s just say I’d have a few dollars more than I’ve got right now. (Having kids didn’t help the bank balance either.) There’s still a lot of emotional investment tied up in this, and some pretty entrenched positions, to the point where it can almost seem like religious warfare rather than healthy debate. The architectural zealots are out in force, the modelling jihadists seek to impose one approach or another, and data dogma takes precedence over the analytical enlightenment of rationality and pragmatism. (One time my colleague Matthew De George and I were discussing the collective nouns for different groups of things, and Matthew suggested that the term for a group of data architects should be an “argument”…)
And like most wars of doctrine, we’re really fighting about variations on the same theme, rather than fundamental differences of belief. (As the British satirical TV show Spitting Image once put it, “My God is Bigger than Your God.”)
Now, when database technologies were still in their infancy with respect to supporting data warehousing functions (as distinct from OLTP applications), there was some degree of merit in choosing one approach or the other. Certain DBMS platforms were better suited to one technique than the other if you wanted to achieve acceptable query performance.
So we've spent the past twenty-odd years looking at the engineering of the corporate data factory [(c) Bill Inmon…], and there is absolutely still a place for a rigorous, robust, integrated, consistent and compliant point-of-access for reporting and analysis. Unless of course you’re a Kimballite (Kimballist? Kimballian?) and reckon on dealing with any data compliance issues that may arise with devil-may-care nonchalance.
Except I don’t think it’s as simple as sticking to one approach or the other, at least not any more. The physical structuring of data isn’t nearly as important as it once was, at least not in the first instance (Hadoop, anyone?!). Database technologies have moved on a whole heap in the last 20 years, and the advent of the data warehouse appliance means that sheer technological brute force can deliver excellent query performance without further ado.
But more to the point, many analytical scenarios actually only need an indicative level of accuracy to steer the business decision-making process - and in those cases, a quick-and-dirty "sandpit" delivery may well be more appropriate. If there's one thing I've learned in this data management game of ours, it's that delivering an early outcome gains goodwill and buys you the time to then go on and engineer a repeatable operational solution – you can do “quick and dirty” and “fully engineered” in parallel. (What I sometimes call the "do it twice, do it right" approach.)
So, what we should be giving our careful thought to is not so much the design method per se, but rather the delivery trade-off: iterating the analytical solution quickly and responsively to reduce time-to-value, versus the cost to rigour, resilience and auditability (sometimes referred to as "technical debt"). This means delivering incrementally, being pragmatic, and avoiding unnecessary complexity wherever possible.
In most circumstances, I'd say a hybrid approach is the way to go.
(As an example, I remember one situation a few years back when I was working for a UK telecoms retailer and service provider. Out of the blue, an entrepreneurial business came to market with a very cheap handset-and-calls bundle, marketed in a way that was clearly targeted at eating our lunch. Our usual EDW process would have taken about six weeks to engineer the scenario models that we needed to evaluate the impact of this threat. Instead, we dumped an ad hoc copy of our call records, re-processed them based on the competitor's new tariff, and responded with a lower-priced counter-offer to our customers that meant we retained the customer base and killed the competitor's move stone dead. We then took what we’d learned from dealing with the one-off scenario to engineer a tariff modelling solution which enabled us to deliver what-if analysis as a standard function of the warehouse.)
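To give a feel for that kind of ad hoc "sandpit" re-rating, here is a minimal sketch in Python. The tariff figures, column names and flat-fee-plus-per-minute structure are all assumptions made up for the illustration, not the actual data or tariffs involved at the time:

```python
# A rough sketch of re-rating call records against a hypothetical competitor tariff.
# All figures and column names (duration_min, current_charge) are illustrative only.
import pandas as pd

# Ad hoc dump of call records (in practice, an extract pulled from the warehouse).
calls = pd.DataFrame({
    "customer_id": [101, 101, 202, 303],
    "duration_min": [12.0, 3.5, 45.0, 7.2],
    "current_charge": [1.44, 0.42, 5.40, 0.86],  # what we billed under our own tariff
})

# Hypothetical competitor bundle: flat monthly fee plus a cheap per-minute rate.
COMPETITOR_MONTHLY_FEE = 10.00
COMPETITOR_PER_MINUTE = 0.05

# Re-rate each customer's usage as if it had been billed on the competitor's tariff.
per_customer = calls.groupby("customer_id").agg(
    minutes=("duration_min", "sum"),
    our_revenue=("current_charge", "sum"),
)
per_customer["competitor_cost"] = (
    COMPETITOR_MONTHLY_FEE + per_customer["minutes"] * COMPETITOR_PER_MINUTE
)
per_customer["customer_saving"] = per_customer["our_revenue"] - per_customer["competitor_cost"]

# Customers who would save by switching are the ones a counter-offer needs to target.
at_risk = per_customer[per_customer["customer_saving"] > 0]
print(at_risk.sort_values("customer_saving", ascending=False))
```

The point is the turnaround: a throwaway script like this answers the immediate question in days rather than weeks, and the logic can then be hardened into a properly engineered what-if function of the warehouse afterwards.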
Also, don't forget that the techniques of 3NF and dimensional modelling are not only about delivering a physical design!
In this respect, I sometimes think that Ralph Kimball has actually done us all a bit of a disservice by conflating logical modelling and physical modelling into a single step.
Logical modelling of data is about developing a rigorous and shared understanding of the nature of your data – its semantic meaning – and Bill Inmon still makes an excellent case for the need for a structured approach to deriving this understanding. Both normalisation and dimensional techniques are vital to properly inform, analyse and communicate a shared and rigorous intellectual understanding of your data. Therefore BOTH methods need to be applied to your business scenario before you then make any decisions about the design approach that you will adopt within the physical data management environment.
It is a separate step to then make the physical design choices about how to engineer your data structures, and the decisions will be entirely dependent upon the type of products used within your solution.
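To make that distinction concrete, here is a minimal sketch (table and column names are illustrative, not drawn from any real model) of the same sales information expressed once as a normalised, 3NF-style set of entities and once as a dimensional star schema. Both capture the same logical understanding; choosing between the physical shapes is the separate engineering decision:

```python
# Illustrative only: the same sales data laid out two ways.
import sqlite3

conn = sqlite3.connect(":memory:")

# --- 3NF-style layout: each business fact held once, related by keys ---
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE product  (product_code TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_code TEXT REFERENCES product(product_code),
    sale_date TEXT,
    quantity INTEGER,
    amount REAL
);
""")

# --- Dimensional layout: a central fact table surrounded by denormalised,
# --- read-only dimensions keyed on surrogate keys (slowly changing where needed)
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_code TEXT, product_name TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL
);
""")

conn.close()
```

Neither layout adds or loses meaning; what changes is how your chosen platform will join, index and scan the data, which is exactly why the choice belongs in the physical design step.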
Amen.
Dimensional modelling is no less rigorous than 3NF CIF. Dimensional models are in 3NF if fully snowflaked, at least 2NF in the slowly changing dimensions, and the facts are in 3NF anyway.
Fully conformed dimensions are fed through an integration layer that holds temporal 3NF to remove redundancy, and then fed into read-only dimensions that the user cannot change. It's not OLTP!!! The fact (pardon the pun) that dimensions are 2NF does not warrant the fear and trepidation one would have with an OLTP system, since these are read-only tables that, yes, may have dependencies between non-key attributes – but this is just for presenting the same information (like product code and product name). The real sources of data are real-world business processes, and modelling from business processes gives the highest chance of meaning not being lost – as opposed to modelling the BPs into some abstract data model to make modellers and IT people happy, then hoping that when you unwind it in a report at the other end it still means the same thing! Inmonites are becoming dinosaurs.
Pragmatism in all things. If you don't understand both the Relational AND Dimensional approaches to data modelling, then you shouldn't be doing data modelling! (And that's not even taking into account some of the new ideas coming out with the Data Vault approach.)
The Kimball vs Inmon data warehouse architecture debate is clearly a related topic.
The overall watch-word here is to understand all the rules before you start to break them...