Do you need to choose between 3NF and Dimensional techniques?
In reply to a recent post on LinkedIn, I commented on the question of using 3rd Normal Form (3NF) vs Dimensional modelling as the underpinning design technique for the core data warehouse (often characterised as “Inmon vs Kimball”).
If I had a dollar for every time I’ve had a discussion with a data modeller or architect on this, then – well, lets just say I’d have a few dollars more than I’ve got right now. (Having kids didn’t help the bank balance either), There’s still a lot of emotional investment tied up in this, and some pretty entrenched positions, to the point where it can almost seem like religious warfare, rather than healthy debate. The architectural zealots are out in force, the modelling jihadists seek to impose one approach or another, and data dogma takes precedence over analytical enlightenment of rationality and pragmatism. (One time my colleague Matthew De George and I were discussing the collective nouns for different groups of things, and Matthew suggested that the term for a group of data architects should be an “argument”…)
And like most wars of doctrine, we’re really fighting about variations on the same theme, rather than fundamental differences of belief. (As the British satirical TV show Spitting Image once put it, “myGod is Bigger than Your God.”)
Now, when database technologies were still in their infancy with respect to supporting data warehousing functions (as distinct from OLTP applications), there was some degree of merit in choosing one approach or the other. Certain DBMS platforms were more suited to applying one approach or the other if you wanted to achieve a level of acceptable query performance.
So we've continued to spend the past twenty-odd years looking at the engineering of the corporate data factory [(c) Bill Inmon…] and there is absolutely still a place for a rigorous, robust, integrated, consistent and compliant point-of-access for reporting and analysis. Unless of course you’re a Kimballite (Kimballist? Kimballian?), and reckon on dealing with any data compliance issues that may arise with a devil-may-care nonchalance.
Except I don’t think it’s as simple as sticking to one approach or the other, at least not any more. The physical structuring of data isn’t nearly as important as it once was, at least, not in the first instance (Hadoop, anyone?!). Database technologies have moved on a whole heap in the last 20 years and the advent of the data warehouse appliance means that sheer technological brute force can derive excellent query performance without further ado.
But more to the point, many analytical scenarios actually only need an indicative level of accuracy to steer the business decision-making process - and in those cases, a quick-and-dirty "sandpit" delivery may well be more appropriate. If there's one thing I've learned in this data management game of ours, it's that delivering an early outcome gains goodwill and buys you the time to then go on and engineer a repeatable operational solution – you can do “quick and dirty” and “fully engineered” in parallel. (What I sometimes call the "do it twice, do it right" approach.)
So, what we should be giving our careful thought to is not so much the design method per se, but rather the delivery trade-offs that are necessary with respect to delivering iterations of the analytical solution quickly and responsively so that you reduce time-to-value, versus the costs to rigour, resilience and auditability (sometimes referred to as "technical debt"). This means delivering incrementally, being pragmatic, and avoiding introducing levels of complexity whenever possible.
In most circumstances, I'd say a hybrid approach is the way to go.
(As an example, I remember one situation a few years back when I was working for a UK telecoms retailer and service provider. Out of the blue, an entrepreneurial business came to market with a very cheap handset-and-calls bundle, marketed in a way that was clearly targeted at eating our lunch. Our usual EDW process would have taken about six weeks to engineer the scenario models that we needed to evaluate the impact of this threat. Instead, we dumped an ad hoc copy of our call records, re-processed them based on the competitor's new tariff, and responded with a lower-priced counter-offer to our customers that meant we retained the customer base and killed the competitor's move stone dead. We then took what we’d learned from dealing with the one-off scenario to engineer a tariff modelling solution which enabled us to deliver what-if analysis as a standard function of the warehouse.)
Also, don't forget that the techniques of 3NF and dimensional modelling are not only about delivering a physical design!
In this respect, I think sometimes that Ralph Kimball has actually done us all a bit of a dis-service in conflating both logical modelling and physical modelling processing into a single step.
Logical modelling of data is about developing a rigorous and shared understanding of the nature of your data – it’s semantic meaning - and Bill Inmon still makes an excellent case for the need for structuredapproach to deriving this understanding. Both normalisation and dimensional techniques are vital to properly inform, analyse and communicate a shared and rigorous intellectual understanding of your data. Therefore BOTH methods need to be applied to your business scenario, before you then make any decisions about the design approach that you will adopt within physical data management environment.
It is a separate step to then make the physical design choices about how to engineer your data structures, and the decisions will be entirely dependent upon the type of products used within your solution.