Do you need to choose between 3NF and Dimensional techniques?
In reply to a recent post on LinkedIn, I commented on the question of using 3rd Normal Form (3NF) vs Dimensional modelling as the underpinning design technique for the core data warehouse (often characterised as “Inmon vs Kimball”).
If I had a dollar for every time I’ve had a discussion with a data modeller or architect on this, then – well, let’s just say I’d have a few dollars more than I’ve got right now. (Having kids didn’t help the bank balance either.) There’s still a lot of emotional investment tied up in this, and some pretty entrenched positions, to the point where it can almost seem like religious warfare rather than healthy debate. The architectural zealots are out in force, the modelling jihadists seek to impose one approach or another, and data dogma takes precedence over the analytical enlightenment of rationality and pragmatism. (One time my colleague Matthew De George and I were discussing the collective nouns for different groups of things, and Matthew suggested that the term for a group of data architects should be an “argument”…)
And like most wars of doctrine, we’re really fighting about variations on the same theme, rather than fundamental differences of belief. (As the British satirical TV show Spitting Image once put it, “My God is Bigger than Your God.”)
Now, when database technologies were still in their infancy with respect to supporting data warehousing functions (as distinct from OLTP applications), there was some degree of merit in choosing one approach or the other. Certain DBMS platforms were better suited to one technique than the other if you wanted to achieve acceptable query performance.
So we've spent the past twenty-odd years looking at the engineering of the corporate data factory [(c) Bill Inmon…], and there is absolutely still a place for a rigorous, robust, integrated, consistent and compliant point-of-access for reporting and analysis. Unless of course you’re a Kimballite (Kimballist? Kimballian?) and reckon on dealing with any data compliance issues that may arise with devil-may-care nonchalance.
Except I don’t think it’s as simple as sticking to one approach or the other, at least not any more. The physical structuring of data isn’t nearly as important as it once was, at least not in the first instance (Hadoop, anyone?!). Database technologies have moved on a whole heap in the last 20 years, and the advent of the data warehouse appliance means that sheer technological brute force can deliver excellent query performance without further ado.
But more to the point, many analytical scenarios actually only need an indicative level of accuracy to steer the business decision-making process - and in those cases, a quick-and-dirty "sandpit" delivery may well be more appropriate. If there's one thing I've learned in this data management game of ours, it's that delivering an early outcome gains goodwill and buys you the time to then go on and engineer a repeatable operational solution – you can do “quick and dirty” and “fully engineered” in parallel. (What I sometimes call the "do it twice, do it right" approach.)
So, what we should be giving our careful thought to is not so much the design method per se, but rather the delivery trade-off: iterating the analytical solution quickly and responsively to reduce time-to-value, versus the cost to rigour, resilience and auditability (sometimes referred to as "technical debt"). This means delivering incrementally, being pragmatic, and avoiding unnecessary complexity wherever possible.
In most circumstances, I'd say a hybrid approach is the way to go.
(As an example, I remember one situation a few years back when I was working for a UK telecoms retailer and service provider. Out of the blue, an entrepreneurial business came to market with a very cheap handset-and-calls bundle, marketed in a way that was clearly targeted at eating our lunch. Our usual EDW process would have taken about six weeks to engineer the scenario models that we needed to evaluate the impact of this threat. Instead, we dumped an ad hoc copy of our call records, re-processed them based on the competitor's new tariff, and responded with a lower-priced counter-offer to our customers that meant we retained the customer base and killed the competitor's move stone dead. We then took what we’d learned from dealing with the one-off scenario to engineer a tariff modelling solution which enabled us to deliver what-if analysis as a standard function of the warehouse.)
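To give a feel for that kind of ad hoc "sandpit" re-rating, here is a minimal sketch in Python. The tariff figures, column names and flat-fee-plus-per-minute structure are all assumptions made up for the illustration, not the actual data or tariffs involved at the time:

```python
# A rough sketch of re-rating call records against a hypothetical competitor tariff.
# All figures and column names (duration_min, current_charge) are illustrative only.
import pandas as pd

# Ad hoc dump of call records (in practice, an extract pulled from the warehouse).
calls = pd.DataFrame({
    "customer_id": [101, 101, 202, 303],
    "duration_min": [12.0, 3.5, 45.0, 7.2],
    "current_charge": [1.44, 0.42, 5.40, 0.86],  # what we billed under our own tariff
})

# Hypothetical competitor bundle: flat monthly fee plus a cheap per-minute rate.
COMPETITOR_MONTHLY_FEE = 10.00
COMPETITOR_PER_MINUTE = 0.05

# Re-rate each customer's usage as if it had been billed on the competitor's tariff.
per_customer = calls.groupby("customer_id").agg(
    minutes=("duration_min", "sum"),
    our_revenue=("current_charge", "sum"),
)
per_customer["competitor_cost"] = (
    COMPETITOR_MONTHLY_FEE + per_customer["minutes"] * COMPETITOR_PER_MINUTE
)
per_customer["customer_saving"] = per_customer["our_revenue"] - per_customer["competitor_cost"]

# Customers who would save by switching are the ones a counter-offer needs to target.
at_risk = per_customer[per_customer["customer_saving"] > 0]
print(at_risk.sort_values("customer_saving", ascending=False))
```

The point is the turnaround: a throwaway script like this answers the immediate question in days rather than weeks, and the logic can then be hardened into a properly engineered what-if function of the warehouse afterwards.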
Also, don't forget that the techniques of 3NF and dimensional modelling are not only about delivering a physical design!
In this respect, I sometimes think that Ralph Kimball has actually done us all a bit of a disservice by conflating logical modelling and physical modelling into a single step.
Logical modelling of data is about developing a rigorous and shared understanding of the nature of your data – its semantic meaning – and Bill Inmon still makes an excellent case for the need for a structured approach to deriving this understanding. Both normalisation and dimensional techniques are vital to properly inform, analyse and communicate a shared and rigorous intellectual understanding of your data. Therefore BOTH methods need to be applied to your business scenario before you then make any decisions about the design approach that you will adopt within the physical data management environment.
It is a separate step to then make the physical design choices about how to engineer your data structures, and the decisions will be entirely dependent upon the type of products used within your solution.
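To make that distinction concrete, here is a minimal sketch (table and column names are illustrative, not drawn from any real model) of the same sales information expressed once as a normalised, 3NF-style set of entities and once as a dimensional star schema. Both capture the same logical understanding; choosing between the physical shapes is the separate engineering decision:

```python
# Illustrative only: the same sales data laid out two ways.
import sqlite3

conn = sqlite3.connect(":memory:")

# --- 3NF-style layout: each business fact held once, related by keys ---
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE product  (product_code TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_code TEXT REFERENCES product(product_code),
    sale_date TEXT,
    quantity INTEGER,
    amount REAL
);
""")

# --- Dimensional layout: a central fact table surrounded by denormalised,
# --- read-only dimensions keyed on surrogate keys (slowly changing where needed)
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_code TEXT, product_name TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL
);
""")

conn.close()
```

Neither layout adds or loses meaning; what changes is how your chosen platform will join, index and scan the data, which is exactly why the choice belongs in the physical design step.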
Amen.
Dimensional modelling is no less rigorous than 3NF CIF. Dimensional models are in 3NF if fully snowflaked, at least 2NF in the slowly changing dimensions, and the facts are in 3NF anyway.
Fully conformed dimensions are fed through an integration layer that holds temporal 3NF to remove redundancy, and then fed into read-only dimensions that the user cannot change. It's not OLTP!!! The fact (pardon the pun) that dimensions are 2NF does not warrant the fear and trepidation one would have with an OLTP system, since these are read-only tables that, yes, may have dependencies between non-key attributes – but this is just for presenting the same information (like product code and product name). The real sources of data are real-world business processes, and modelling from business processes gives the highest chance of meaning not being lost – as opposed to modelling the BPs into some abstract data model to make modellers and IT people happy, then hoping that when you unwind it in a report at the other end it still means the same thing! Inmonites are becoming dinosaurs.
Pragmatism in all things. If you don't understand both the Relational AND Dimensional approaches to data modelling, then you shouldn't be doing data modelling! (And that's not even taking into account some of the new ideas coming out with the Data Vault approach.)
The Kimball vs Inmon data warehouse architecture debate is clearly a related topic.
The overall watch-word here is to understand all the rules before you start to break them...