Tuesday, July 3, 2012

Write What You Mean First

We recently had the pleasure of a week-long visit from Dave Rodabaugh of Atlas Analytics, during which Josh, Dave, and I got to debate the appropriate design patterns in our ETL processes.  My intent is to be able to periodically perform a full load of our data warehouse in case we change some of our processing logic.  This is particularly pertinent to our logic in grouping claim filings into encounters and matching remittances to their originating claims.  As Dave put it, “full loads will occur far more often than anticipated,” and, well, I anticipate performing full loads with some regularity.

Just our current incremental loads take several hours to load a day’s worth of claims.  (I sometimes joke that it takes 25 hours to load 24 hours’ worth of claims.)  Since our full data source contains many years of claims, a full data load seems almost out of the question.  However, both Josh and I do share the confidence that we can come up with a feasible way to perform a full load of our data warehouse.

So Dave is explaining the standard ETL design pattern of processing dimensions separately from your facts.  There is a fair bit of elegance and simplicity to this approach, but I raised the objection that, with the quantities of data we are dealing with, multiple scans through such massive data sets looking for unique dimension values is prohibitively wasteful.  Josh countered that our limiting factor is write speed, and that read speed is negligible.  In general, he is right, but I felt that in this specific case, we may be scanning one or two billion rows to find just a thousand or two unique values to put in a dimension table, and doing so for each dimension will multiply our processing time.

Dave Rodabaugh, ever the diplomat, was able to phrase my position succinctly, but also in such a way that caused me to acquiesce to the suggestion that we process dimensions separate from (and prior to) our fact flow.  We all agreed that there were optimizations that could be done so that the situation was not quite so dire as I predicted, but to paraphrase Dave:

If the concern is that the time it takes to scan through the data will be the biggest time sink, then it makes sense to handle the data less often.  In such a case, performing dimension processing within the fact dataflow makes sense.

That’s a pretty heavy paraphrase, since I do not have the gift of oratory that Dave has.  However, notice how he couched his agreement with me in a condition: If the concern is that the time it takes to scan through the data will be the biggest time sink.  Well, when someone makes any statement of fact, I have the kind of argumentative mind that wants to find holes in the argument.  In this case, examining the condition is the obvious place to start, and I realized that I was making an assumption in my position.  My assumption was not without basis — it was based on my experiences dealing with our current data warehouse — but it was not based on any data.  The title of this blog is “Software Science,” and here I am taking a wholly unscientific approach to the design of my ETL processes.  Even worse, I am constantly a proponent of the philosophy Write what you mean first, and tune only when required.

I am not the first to come up with that opinion.  Forty years ago, Donald Knuth expressed a similar sentiment: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil” [Knuth 1974].  Hearing my position phrased this way, however, made me realize that I was optimizing prematurely.

So today we took our first cut at a full data load: loading two dimensions and a single measure.  Not much, but it will give us a lower bound as to how long our full data load may take.  To my very pleasant surprise, two SELECT DISTINCT’s on a couple of hundred million rows to produce a few thousand distinct values took less than 30 seconds!  This serves as a great reminder to me to:

Write what you mean first.

No comments:

Post a Comment