Software Science: A Data-Driven Date Dimension

A date dimension is usually seeded with all possible valid dates, and any fact with a date outside of that range is assigned an unknown, invalid, or out-of-range dimension key (very often –1). However, we are building not only a reporting data warehouse, but also an operational data store (ODS) that will be used for enterprise application integration (EAI). Therefore, we need to retain values that were received out of range, even if they are clearly wrong or invalid. For instance, a transaction dated in 2092 (and, yes, we do have some; most likely, the user meant 2002 or 2012) must record that in our ODS so we can show it to a user, even if it was obviously a typographical error. It is not our job to correct these errors, but to show what happened. It is valuable information to someone to see that date on a report and investigate what happened so it can be corrected.

We have seeded a wide range of dates (2000-2013) in our dimension table, but we can never seed every possible date that might come in, even if we were to seed ever date from January 1, 1 AD to December 31, 9999. What would we then do with a transaction date of February 29, 2011? Or March 99, 2012? Some of these dates come in on ANSI X12 transactions which generally mandate a YYYYMMDD format for dates; but what do we do when such a transaction comes in with a date of “ABCDEFGH”? We cannot reject the entire transaction out of hand due to an error in one field, and yet, if someone wants to know what was in that particular field of that particular transaction, we should show them so they can go and complain to whoever is producing such invalid dates.

Our solution is to treat date like any other data-driven dimension: if a date comes in that is not yet in the dimension table, add it.

We are currently doing this in our initial entry stage: we scan the data source and select all distinct date values. In SSIS, we then perform a lookup of the date value in the date dimension table, directing all unmatched values to a destination which is the date dimension table. This will ensure that whatever date appears in the subsequent fact data flow will already be populated in the dimension table. We retain the fallback of assigning –1 to any date value in the fact flow that is not found in the dimension, but this should never happen, and we can audit for any –1 date keys to let us know that something in our data warehouse has gone wrong.

One of my concerns with this approach was performance. How long does it take to scan the hundreds of millions of rows in the data source to pick out the distinct values? I thought it might be an hour or more, but figured if it came back in 10 or 15 minutes we could work with it. This was our first iteration, so we would be modeling our processes for other data-driven dimensions off this one, so we would expect to be doing this unique scan for every dimension. I kicked off the query (a SELECT DISTINCT … UNION SELECT DISTINCT …), to run it in the background to see, and it came back in 28 seconds! I ran it again to make sure, and it came back in even less the second time. I was amazed that our source database could scan over 260 million rows in under 30 seconds. Without indexes on those columns, by the way, so it had to have been doing a full table scan. (I do not have explain plan permissions in our source systems to confirm.)

With performance like this, we are easily able to perform the necessary SELECT DISTINCT scans to populate our dimensions prior to pulling in the facts. I had assumed that this approach would have been been prohibitively slow, but as I explained in a previous blog post, “Write What You Mean First”, rather than assume, test.

Software Science

Saturday, July 7, 2012

A Data-Driven Date Dimension

1 comment: