Software Science: Long and Thin: Iteration 1

Josh and I completed our first iteration of our data warehouse today! We started Tuesday, took Wednesday off for the Independence Day holiday, and then completed it this afternoon. All told, we probably only put four or so hours of pair programming into it, for a total of eight man-hours. And we have a working data warehouse!

Granted, it only has one measure (two if you count the bonus “count” measure SSAS throws in for free) and two dimensions (one if you think it’s double counting to use the same dimension in two different roles), but it does contain over a quarter of a billion fact rows, and loads in under two hours. If this were our final product, we could get the load down to a few minutes (straight from source to SSAS), but we knew this isn’t our final product, so we built some additional steps into our data warehouse to accommodate our future plans for expansion.

It is not uncommon for data warehouses to land and stage their data in multiple steps along the way to a unified dimensional model (UDM). Indeed, I expect that it is very rare for any data warehouse of even medium complexity to not employ at least one intermediate stage on the way to the UDM. In our case we are not only data warehousing, but also facilitating enterprise application integration (EAI) throughout our parent organization. We therefore stage our data in an operational data store (ODS) that will serve as the hub for our EAI processes. The ODS will be a cleansed and enhanced normalized relational representation of the data using conformed dimensions for many of its foreign keys. The use of the conformed dimensions will be key to standardizing across both operational applications that consume the ODS and the enterprise reporting. We therefore built our first data warehouse iteration with four staging steps associated with four staging steps we know we will have in our final data warehouse:

An initial staging area for assembly of conformed dimensions prior to loading the ODS.
The operational data store.
A staging area for data being assembled between the ODS and UDM.
A relational database containing the unified dimensional model, which will be consumed directly, without modification, by SSAS.

This means we copied and stored our entire data set (260-odd million rows) four times on its way to the MOLAP cube. Writing to these databases was by far the bulk of the time. We could copy the data between stages in less than half an hour each. In this our first iteration, the data in each stage was identical, so the only real processing was done in the first stage, which created the conformed dimension tables and performed the foreign key lookups for the facts. Still, the time difference between that first stage and the others was negligible.

Amazingly enough, once the data was in our UDM database, SSAS was able to read the full database and process it into a MOLAP cube in around six minutes!

SSIS does not by default display start and end times of its packages. In our first iteration, we have not yet set up logging and timing infrastructure for our processes — we were running them manually through the debugger — so we do not have exact timings. We do, however, want records, logs, and timings so we can evaluate the impact of our future decisions on the behavior of our data warehouse process, so it is likely that out second iteration, rather than adding data, will add this infrastructure.

Honestly, I am looking forward to our next iteration, and I have not had such enthusiasm for our data warehouse since I took it over six months ago. Dave Rodabaugh’s advice was excellent: go long and thin. Meaning iterate small incremental improvements through the entire assembly line rather than engineering to perfection each stage in a waterfall-like process. Currently, we have no idea how long our data warehouse rewrite process will take (well, we do have our guesses), but we expect to have several iterations under our belts in short order (4-6 weeks), and so we will then have an idea as to how long it takes to do certain things. We should then be able to roadmap out the things we do want to do and attach time estimates to them. These estimates will not be perfect (“prediction is very difficult, especially about the future”), but they will be a whole lot better than our current guesswork.

3 comments:

TheJoshStephensJuly 6, 2012 at 8:37 PM
Our perception of the time we were pair programming was amusing. My blog states 2 hours of pair programming and yours states 4. Perhaps your watch was off?
UnknownJuly 7, 2012 at 1:58 PM
@Josh: I'm probably counting my non-pair time in my estimate. I spent some time getting the Visual Studio database project put together and I had ported the date dimension table and seed procedure before we got together. Also, after you left, I was too eager to wait for us to get together again, so I pushed on to the final goal, populating the UDM, and creating and processing the cube. None of those were major undertakings, but an hour here and an hour there adds up.
TheJoshStephensJuly 8, 2012 at 9:15 PM
Heh. That's what I figured you were doing.
I finished chapter 4 regarding SCDs. Interesting read. Made me wonder if we could use type 3 scd's for identifying previous filings of the same claim. I just saw your new web log entry and am going to go read it.

Regards
Josh

Thursday, July 5, 2012

Long and Thin: Iteration 1

3 comments: