Friday, July 27, 2012

Impractical SSIS Configurations

If you are developing SSIS packages for use in an enterprise, it is likely that you have different (and possibly multiple) development and production environments.  When you deploy a package to production, therefore it is necessary to change all your connections and other settings from their development values to production.  Without SSIS configurations, your workflow would be to load the package into BIDS, change all the connections and variables, save the modified package and then deploy it to the production environment.  In many of our ETL packages, there can be half a dozen connections that need to be changed, and we may have 20 or so SSIS packages, so we’re talking on the order of 100 changes that need to be made to release an updated ETL process.

This is the problem configurations in SSIS are intended to solve.

But they don’t.

I don’t think our requirements are unusual or unreasonable: manage sets of connections and variable settings.  However, here is some of the things we tried which didn’t really seem practical.

  1. Use XML configuration files, with different versions deployed to the same location on the development and production servers.

    Problem 1: Unless *all* your SSIS packages use the same set of connections and variables, you will require a different configuration file for each SSIS package. Failure to have *only* the correct settings in the configuration file (i.e., spurious settings in your configuration file) will result in a myriad of errors and warnings at design time and/or execution time.  (I forget precisely which, as I am writing this on an iPad somewhere over the Pacific Ocean between Tahiti and Auckland, but if memory serves me, it was errors at run time.). You might think you could mitigate this problem by creating *every* connection in *every* package regardless of whether it is used or not, but (1) this is an awkward hack, and (2) when you add a new connection to one package, you now have to add it to every package.

    Problem 2: Because the configuration file is server-wide, this approach only allows you to have one configuration per server.  However, this is impractical when you might have your regular ETL process, but you may also have the need for additional processes to run in parallel.  One example of this would be when your regular process does an incremental load, but you also want to run a full load.  Since the full load may take a considerable amount of time (possibly several days), you want to run the full load in parallel with the incremental load and into different databases until the full load is complete, at which time you will make the newly fully-loaded database the active production database.  This situation is exacerbated by the fact that configuration files are not overridden by command line options — configuration files are applied after command line options — so it is impossible to correct this problem.  Using environment variables to identify the appropriate configuration file also does not alleviate the situation, since environment variables, too, are server-wide, so we are still restricted to a single configuration per server.
  2. Use database-stored configuration settings.

    This approach suffers the same as problem 1 with XML configuration files: we cannot have a single generic set of settings in our database table, because that would require every package to have every database connection.  Luckily, with a database table, we have at our full disposal the tools of SQL Server to assist us.

    Our first thought was to have a table (or set of tables) describing the settings and connections for each package, and then define a view that contains all the requisite settings. We are thwarted, however, by BIDS, which won't accept a view in place of the settings table. (I believe it works OK at run time, just BIDS doesn't like it at design time.) While we could have created update and insert triggers on the view to try to fool BIDS into thinking our view is a table, we would rather just have a physical table and create a stored procedure that populates it (probably with a MERGE statement), since this gives us some added flexibility in case a strict view will not work (or would be unwieldy), and also we did not know for sure whether we could fool BIDS into thinking the view was a table.

So our final idea, as yet untried, is to use a set of tables describing the connections and settings in our packages, and create a stored procedure that populates an SSIS configuration table with the appropriate values. To accommodate our requirement to have multiple configurations on the same server, we may have multiple ETL settings databases with the difference sets of setting values.

This ETL database will also likely be used for other command and control requirements of our ETL processes.

Monday, July 16, 2012

References to Other Databases in Visual Studio Database Projects

We recently decided to move our maintenance of our database schemas from a pile-of-scripts to Visual Studio 2010's database projects, so I created a database project and imported its objects from our existing Development database.  The resulting project had thousands of errors and warnings.

The significant majority of the errors were due to cross-database references in stored procedures and functions. Here is how we resolved them:
  1. The first type of error was a reference to a project that I intended to bring into our solution anyway. I created a new project in the same solution and then created a database reference from the first database to the second. Since they were on the same server, and since we, at this time, have no intention of changing the database names, I just set the database name to a literal in the reference options. This obviated my having to replace references to the database with a variable.

  2. The second type of error was for a database that we have no control over and that we had no intention of including in our solution. To solve this problem, we had to create a .dbschema file. To do this, we used the vsdbcmd command line tool.

    The vsdbcmd command line tool resides in C:\Program Files\Microsoft Visual Studio 10.0\VSTSDB\Deploy\ on a 32-bit Windows XP system, and the syntax to generate the .dbschema file is as follows:
    "C:\Program Files\Microsoft Visual Studio 10.0\VSTSDB\Deploy\vsdbcmd" /Action:Import /ConnectionString:"Data Source=«server name»;Initial Catalog=«database name»;Integrated Security=True" /DatabaseSchemaProvider:SQL /ModelFile:«output filename».dbschema
    Once the .dbschema file is created, a database reference was set up to reference the file.
These steps eliminated about half the errors, but there were still about 1,500 remaining. Some of these were the result of obsolete stored procedures and functions referencing obsolete object. These could safely be removed from the project, since they clearly were not being used.

Another class of errors was when an object referenced another object in the same database with the full three-part object name (ie., database name, schema, object name).  Visual Studio complains about this because it allows you to rename the database at deployment time. However, if your scripts have hard-code the database name, this will cause problems. The solution to this is simply to remove the database name from the object reference.

Finally, I discovered that I had references in database 1 to objects in database 2, and vice versa. Visual Studio does not permit database 1 to reference database 2 and database 2 reference database 1.  It considers this a circular reference, even though it is different objects in each database referencing and being referenced by the other database.  The solution we used for this was to suppress certain warnings at the file level.  To do this, right-click on the file in the Solution Explorer and choose “Properties” (or press Alt-Enter) and set the value of the “Suppress Warnings” property to the warning number (or numbers, comma-separated) you wish to ignore.

Saturday, July 7, 2012

A Data-Driven Date Dimension

A date dimension is usually seeded with all possible valid dates, and any fact with a date outside of that range is assigned an unknown, invalid, or out-of-range dimension key (very often –1).  However, we are building not only a reporting data warehouse, but also an operational data store (ODS) that will be used for enterprise application integration (EAI).  Therefore, we need to retain values that were received out of range, even if they are clearly wrong or invalid.  For instance, a transaction dated in 2092 (and, yes, we do have some; most likely, the user meant 2002 or 2012) must record that in our ODS so we can show it to a user, even if it was obviously a typographical error.  It is not our job to correct these errors, but to show what happened.  It is valuable information to someone to see that date on a report and investigate what happened so it can be corrected.

We have seeded a wide range of dates (2000-2013) in our dimension table, but we can never seed every possible date that might come in, even if we were to seed ever date from January 1, 1 AD to December 31, 9999.  What would we then do with a transaction date of February 29, 2011?  Or March 99, 2012?  Some of these dates come in on ANSI X12 transactions which generally mandate a YYYYMMDD format for dates; but what do we do when such a transaction comes in with a date of “ABCDEFGH”?  We cannot reject the entire transaction out of hand due to an error in one field, and yet, if someone wants to know what was in that particular field of that particular transaction, we should show them so they can go and complain to whoever is producing such invalid dates.

Our solution is to treat date like any other data-driven dimension: if a date comes in that is not yet in the dimension table, add it.

We are currently doing this in our initial entry stage: we scan the data source and select all distinct date values.  In SSIS, we then perform a lookup of the date value in the date dimension table, directing all unmatched values to a destination which is the date dimension table.  This will ensure that whatever date appears in the subsequent fact data flow will already be populated in the dimension table.  We retain the fallback of assigning –1 to any date value in the fact flow that is not found in the dimension, but this should never happen, and we can audit for any –1 date keys to let us know that something in our data warehouse has gone wrong.

One of my concerns with this approach was performance.  How long does it take to scan the hundreds of millions of rows in the data source to pick out the distinct values?  I thought it might be an hour or more, but figured if it came back in 10 or 15 minutes we could work with it.  This was our first iteration, so we would be modeling our processes for other data-driven dimensions off this one, so we would expect to be doing this unique scan for every dimension.  I kicked off the query (a SELECT DISTINCT … UNION SELECT DISTINCT …), to run it in the background to see, and it came back in 28 seconds!  I ran it again to make sure, and it came back in even less the second time. I was amazed that our source database could scan over 260 million rows in under 30 seconds.  Without indexes on those columns, by the way, so it had to have been doing a full table scan.  (I do not have explain plan permissions in our source systems to confirm.)

With performance like this, we are easily able to perform the necessary SELECT DISTINCT scans to populate our dimensions prior to pulling in the facts.  I had assumed that this approach would have been been prohibitively slow, but as I explained in a previous blog post, “Write What You Mean First”, rather than assume, test.

Thursday, July 5, 2012

Long and Thin: Iteration 1

Josh and I completed our first iteration of our data warehouse today!  We started Tuesday, took Wednesday off for the Independence Day holiday, and then completed it this afternoon.  All told, we probably only put four or so hours of pair programming into it, for a total of eight man-hours.  And we have a working data warehouse!

Granted, it only has one measure (two if you count the bonus “count” measure SSAS throws in for free) and two dimensions (one if you think it’s double counting to use the same dimension in two different roles), but it does contain over a quarter of a billion fact rows, and loads in under two hours.  If this were our final product, we could get the load down to a few minutes (straight from source to SSAS), but we knew this isn’t our final product, so we built some additional steps into our data warehouse to accommodate our future plans for expansion.

It is not uncommon for data warehouses to land and stage their data in multiple steps along the way to a unified dimensional model (UDM).  Indeed, I expect that it is very rare for any data warehouse of even medium complexity to not employ at least one intermediate stage on the way to the UDM.  In our case we are not only data warehousing, but also facilitating enterprise application integration (EAI) throughout our parent organization.  We therefore stage our data in an operational data store (ODS) that will serve as the hub for our EAI processes.  The ODS will be a cleansed and enhanced normalized relational representation of the data using conformed dimensions for many of its foreign keys.  The use of the conformed dimensions will be key to standardizing across both operational applications that consume the ODS and the enterprise reporting.  We therefore built our first data warehouse iteration with four staging steps associated with four staging steps we know we will have in our final data warehouse:

  1. An initial staging area for assembly of conformed dimensions prior to loading the ODS.
  2. The operational data store.
  3. A staging area for data being assembled between the ODS and UDM.
  4. A relational database containing the unified dimensional model, which will be consumed directly, without modification, by SSAS.

This means we copied and stored our entire data set (260-odd million rows) four times on its way to the MOLAP cube.  Writing to these databases was by far the bulk of the time.  We could copy the data between stages in less than half an hour each.  In this our first iteration, the data in each stage was identical, so the only real processing was done in the first stage, which created the conformed dimension tables and performed the foreign key lookups for the facts.  Still, the time difference between that first stage and the others was negligible.

Amazingly enough, once the data was in our UDM database, SSAS was able to read the full database and process it into a MOLAP cube in around six minutes!

SSIS does not by default display start and end times of its packages.  In our first iteration, we have not yet set up logging and timing infrastructure for our processes — we were running them manually through the debugger — so we do not have exact timings.  We do, however, want records, logs, and timings so we can evaluate the impact of our future decisions on the behavior of our data warehouse process, so it is likely that out second iteration, rather than adding data, will add this infrastructure.

Honestly, I am looking forward to our next iteration, and I have not had such enthusiasm for our data warehouse since I took it over six months ago.  Dave Rodabaugh’s advice was excellent: go long and thin.  Meaning  iterate small incremental improvements through the entire assembly line rather than engineering to perfection each stage in a waterfall-like process.  Currently, we have no idea how long our data warehouse rewrite process will take (well, we do have our guesses), but we expect to have several iterations under our belts in short order (4-6 weeks), and so we will then have an idea as to how long it takes to do certain things.  We should then be able to roadmap out the things we do want to do and attach time estimates to them.  These estimates will not be perfect (“prediction is very difficult, especially about the future”), but they will be a whole lot better than our current guesswork.

Tuesday, July 3, 2012

Write What You Mean First

We recently had the pleasure of a week-long visit from Dave Rodabaugh of Atlas Analytics, during which Josh, Dave, and I got to debate the appropriate design patterns in our ETL processes.  My intent is to be able to periodically perform a full load of our data warehouse in case we change some of our processing logic.  This is particularly pertinent to our logic in grouping claim filings into encounters and matching remittances to their originating claims.  As Dave put it, “full loads will occur far more often than anticipated,” and, well, I anticipate performing full loads with some regularity.

Just our current incremental loads take several hours to load a day’s worth of claims.  (I sometimes joke that it takes 25 hours to load 24 hours’ worth of claims.)  Since our full data source contains many years of claims, a full data load seems almost out of the question.  However, both Josh and I do share the confidence that we can come up with a feasible way to perform a full load of our data warehouse.

So Dave is explaining the standard ETL design pattern of processing dimensions separately from your facts.  There is a fair bit of elegance and simplicity to this approach, but I raised the objection that, with the quantities of data we are dealing with, multiple scans through such massive data sets looking for unique dimension values is prohibitively wasteful.  Josh countered that our limiting factor is write speed, and that read speed is negligible.  In general, he is right, but I felt that in this specific case, we may be scanning one or two billion rows to find just a thousand or two unique values to put in a dimension table, and doing so for each dimension will multiply our processing time.

Dave Rodabaugh, ever the diplomat, was able to phrase my position succinctly, but also in such a way that caused me to acquiesce to the suggestion that we process dimensions separate from (and prior to) our fact flow.  We all agreed that there were optimizations that could be done so that the situation was not quite so dire as I predicted, but to paraphrase Dave:

If the concern is that the time it takes to scan through the data will be the biggest time sink, then it makes sense to handle the data less often.  In such a case, performing dimension processing within the fact dataflow makes sense.

That’s a pretty heavy paraphrase, since I do not have the gift of oratory that Dave has.  However, notice how he couched his agreement with me in a condition: If the concern is that the time it takes to scan through the data will be the biggest time sink.  Well, when someone makes any statement of fact, I have the kind of argumentative mind that wants to find holes in the argument.  In this case, examining the condition is the obvious place to start, and I realized that I was making an assumption in my position.  My assumption was not without basis — it was based on my experiences dealing with our current data warehouse — but it was not based on any data.  The title of this blog is “Software Science,” and here I am taking a wholly unscientific approach to the design of my ETL processes.  Even worse, I am constantly a proponent of the philosophy Write what you mean first, and tune only when required.

I am not the first to come up with that opinion.  Forty years ago, Donald Knuth expressed a similar sentiment: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil” [Knuth 1974].  Hearing my position phrased this way, however, made me realize that I was optimizing prematurely.

So today we took our first cut at a full data load: loading two dimensions and a single measure.  Not much, but it will give us a lower bound as to how long our full data load may take.  To my very pleasant surprise, two SELECT DISTINCT’s on a couple of hundred million rows to produce a few thousand distinct values took less than 30 seconds!  This serves as a great reminder to me to:

Write what you mean first.
There was an error in this gadget