posted 4/20/2012 by Daniel - Views: [2569]
This is a variation of a pattern I saw in SQL Server Integration Services Problem Design Solution by Erik Veerman, Jessica Moss, Brian Knight, and Jay Hackney. I would highly recommend this book. I found at least 4 chapters in the book that were each worth the price of the book. It is full of good and practical ideas for the SSIS developer.
In an ideal ETL all your data would load at the right time, your dimensions would come first so that by the time you got to your fact load all the lookups found the right member in the dimension, pulled the proper reference key and proceeded. Unfortunately we don’t all live in that world so we have to worry about missing dimensional data during our fact load. There are a lot of reasons why this can happen and this is one approach to solving this issue.
When I first started out I would shunt fact data aside when a dimension lookup failed. I would then build another process, or even worse I would try to deal with the problems manually. Obviously a manual intervention is only useful in the short term when you are troubleshooting the process and the data, not when you move everything to production. So I would build a separate process to deal with the shunted records. Unfortunately this is another process that requires set up and maintenance. I think the better option is to add those missing dimension records to your dimension(s) during your fact load process.
The basic idea is to shunt the records with a failed dimension lookup aside for a few extra transforms and then bring them back into the main data stream after creating your record and getting your dimension reference key. Here is the basic transform pattern I use:
Start with your lookup. Set the Cache mode to Full Cache and set the non matching entries to “Redirect rows to no match output.”
The first thing you do after you branch off your failed lookups is add a derived column named appropriately to receive the dimension key. Assuming your dimension has surrogate keys that are Integers; make the derived column a null integer.
The next step is really kind of a 2 part step. The first thing you need to do is to create a stored procedure that will take the little bit of valid data you can pass to the dimension and fill in the gaps with unknowns, blanks, zeros, or nulls for each column where you do not have useful data. In addition to taking standard input parameters and filling a dimension record this stored procedure must have an output parameter. It is pretty simple to do, just add an appropriately named parameter (I usually name it something like DimensionNameSurrogateID, but really any naming convention you follow should be fine) in the SP as you normally would, but with the keyword OUTPUT after the parameter definition.
Additionally, since these newly added records will not be in the lookup cache for the dimension (assuming you are using caching) you need to check the dimension to see if the record exists and return that key, or add the record and return the key. The reasoning here is pretty simple. Many fact records may refer to a single dimension record, you may have just added a new record in the previous row that you need in the next row. You don’t want to add every record that travels down this part of the data flow. In any case, at the bottom of the stored procedure you want to return the identity of the record
I start by declaring a local variable and checking the dimension table to see if the record exists and set the local variable to the ID value of the record if it does exist.
DECLARE @SKID INT
SET @SKID = NULL
SELECT @SKID = ID
FROM dbo.dimensionTable
WHERE TableValues = @InputValues
Next I test the local variable for NULL. If it is null then I need to insert a record, if it is not null I already have a record and I set my Output parameter to the variable value.
IF @SKID IS NOT NULL
BEGIN
SET @SKScheduleID = @SKID
END
ELSE
INSERT INTO dbo.dimensionTable
(
Your
,table
,column
,names
,here
)
SELECT @YourInputVariables
,'AndDummy'
,'Values'
,'Here'
SET @SKScheduleID = SCOPE_IDENTITY()
Then if an insert is needed I set my output parameter to SCOPE_IDENTITY() to return the new record ID to the derived column value. Obviously we are only adding the bare minimum of data to create a sort of place holder in the dimension and get an ID back. We are not trying to populate all the dimensional data.
The second part of this step is to create the OLEDB Command that feeds the stored procedure. You need to configure the SQL Command for the stored procedure, and make sure to follow your last question mark (for the output parameter) with the keyword OUTPUT.
On the mapping tab of the OLEDB Command map all of your input values from your data stream to the correct input parameter, and map your output parameter to the derived column you created in the previous step.
Personally after I get the Key from the dimension I like to add a Count transform. I start by creating a package level variable appropriately named for the object it is counting, and then configure a Count downstream from the OLEDB Command. But this is an optional step.
Then finally, I route the data stream back to a Union transform to rejoin the shunted data stream for the inferred record back to the main data stream. Make sure to use the derived column you just populated with the output value from the stored procedure in the Union. I think this is fairly obvious but it is easy to overlook when configuring the mapping,
Some people may wonder, “Why not just feed all the required information in to the fact load data stream and populate your dimensions this way rather than doing a separate ETL for them.” There are several good reasons. You will probably be able to optimize your dimension load ETL a little more easily if you keep it separate. Also the OLE DB Command transform is very slow and will not scale well when adding thousands or millions of records, but for the occasional one off of a missing dimensional record it should be just fine.
When you add a dimension record in this manner during a fact load with just enough data to make a valid record you are adding what is called an inferred dimension record. There is at least one additional issue you need to deal with when adding inferred records. Your primary dimension load must have steps to accommodate these inferred records. Basically I create a conditional split upstream of any Type 1 or Type 2 SCD conditional split evaluations to look for one of my dummy values. If there is a dummy value present the entire record is updated with the data from the data stream. If you treat it like any old Type 1 or Type 2 evaluations you may find that you have either incomplete data or a new record when you don’t want a new record.
I hope you find this useful. Thanks for reading.