One of the most challenging aspects of data warehouse ETL is handling slowly changing dimensions (SCD). First, you must determine what type of changes are required for each dimension attribute. Will no changes be allowed? Will updates be made to some attributes but you don't care about history? Or, do you need to keep track of historical changes? For example, a birthdate should never change, but marital status can change. Is it important to know the marital status for any point in time? If so, then marital status is a Type 2 SCD. If you want to make the change, but don’t care about history, then it is a Type 1 SCD. For a given dimension, you can have a mixture of Type 1 and Type 2 attributes.
Whenever there is a change to a Type 2 attribute, the history must be saved. The way this is usually implemented is by having multiple rows for the same record in the dimension table with start and end dates to signify the time period that the row is valid. Each time there is a change, the current row is expired by updating the end date column. Then a new row for the entity is inserted.
Also, a surrogate key, usually an integer, is used as the primary key of the dimension table. The source system key is also used in the dimension table and it is called a business key or alternate key.
When loading dimensions, you have several options for loading the changes. The first option is to use the SSIS Slowly Changing Dimension Wizard transform. The SCS Wizard relatively easy to use but it is only really appropriate for a small number of changes. The reason for this is that it checks one row at a time for changes and then performs each update individually.
Figure 1 is an example of the results of running the SCD Wizard. On the right, the derived column transform adds an end date that will be used to expire rows by the OLE DB command. The Union All transform is used to combine the actual new rows and replacement rows for the changes. The derived column on the left is used to add a value for start date. Finally, all the rows are inserted into the destination.
FIGURE 1: The SCD Wizard
All DBAs know that row by row processing is usually a bad idea. Despite the ease of using the SCD Wizard it will probably be more beneficial in the long run, despite the extra development time, to create your own solution. A popular method is by directing the changed rows to a staging table. Then one UPDATE statement can be used to set the end date column all at once instead of one row at a time. Once the update is complete, the replacement rows can be inserted. Figure 2 is an example that takes advantage of the 2012 Change Data Capture components. The updated rows are directed to a staging table for processing in later steps.
FIGURE 2: Using the 2012 CDC components and a built from scratch method
There is a third option if you are lucky enough to use Pragmatic Works Task Factory in your shop. Task Factory has a replacement for the SCD wizard called the TF Dimension Merge SCD transform that is easy to use, performs well, and has more granular options.
To use the transform, set up a data source from your source system and your existing dimension. Connect both sources to the TF Dimension Merge SCD transform as in Figure 3.
FIGURE 3: The TF Dimension Merge Slowly Changing Dimension transform
Set the SCD Column type for each attribute (Figure 4), map the columns and continue setting the properties on each tab. These also include properties for auditing and logging.
FIGURE 4: Configure the transform
Once the properties are configured, the output can be directed to several data destinations (Figure 5).
FIGURE 5: Multiple destinations
How to deal with Slowly Changing Dimensions is one of the most important design decisions you will have to make. Luckily, you have a choice that is both easy to implement and performs great!