In the first article in this series, Incremental Loads with Change Data Capture – Enabling Change Data Capture, I explained how to enable change data capture. Once CDC is enabled you now have the ability to use functions to query change data. This change data can then be used to perform incremental loads of Facts and Dimensions. In this article I will provide you with a detailed explanation of how to identify changed data using the CDC tables and functions.
Before I dive to deep into the change detection process you will need to create a table that stores information about the change capture process. Specifically, each row in the table will contain the name of a table that is enabled for change capture and the last time the table was modified. I will explain how this data will be used later in the article. The following is the T-SQL script that will create the table:
CREATE TABLE dbo.ChangeTables
TableName varchar(256) NOT NULL,
Note that, when a row is initially inserted into this table it is likely that the LastChangeTime column for that row will be null. This is because you are just beginning the change detection process and the initial execution of the process should consume all data in your change tables. I will explain this in detail later.
When a table is enabled for changed capture a change table is created that will contain one row for each insert and delete operations performed against the source table, and two rows for each update operation performed against the source table. In article one of the series the Customers table was enabled for change capture. A corresponding change table, cdc.dbo_Customers_CT, was created. The first five columns in the change table are meta data columns that provide additional information applicable to the recorded changes. Two of these columns will be used in the change tracking process and will be explained later in this series of articles. Visit Books Online for additional information about each column.
QUERYING CHANGE DATA
A system stored procedure, sys.sp_cdc_generate_wrapper_function, is available to simplify the process of querying the change tables. Executing the following script will produce two table-valued functions:
EXEC sys.sp_cdc_generate_wrapper_function 'dbo_Customers'
One function (dbo.fn_all_changes_dbo_Customers) will return all the rows for a given table and the other (dbo.fn_net_change_dbo_Customers) will return the net changes for a row in a given table. Both functions use datetime intervals as parameters as opposed to LSNs. Note that each function is suffixed with the change table instance name. In the case of the Customers table the net function generated by the wrapper stored procedure would be named dbo.fn_net_changes_dbo_Customers.
For the purpose of change detection I suggest using the net changes function since it will only return one version of the truth for each row. For example, assume that an insert operation occurred on a row and two subsequent updates also occurred. In the event that the change detection process does not run until after all of these operations have completed, it will only pick up an insert into the table with the latest updated information for each column. For example, if the following row was inserted into the customers table:
Then immediately after the insert, an update was issued that changed the Company to PragmaticWorks.
If the dbo.fn_all_changes_dbo_Customers function was used it would return two rows, one row for the insert and one for the update. In this case, additional logic would be required to determine which row would be used in the incremental load. You could, on the other hand, pass both rows, but this could potentially create problems during the data load process.
Therefore, you should use the dbo.fn_net_changes_dbo_Customers function, which would return the final or net changes of all operations for each row. The result set would be as follows:
The result would contain a single row that represents the data as it exists in the table at the time the change detection process is executed. An additional column is included in the result set which represents the DML operation performed on the row. The column will contain three possible values, an I (Insert), D (Delete) or UN(Update). In the next article I will explain how these values can be used within an SSIS package when building Type I and Type II dimensions. In addition, the wrapper function will be modified to return an additional column that will assist in effectively determining if the row will be added as a Type I or Type II.
CHANGE DETECTION PROCESS
Now that all the pieces are in place, I will provide you with the steps needed to complete the change detection process. I will be using the Customers table in all of the following examples.
1. The first step is to find out when was the last time change detection ran for a particular table.
a. In this step you should obtain the LastChangeTime for the table from dbo.ChangeTables table. If it is not null got to step 2, if it is go to the next step (b).
b. If the value returned is null you will use the sys.fn_cdc_get_min_lsn function to return the minimum log sequence number or the low endpoint for the specified table.
c. Then you will use another function, sys.fn_cdc_map_lsn_to_time, which will return the corresponding date and time value for the specified lsn.
2. Now that you have the starting point, you will need to obtain the ending or high endpoint for the capture instance. This will be accomplished using the sys.fn_cdc_get_max_lsn() function. Then you will use the sys.fn_cdc_map_lsn_to_time function again to return the corresponding date and time value for the returned LSN.
3. Once these two values have been obtained you will use the dbo.fn_net_change_dbo_Customers table-valued function, passing in the low end point and high end point, resulting in the changed data.
4. Finally you will update the dbo.ChangeTables table, setting the LastChangeTime value for the specified table to the value returned in step 2.
The final script should resemble the following:
--Check change table for LastChangeTime
@fromDateTime = LastChangeTime
TableName = 'Customers'
--If the Change table does not have a value use the cdc lsn map to time function
IF(@fromDateTime IS NULL)
@fromDateTime = sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_min_lsn('dbo_Customers'))
--Obtain the high point using the get max lsn function
@toDateTime = sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_max_lsn())
--select change data using wrapper function
FROM dbo.fn_net_changes_dbo_Customers(@fromDateTime, @toDateTime, 'all with mask')
--update lastchange time
LastChangeTime = @toDateTime
Assuming that you have followed the steps in both this article and the previous article, you can now perform some DML operations to test the script. If you execute a couple of insert statements and update statements against the customers table and then run the script, a row should be returned for each insert statement. Any updates that were run on the existing table should be reflected in the result set.
To summarize, through the first two articles details have been provided to assist you in enabling CDC at the database and table level and how to write T-SQL scripts that will identify changed data. In the final article of this series I will outline the steps required to build an SSIS package that will consume the data produced from the change detection process and build Type I and Type II Slowly Changing Dimensions.
As always, if you have any questions, comments or concerns regarding this topic please send me an email at firstname.lastname@example.org.