We all know that robust exception handling is critical for any mature development effort. Experienced developers go even further and include steps to validate data BEFORE it is fed into a process. This validation step is extremely valuable for Extract, Transform and Load (ETL) processes as well. Failure to do this usually means that you will spend a lot of time handling errors throughout the ETL process.
The usual culprits for ETL headaches are NULLs, zero-length strings and leading/trailing spaces. Failure to take care of these items means that you will encounter errors if a table field does not permit NULLs. All three of these can cause problems during lookups when using the Lookup Transformation or Merge Join components.
In this post, I will demonstrate two ways to handle loads. The first is fairly simple and is commonly used by many ETL developers. The second is the approach I prefer to use and is easier AND faster to implement.
This approach uses a Derived Component Transformation to test for and replace NULLs with a non-NULL value. The Derived Column is used to check for NULLs in the columns FirstName, LastName, AddressLine1 and AddressLine2. When a NULL is detected, the NULL is replaced with an “N/A”. I have added two data viewers to confirm that the Derived Column transformation is working as expected and replaced the NULLs that exist in the field AddressLine2.
Derived Component Transformation
Data Viewer – Before Derived Column
Data Viewer – After Derived Column
Approach #2 – T-SQL Query Utility
If your data source can be queried using SQL, NULLs can be tested using built-in T-SQL functions. For example, to test for NULLs within the Addressline2 field the following syntax used for each field will accomplish the same task.
SELECT (field list) ,CASE WHEN AddressLine2 IS NULL THEN ‘N/A’ ELSE AddressLine2 END AS AddressLine2 FROM dbo.SomeTable
Naturally, writing this kind of query would extremely painful! This is where laziness provides the motivation for a better way. The attached T-SQL script at the end of this article is part of a tool I started many moons ago to ease the ETL development process. It has grown over time to include various items that have proven useful. To use the script, connect to the target source database and edit the values of the variables @object_name and @schema_name which are at the top of the script to match the source table or view.
When executed the script provides useful information about the table fields. The results include a field by field DEFAULT constraint create/drop, and index create/default which you may find useful. The result column labeled “NULL_SpaceOnly_Handler” contains the T-SQL syntax that will take care of NULLs, zero-length strings and leading/trailing spaces. The script substitutes a -999 for numeric values and “N/A” for string values. Simply copy the results into your SELECT statement and set the FROM clause to your target table. I have included a sample row below:
,CASE WHEN ISNULL(LEN(LTRIM(RTRIM([JobTitle]))), 0) != 0 THEN LTRIM(RTRIM([JobTitle])) ELSE 'N/A' END AS [JobTitle]
In the event your source query references multiple tables/views, the script can easily modified to include an alias for each table or view. The script will not work for unusual data types such as hierarchyid (at least not in this version!).
I hope this helps take some of the pain out of your ETL development. If you find any problems please let me know and I’ll make the correction.
Craig
DECLARE @object_name sysname, @schema_name sysname, @column_name sysname SET @object_name = 'Employee' SET @schema_name = 'HumanResources' SET @column_name = NULL /* Name: Craig Love Date: 11/08/2011 Company: PragmaticWorks Software Project: (TEMPLATE) Description: Provides details about a table's columns. */ SELECT sch.name AS [Schema] ,CAST(so.name AS varchar(100)) AS Table_Name ,CAST(sc.name AS varchar(100)) AS Col_Name ,CAST(st.name AS varchar(100)) AS Column_Type ,CASE WHEN st.name = 'varchar' THEN CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST(sc.max_length AS varchar(4)) END WHEN st.name = 'char' THEN CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST(sc.max_length AS varchar(4)) END WHEN st.name = 'nvarchar' THEN CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST((sc.max_length / 2) AS varchar(4)) END WHEN st.name = 'nchar' THEN CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST((sc.max_length / 2) AS varchar(4)) END ELSE '' END AS CharacterLengthOnly ,CASE sc.is_nullable WHEN 1 THEN 'Y' ELSE 'N' END AS Nullable ,',' + CAST(sc.name AS varchar(100)) AS Col_Name_w_Comma ,CAST(st.name AS varchar(30)) + CASE WHEN st.name = 'varchar' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST(sc.max_length AS varchar(4)) END + ')' WHEN st.name = 'char' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST(sc.max_length AS varchar(4)) END + ')' WHEN st.name = 'nvarchar' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST((sc.max_length / 2) AS varchar(4)) END + ')' WHEN st.name = 'nchar' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST((sc.max_length / 2) AS varchar(4)) END + ')' ELSE '' END AS ConsolidateDataType ,CAST(sc.name AS varchar(100)) + ' = @' + CAST(sc.name AS varchar(100)) + ',' AS UpdteText ,'@' + CAST(sc.name AS varchar(100)) + ' ' + CAST(st.name AS varchar(100)) + CASE WHEN st.name = 'varchar' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST(sc.max_length AS varchar(4)) END + ')' WHEN st.name = 'char' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST(sc.max_length AS varchar(4)) END + ')' WHEN st.name = 'nvarchar' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST((sc.max_length /2) AS varchar(4)) END + ')' WHEN st.name = 'nchar' THEN '(' + CASE WHEN sc.max_length = -1 THEN 'max' ELSE CAST((sc.max_length / 2) AS varchar(4)) END + ')' ELSE '' END + ', ' AS Proc_Parameters ,'@' + CAST(sc.name AS varchar(100)) + ', ' AS Proc_Input_Parameters ,'ALTER TABLE [' + sch.name + '].[' + CAST(so.name AS varchar(100)) + '] ADD CONSTRAINT def' + sch.name + '_' + CAST(so.name AS varchar(100)) + '_' + CAST(sc.name AS varchar(100)) + ' DEFAULT ' + CASE WHEN st.name = 'varchar' THEN '''N/A''' WHEN st.name = 'char' THEN '''N/A''' WHEN st.name = 'nvarchar' THEN '''N/A''' WHEN st.name = 'nchar' THEN '''N/A''' ELSE '0' END + ' FOR ' + CAST(sc.name AS varchar(100)) AS [DEFAULT] ,'ALTER TABLE [' + sch.name + '].[' + CAST(so.name AS varchar(100)) + '] DROP CONSTRAINT def' + sch.name + '_' + CAST(so.name AS varchar(100)) + '_' + CAST(sc.name AS varchar(100)) AS [DROP DEFAULT] ,',CASE WHEN ISNULL(LEN(LTRIM(RTRIM([' + CAST(sc.name AS varchar(100)) + ']' + + '))), 0) != 0 THEN LTRIM(RTRIM([' + CAST(sc.name AS varchar(100)) + '])) ' + CASE WHEN st.name IN ('varchar', 'char','nvarchar','nchar') THEN ' ELSE ''N/A'' END ' WHEN st.name IN ('int', 'tinyint', 'smallint', 'bigint', 'money') THEN ' ELSE -999 END ' ELSE ' ELSE ''N/A'' END ' END + 'AS [' + CAST(sc.name AS varchar(100)) + ']' AS [NULL_SpaceOnly_Handler] ,'CREATE INDEX [' + sch.name + '].[IDX_' + CAST(so.name AS varchar(100)) + '_' +CAST(sc.name AS varchar(100)) + '] ON ' + sch.name + '.' + CAST(so.name AS varchar(100)) + '(' + CAST(sc.name AS varchar(100)) + ')' AS Index_Create ,'DROP INDEX [' + sch.name + '].[IDX_' + CAST(so.name AS varchar(100)) + '_' +CAST(sc.name AS varchar(100)) + ']' AS Index_Drop FROM sys.columns sc INNER JOIN sys.objects so ON (sc.object_id = so.object_id) INNER JOIN sys.types st ON (sc.user_type_id = st.user_type_id) INNER JOIN sys.schemas sch ON (so.schema_id = sch.schema_id) WHERE so.name LIKE @object_name AND sch.name = @schema_name AND (@column_name IS NULL OR sc.name = @column_name)