ETL Best Practice - Pre-Treat your Data!

Who is online?  0 guests and 0 members
Home  »  Articles  »  ETL Best Practice - Pre-Treat your Data!

ETL Best Practice - Pre-Treat your Data!

change text size: A A A
Published: 11/24/2011 by  CraigLove  - Views:  [1185]  

 


We all know that robust exception handling is critical for any mature development effort.  Experienced developers go even further and include steps to validate data BEFORE it is fed into a process.  This validation step is extremely valuable for Extract, Transform and Load (ETL) processes as well.  Failure to do this usually means that you will spend a lot of time handling errors throughout the ETL process.

The usual culprits for ETL headaches are NULLs, zero-length strings and leading/trailing spaces.  Failure to take care of these items means that you will encounter errors if a table field does not permit NULLs.  All three of these can cause problems during lookups when using the Lookup Transformation or Merge Join components.

In this post, I will demonstrate two ways to handle loads.  The first is fairly simple and is commonly used by many ETL developers.  The second is the approach I prefer to use and is easier AND faster to implement.

Approach #1 – Derived Column Transformation

This approach uses a Derived Component Transformation to test for and replace NULLs with a non-NULL value.   The Derived Column is used to check for NULLs in the columns FirstName, LastName, AddressLine1 and AddressLine2.  When a NULL is detected, the NULL is replaced with an “N/A”.    I have added two data viewers to confirm that the Derived Column transformation is working as expected and replaced the NULLs that exist in the field AddressLine2.

Derived Component Transformation

Derived Column Transformation

 Data Viewer – Before Derived Column

Data Viewer - Before Derived Column

 Data Viewer – After Derived Column

Data Viewer - After Derived Column

 

Approach #2 – T-SQL Query Utility

If your data source can be queried using SQL, NULLs can be tested using built-in T-SQL functions.  For example, to test for NULLs within the Addressline2 field the following syntax used for each field will accomplish the same task.

SELECT 
	(field list)
	,CASE
WHEN AddressLine2 IS NULL THEN ‘N/A’
ELSE AddressLine2
	END AS AddressLine2
FROM dbo.SomeTable

 Naturally, writing this kind of query would extremely painful! This is where laziness provides the motivation for a better way. The attached T-SQL script at the end of this article is part of a tool I started many moons ago to ease the ETL development process. It has grown over time to include various items that have proven useful. To use the script, connect to the target source database and edit the values of the variables @object_name and @schema_name which are at the top of the script to match the source table or view.

When executed the script provides useful information about the table fields. The results include a field by field DEFAULT constraint create/drop, and index create/default which you may find useful. The result column labeled “NULL_SpaceOnly_Handler” contains the T-SQL syntax that will take care of NULLs, zero-length strings and leading/trailing spaces. The script substitutes a -999 for numeric values and “N/A” for string values. Simply copy the results into your SELECT statement and set the FROM clause to your target table. I have included a sample row below:

,CASE WHEN ISNULL(LEN(LTRIM(RTRIM([JobTitle]))), 0) != 0 
        THEN LTRIM(RTRIM([JobTitle]))  ELSE 'N/A' END AS [JobTitle]

In the event your source query references multiple tables/views, the script can easily modified to include an alias for each table or view.  The script will not work for unusual data types such as hierarchyid (at least not in this version!).

 I hope this helps take some of the pain out of your ETL development.  If you find any problems please let me know  and I’ll make the correction. 

Craig

 

DECLARE @object_name sysname,
        @schema_name sysname,
        @column_name sysname
        
SET @object_name = 'Employee'
SET @schema_name = 'HumanResources'
SET @column_name = NULL


/*
Name:			Craig Love
Date:			11/08/2011
Company:		PragmaticWorks Software
Project:		(TEMPLATE)
Description:	Provides details about a table's columns.  

*/


SELECT 	
	sch.name AS [Schema]
	,CAST(so.name AS varchar(100)) AS Table_Name
	,CAST(sc.name AS varchar(100)) AS Col_Name
	,CAST(st.name AS varchar(100)) AS Column_Type
	,CASE
			WHEN st.name = 'varchar' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST(sc.max_length AS varchar(4))
				END
			WHEN st.name = 'char' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST(sc.max_length AS varchar(4))
				END
			WHEN st.name = 'nvarchar' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST((sc.max_length / 2) AS varchar(4))
				END
			WHEN st.name = 'nchar' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST((sc.max_length / 2) AS varchar(4))
				END
			ELSE ''
		END AS CharacterLengthOnly
	,CASE sc.is_nullable 
		WHEN 1 THEN 'Y'
		ELSE 'N'
	END AS Nullable
	,',' + CAST(sc.name AS varchar(100)) AS Col_Name_w_Comma 
	,CAST(st.name AS varchar(30)) 
		+ 
		CASE
			WHEN st.name = 'varchar' THEN '(' 
				+ 	CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST(sc.max_length AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'char' THEN '(' 
					+ 	CASE
							WHEN sc.max_length = -1 THEN 'max'
							ELSE CAST(sc.max_length AS varchar(4))
						END
				+ ')'
			WHEN st.name = 'nvarchar' THEN '(' 
				+ 	CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST((sc.max_length / 2) AS varchar(4))
					END
			+ ')'
			WHEN st.name = 'nchar' THEN '(' 
					+ 	CASE
							WHEN sc.max_length = -1 THEN 'max'
							ELSE CAST((sc.max_length / 2) AS varchar(4))
						END
				+ ')'
			ELSE ''
		END AS ConsolidateDataType

	,CAST(sc.name AS varchar(100)) + ' = @' + CAST(sc.name AS varchar(100)) + ',' 
		 AS UpdteText
	,'@' + CAST(sc.name AS varchar(100)) + ' ' + 
		CAST(st.name AS varchar(100)) 
		+ 
		CASE
			WHEN st.name = 'varchar' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST(sc.max_length AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'char' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST(sc.max_length AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'nvarchar' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST((sc.max_length /2) AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'nchar' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST((sc.max_length / 2) AS varchar(4))
					END
				+ ')'
			ELSE ''
		END 
		+ ', '
		AS Proc_Parameters

	,'@' + CAST(sc.name AS varchar(100)) + ', ' AS Proc_Input_Parameters
	,'ALTER TABLE [' + sch.name + '].[' + CAST(so.name AS varchar(100)) + '] ADD CONSTRAINT def' + sch.name + '_' + CAST(so.name AS varchar(100)) + '_' + CAST(sc.name AS varchar(100)) + ' DEFAULT ' +
		CASE 
			WHEN st.name = 'varchar' THEN '''N/A'''
			WHEN st.name = 'char' THEN '''N/A'''
			WHEN st.name = 'nvarchar' THEN '''N/A'''
			WHEN st.name = 'nchar' THEN '''N/A'''
			ELSE '0'
		END
		+ ' FOR ' + CAST(sc.name AS varchar(100))  AS [DEFAULT]
	,'ALTER TABLE [' + sch.name + '].[' + CAST(so.name AS varchar(100)) + '] DROP CONSTRAINT def' + sch.name + '_' + CAST(so.name AS varchar(100)) + '_' + CAST(sc.name AS varchar(100))  AS [DROP DEFAULT]

	,',CASE WHEN ISNULL(LEN(LTRIM(RTRIM([' + CAST(sc.name AS varchar(100)) + ']' +
		+ '))), 0) != 0 THEN LTRIM(RTRIM([' + CAST(sc.name AS varchar(100)) + '])) '
	+ CASE 
			WHEN st.name IN ('varchar', 'char','nvarchar','nchar') THEN ' ELSE ''N/A'' END '
			WHEN st.name IN ('int', 'tinyint', 'smallint', 'bigint', 'money') THEN ' ELSE -999 END '
			ELSE ' ELSE ''N/A'' END '
		END 
		+ 'AS [' + CAST(sc.name AS varchar(100)) + ']' AS [NULL_SpaceOnly_Handler]

	,'CREATE INDEX [' + sch.name + '].[IDX_' + CAST(so.name AS varchar(100)) + '_' +CAST(sc.name AS varchar(100))  + '] ON ' + sch.name + '.' + CAST(so.name AS varchar(100)) + '(' + CAST(sc.name AS varchar(100)) + ')'  AS Index_Create
	,'DROP INDEX [' + sch.name + '].[IDX_' + CAST(so.name AS varchar(100)) + '_' +CAST(sc.name AS varchar(100))  + ']'  AS Index_Drop
FROM sys.columns sc
INNER JOIN sys.objects so
	ON (sc.object_id = so.object_id)
INNER JOIN sys.types st
	ON (sc.user_type_id = st.user_type_id)
INNER JOIN sys.schemas sch
	ON (so.schema_id = sch.schema_id)
WHERE so.name LIKE @object_name
AND sch.name = @schema_name
AND (@column_name IS NULL
     OR sc.name = @column_name)

DECLARE @object_name sysname,
        @schema_name sysname,
        @column_name sysname
        
SET @object_name = 'Employee'
SET @schema_name = 'HumanResources'
SET @column_name = NULL


/*
Name:			Craig Love
Date:			11/08/2011
Company:		PragmaticWorks Software
Project:		(TEMPLATE)
Description:	Provides details about a table's columns.  

*/

SELECT 	
	sch.name AS [Schema]
	,CAST(so.name AS varchar(100)) AS Table_Name
	,CAST(sc.name AS varchar(100)) AS Col_Name
	,CAST(st.name AS varchar(100)) AS Column_Type
	,CASE
			WHEN st.name = 'varchar' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST(sc.max_length AS varchar(4))
				END
			WHEN st.name = 'char' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST(sc.max_length AS varchar(4))
				END
			WHEN st.name = 'nvarchar' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST((sc.max_length / 2) AS varchar(4))
				END
			WHEN st.name = 'nchar' THEN 
			 	CASE
					WHEN sc.max_length = -1 THEN 'max'
					ELSE CAST((sc.max_length / 2) AS varchar(4))
				END
			ELSE ''
		END AS CharacterLengthOnly
	,CASE sc.is_nullable 
		WHEN 1 THEN 'Y'
		ELSE 'N'
	END AS Nullable
	,',' + CAST(sc.name AS varchar(100)) AS Col_Name_w_Comma 
	,CAST(st.name AS varchar(30)) 
		+ 
		CASE
			WHEN st.name = 'varchar' THEN '(' 
				+ 	CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST(sc.max_length AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'char' THEN '(' 
					+ 	CASE
							WHEN sc.max_length = -1 THEN 'max'
							ELSE CAST(sc.max_length AS varchar(4))
						END
				+ ')'
			WHEN st.name = 'nvarchar' THEN '(' 
				+ 	CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST((sc.max_length / 2) AS varchar(4))
					END
			+ ')'
			WHEN st.name = 'nchar' THEN '(' 
					+ 	CASE
							WHEN sc.max_length = -1 THEN 'max'
							ELSE CAST((sc.max_length / 2) AS varchar(4))
						END
				+ ')'
			ELSE ''
		END AS ConsolidateDataType

	,CAST(sc.name AS varchar(100)) + ' = @' + CAST(sc.name AS varchar(100)) + ',' 
		 AS UpdteText
	,'@' + CAST(sc.name AS varchar(100)) + ' ' + 
		CAST(st.name AS varchar(100)) 
		+ 
		CASE
			WHEN st.name = 'varchar' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST(sc.max_length AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'char' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST(sc.max_length AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'nvarchar' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST((sc.max_length /2) AS varchar(4))
					END
				+ ')'
			WHEN st.name = 'nchar' THEN '(' 
			 		+ CASE
						WHEN sc.max_length = -1 THEN 'max'
						ELSE CAST((sc.max_length / 2) AS varchar(4))
					END
				+ ')'
			ELSE ''
		END 
		+ ', '
		AS Proc_Parameters

	,'@' + CAST(sc.name AS varchar(100)) + ', ' AS Proc_Input_Parameters
	,'ALTER TABLE [' + sch.name + '].[' + CAST(so.name AS varchar(100)) + '] ADD CONSTRAINT def' 
		+ sch.name + '_' + CAST(so.name AS varchar(100)) + '_' + CAST(sc.name AS varchar(100)) 
			+ ' DEFAULT ' +
		CASE 
			WHEN st.name = 'varchar' THEN '''N/A'''
			WHEN st.name = 'char' THEN '''N/A'''
			WHEN st.name = 'nvarchar' THEN '''N/A'''
			WHEN st.name = 'nchar' THEN '''N/A'''
			ELSE '0'
		END
		+ ' FOR ' + CAST(sc.name AS varchar(100))  AS [DEFAULT]
	,'ALTER TABLE [' + sch.name + '].[' + CAST(so.name AS varchar(100)) + '] DROP CONSTRAINT def' 
		+ sch.name + '_' + CAST(so.name AS varchar(100)) + '_' + CAST(sc.name AS varchar(100))  
			AS [DROP DEFAULT]

	,',CASE WHEN ISNULL(LEN(LTRIM(RTRIM([' + CAST(sc.name AS varchar(100)) + ']' +
		+ '))), 0) != 0 THEN LTRIM(RTRIM([' + CAST(sc.name AS varchar(100)) + '])) '
	+ CASE 
			WHEN st.name IN ('varchar', 'char','nvarchar','nchar') THEN ' ELSE ''N/A'' END '
			WHEN st.name IN ('int', 'tinyint', 'smallint', 'bigint', 'money') THEN ' ELSE -999 END '
			ELSE ' ELSE ''N/A'' END '
		END 
		+ 'AS [' + CAST(sc.name AS varchar(100)) + ']' AS [NULL_SpaceOnly_Handler]

	,'CREATE INDEX [' + sch.name + '].[IDX_' + CAST(so.name AS varchar(100)) + '_' +CAST(sc.name AS varchar(100))  
		+ '] ON ' + sch.name + '.' + CAST(so.name AS varchar(100)) + '(' + CAST(sc.name AS varchar(100)) + ')'  
		AS Index_Create
	,'DROP INDEX [' + sch.name + '].[IDX_' + CAST(so.name AS varchar(100)) + '_' +CAST(sc.name AS varchar(100))  + ']'  
		AS Index_Drop
FROM sys.columns sc
INNER JOIN sys.objects so
	ON (sc.object_id = so.object_id)
INNER JOIN sys.types st
	ON (sc.user_type_id = st.user_type_id)
INNER JOIN sys.schemas sch
	ON (so.schema_id = sch.schema_id)
WHERE so.name LIKE @object_name
AND sch.name = @schema_name
AND (@column_name IS NULL
     OR sc.name = @column_name)

Saint101 likes this.
 
2.5
/5
Avg: 2.5/5: (3 votes)

Comments (7)

gopichand
gopichand said:
Hi Craig, The article is Very Useful. But iam confused with the functionality of the script. Can you explain briefly what this script does. Top simple script removes null values. What does this long scripts down do.
12/8/2011
 · 
 
by
CraigLove
CraigLove said:
I apologize for the confusion. The second (bottom) script is identical to the first and was not intended to be included in the posting. Is this the source of the confusion you mentioned?
12/8/2011
 · 
 
by
ayyappan
ayyappan said:
Thanks Graig for the template.
1/9/2012
 · 
 
by
Saint101
Saint101 said:
Lovely! Thanks Craig for the template. Is there anyway this can be manipulated to work with a stored procedure?
1/10/2012
 · 
 
by
Kingdom
Kingdom said:
Nice one Craig. The Approach #1 is well understood by me. Can you pls explain how to implement the long script of the Approach #2 as it got me confused. Thanks
1/10/2012
 · 
 
by
CraigLove
CraigLove said:
Kingdom- the trick is to use the script to build the SELECT clause in your Data Transformation's source component. The script sets up the NULL and empty string cleanup by reading the source table's metadata. It's a matter of copy/paste from the results of this script to the query that will be used in the ETL. I also want to point out that the script is not 100% perfect. There are some datatypes that it cannot handle or does not handle well- i.e. hierarchy. Oh the joys of continuous improvement!
1/10/2012
 · 
 
by
CraigLove
CraigLove said:
It could be changed to be a stored procedure. Of course the challenge there is where to store it. I like keeping it as a script because I try to avoid adding objects to databases where I might not be the DBA in charge.
1/10/2012
 · 
 
by
lcarpay
lcarpay said:
thanks Craig, it's works and more over it is a nice method of saving me from tedious work... thanks
1/12/2012
 · 
 
by

Most Recent Articles