The lookup transform in SSIS 2008 gives you the ability to join data and eliminate non matching rows. One of the limitations of the lookup is the fact it requires an OLEDB connection. However, with the cache connection manager you can use just about any other data source as your lookup table. If you have an excel file that contains your lookup information, it would be nice to be able to lookup the information in the excel file without having to load the excel file into a table.
To accomplish this we will load the excel file into a cache connection manager and use this in the data flow. Below is my SQL Table with first names and my excel file with last names. They both have matching ID columns. I will match up the id in the lookup transform to combine the names.
Excel File with First Names
SQL Table with last names
The SSIS package will require two data flows. The first will load the excel file into cache. The second will have the table source and the lookup transform. Here is the Control Flow of the package.
In the first data flow we will have the excel source and a cache transform.
The Excel source is just a simple source with the ID column and the First Name Column. The only change you may have to make is to the data types in the excel source. I had to change the data type of the ID field to integer. That is because my table has an ID column with a data type of integer. To change the data type, right click on the excel source and select Advanced Editor. Click on the Input and Output Properties Tab, expand the output columns folder, select the ID column, and change the data type property as needed.
After you have the source ready, drag the green data flow line to the cache transform. Click the New button and give the cache connection manager a name. Then click on the columns tab and set the ID column’s index property to 1. You must have at least columns with an index property value greater than zero.
Click on the mappings node in the cache transform and map the columns from excel to the cache connection. Notice the magnifying glass next to the ID, this indicates the index column.
In the next data flow we will have and OLEDB source from the SQL table and a lookup transform using the cache connection. I also have a terminator destination from task factory, just so we have a place to put a data viewer to see if the first and last names are joined.
The OLEDB Source is just a select from a table pulling the ID and the Last Names. In the lookup transform, set the connection type to the cache connection manager.
Click on the connection node and set the connection to the cache connection manager.
Click on the columns node and map the ID column to the ID column in the cache. Place a check next to the first name field.
Here are the results from the data viewer after the lookup.
You can use this same method to load just about anything into cache and use it in the lookup transform.
The cache connection manager has some major Pros and Cons. The Pro is the speed, once the data is loaded into cache the lookup performs by reading the rows saved in memory instead of having to perform table reads. The Con would be the limited amount of memory. If you have a large excel file it could cause buffer overruns and start spooling to disk. You would have to watch your PerfMon to see if this happens. A general rule I follow is about 50,000 rows. This varies with the number of columns and the size of the data. If you have a few columns with just integers then you can fit a lot more rows in cache. If you have a bunch of columns that are varchar(5000) then you will not be able to fit as many rows in cache. Some testing when setting this up.
Great article Mike!
What are the pro and cons with using a Cache Connection Manager to performs Lookups? Would you recommend it for huge table lookups? For example, when loading a Fact and looking up the surrogate key from multiple huge dimension tables, would pre-loading the data in Cache connection managers help?
Thanks!
Bebel,
Using a cache connection manager is recommended when your look up data is static, it is good for performance but if your reference data changes rapidely you have to use the OLEDB connection.