SSIS packages house the business logic that handles the transfer of data from source to destination. Integration Services can be used in a number of applications, whether it be business intelligence analytics or migrating a legacy system to a new one. The capabilities of SSIS are vast, and covering all of them in detail could take multiple volumes of books. However, this brief article will give you everything you need to know to develop a basic SSIS package. We'll start with creating a package and data sources for that package, move on to creating and editing Control Flow objects, and finish with Data Flow sources, destinations, and transformations.
First things first, you must create a package in order to get anything done. There are two ways to do this. The first method is to use the Import and Export Wizard found in SQL Server Management Studio. This method is only good for simple extraction and loading, because the Import and Export Wizard is not able to manipulate any data. The wizard allows you to either run the package immediately, schedule the package to run at a later time, or save it as a package in a project where you can run it later or add transformations. The second method is to use the Business Intelligence Development Studio (BIDS). In BIDS, you first have to create a new SSIS project in which to create and develop packages. To do this, open BIDS and choose New, Project from the File menu. Then select the Integration Services template and give your project a suitable name and file path. Once you have created your project, a new package named Package.dtsx will be right there waiting for you. You can rename it to whatever you'd like by right-clicking on the package and then clicking "Rename".
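BIDS is where you will do your real development, but it is worth knowing that a package is just a .dtsx file that can also be created through the SSIS object model. Here is a minimal sketch, assuming a reference to the Microsoft.SqlServer.ManagedDTS assembly; the package name and file path are hypothetical:

```csharp
using Microsoft.SqlServer.Dts.Runtime;

class CreatePackage
{
    static void Main()
    {
        // Build an empty package in memory, equivalent to the Package.dtsx
        // that BIDS creates for a new project.
        Package package = new Package();
        package.Name = "MyFirstPackage";  // hypothetical name

        // Save it to disk as a .dtsx file (the path is just an example).
        Application app = new Application();
        app.SaveToXml(@"C:\SSIS\MyFirstPackage.dtsx", package, null);
    }
}
```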
Now that you have your project created and a package ready to develop, you will need some Package Connections that allow you to connect to sources and destinations to transfer your data. But before you do this, it is a good idea to create Project Data Sources. A data source is optional and lives at the project level. Data sources contain connection strings that point to files or databases. The convenient thing about them is that they can be used by more than one package: if you need to update a connection string, you only have to do it once in the data source, and when you open the packages that use that data source, their connections will be synchronized automatically. To create a Project Data Source, right-click on the Data Sources folder in Solution Explorer and then click New Data Source. If you have already created a connection in the past, it will show up under "Data connection" and you can just select it from there. If it is a brand new connection, you will have to click "New" and set up the connection in the Connection Manager dialog.
Package Connections, on the other hand, are not optional; they are required in order to connect to any source or destination. You create a package connection by right-clicking in the "Connection Managers" section at the bottom of the package window. You can choose a connection that you have previously configured as a data source, or you can create a brand new connection that might only be used by this package. If you are creating a brand new connection, the dialog will look the same as the Connection Manager you saw when creating a data source.
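To give you a feel for what a package connection actually holds, here is a hedged sketch that adds an OLE DB connection manager through the object model; the connection name, server, and database are hypothetical:

```csharp
using Microsoft.SqlServer.Dts.Runtime;

class AddConnection
{
    static void Main()
    {
        Package package = new Package();

        // "OLEDB" is the creation moniker for an OLE DB connection manager.
        ConnectionManager conn = package.Connections.Add("OLEDB");
        conn.Name = "AdventureWorksConnection";  // hypothetical name

        // The connection string is only an example; point it at your own
        // server and database.
        conn.ConnectionString =
            "Provider=SQLNCLI10.1;Data Source=(local);" +
            "Initial Catalog=AdventureWorks;Integrated Security=SSPI;";
    }
}
```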
Now that you have your project created, your package created, and your connections configured, it's time to take a look at some of the other Integration Services features. First, we'll take a look at the Control Flow. The Control Flow is primarily composed of three kinds of objects: control flow tasks, control flow containers, and precedence constraints. A control flow task performs an operation such as sending an e-mail message, executing a SQL statement, or copying a file from an FTP server. A task has two possible outcomes, success or failure, and the precedence constraints connecting the tasks coordinate the order in which they run. To use a task, simply drag it from the toolbox to the Control Flow pane and configure it for the purpose you intend by double-clicking on it and completing the task editor.
Here is a list of tasks available for use in the control flow:
· ActiveX Script - Runs Microsoft Visual Basic Scripting Edition and JScript code and is included mainly for legacy support when a Data Transformation Services (DTS) package is migrated to SSIS.
· Analysis Services Execute DDL - Runs XML for Analysis (XMLA) code against an SSAS database.
· Analysis Services Processing - Allows the processing of SSAS objects through an SSIS package.
· Bulk Insert - Allows the execution of bulk copy operations for SQL Server (works only against SQL Server databases).
· Data Flow - Allows data processing from sources to destinations.
· Data Mining Query - Performs data mining queries and lets you capture the results for analysis.
· Data Profiling - Allows the analysis of source data for patterns, missing data, candidate keys, and statistics.
· Execute DTS 2000 Package - Runs a DTS package within SSIS.
· Execute Package - Runs other SSIS packages either deployed to SQL Server or in the file system.
· Execute Process - Runs a command-line operation such as program or batch file execution.
· Execute SQL - Runs SQL code against any underlying database connection, in the SQL dialect of the connected database engine (see the object-model sketch after this list).
· File System - Lets you copy, move, and delete files as well as perform other file and folder operations.
· FTP - Sends and receives files between the file system and an FTP server and performs simple file and folder operations on the FTP server.
· Message Queue - Integrates with Message Queuing (MSMQ) on a server running Windows to read and send messages.
· Script - Runs Microsoft Visual Basic 2008 or Visual C# 2008 code within an SSIS package.
· Send Mail - Sends an e-mail message through an SMTP server.
· Transfer - A family of tasks that copy SQL Server objects from one system to another, including databases, SQL Server Agent jobs, error messages, logins, master stored procedures, and database-level objects.
· Web Service - Lets you connect to a Web service to send or receive information.
· WMI Data Reader - Lets you run a Windows Management Instrumentation (WMI) query against the operating system to capture server information.
· WMI Event Watcher - Waits for a particular event before executing.
· XML - Combines, queries, and differentiates multiple XML files on the server.
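As an example of what the task editor is doing for you behind the scenes, here is a hedged sketch that adds an Execute SQL task to a package through the object model; the connection, task name, and SQL statement are all hypothetical:

```csharp
using Microsoft.SqlServer.Dts.Runtime;

class AddExecuteSqlTask
{
    static void Main()
    {
        Package package = new Package();
        ConnectionManager conn = package.Connections.Add("OLEDB");
        conn.Name = "MyConnection";  // hypothetical connection

        // "STOCK:SQLTask" is the creation moniker for the Execute SQL task.
        TaskHost taskHost = (TaskHost)package.Executables.Add("STOCK:SQLTask");
        taskHost.Name = "Truncate Staging Table";

        // Point the task at the connection and give it a statement to run
        // (the table name is just an example).
        taskHost.Properties["Connection"].SetValue(taskHost, conn.ID);
        taskHost.Properties["SqlStatementSource"].SetValue(
            taskHost, "TRUNCATE TABLE dbo.StagingCustomer");
    }
}
```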
Additionally, you can use containers, which can also be found in the toolbox while in the Control Flow pane. Containers give you an added level of control when it comes to coordinating the execution of your tasks. Tasks and other containers can be placed inside a container. The three types of containers are listed below:
· Sequence - Lets you organize subordinate tasks by grouping them together, and lets you apply transactions or assign logging to the container.
· For Loop - Provides the same functionality as the Sequence container except that it also lets you run the tasks within it multiple times based on an evaluation condition, such as looping from 1 to 10 (see the sketch after this list).
· Foreach Loop - Also allows looping, but instead of evaluating a condition expression, it loops over a set of objects, such as the files in a folder.
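To make the For Loop concrete, here is a hedged sketch that builds a container looping from 1 to 10 over a package variable; the variable and container names are hypothetical:

```csharp
using Microsoft.SqlServer.Dts.Runtime;

class AddForLoop
{
    static void Main()
    {
        Package package = new Package();

        // A user variable to drive the loop (the name is just an example).
        package.Variables.Add("Counter", false, "User", 1);

        // "STOCK:FORLOOP" is the creation moniker for the For Loop container.
        ForLoop loop = (ForLoop)package.Executables.Add("STOCK:FORLOOP");
        loop.Name = "Loop 1 to 10";
        loop.InitExpression = "@Counter = 1";               // runs once, before looping
        loop.EvalExpression = "@Counter <= 10";             // loop while this is true
        loop.AssignExpression = "@Counter = @Counter + 1";  // runs after each pass

        // Any tasks placed inside the container run once per iteration.
    }
}
```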
The last topic we'll look at is the Data Flow. The data flow is actually a task that you drag over while in the control flow, but when you double-click on it to configure it, it takes you to a whole new pane with a brand new set of objects in the toolbox. These objects are grouped into sources, transformations, and destinations.
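Under the covers, the Data Flow task is added like any other task, but its inner object is the pipeline that this new pane edits. Here is a hedged sketch, assuming references to the Microsoft.SqlServer.ManagedDTS and Microsoft.SqlServer.DTSPipelineWrap assemblies; the component class ID shown is the SSIS 2008 OLE DB source:

```csharp
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

class AddDataFlowTask
{
    static void Main()
    {
        Package package = new Package();

        // "STOCK:PipelineTask" is the creation moniker for the Data Flow task.
        TaskHost taskHost = (TaskHost)package.Executables.Add("STOCK:PipelineTask");
        taskHost.Name = "My Data Flow";

        // The task's inner object is the data flow pipeline, to which
        // sources, transformations, and destinations are added.
        MainPipe pipeline = (MainPipe)taskHost.InnerObject;
        IDTSComponentMetaData100 source =
            pipeline.ComponentMetaDataCollection.New();
        source.ComponentClassID = "DTSAdapter.OleDbSource.2";  // OLE DB source
    }
}
```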
Data flow sources use package connections to point to a server or file to extract data from. You use these in the same way you use all of the other objects from the toolbox that we've gone over. Simply click on the source that you want and drag it over into the Data Flow pane. Then double-click the source to configure. The types of data sources include:
· ADO.NET - Provides connections to tables or queries through an ADO.NET provider.
· Excel - Allows extractions from an Excel worksheet defined in an Excel file.
· Flat File - Connects to a delimited or fixed-width file created with different code pages.
· OLE DB - Connects to installed OLE DB providers, such as SQL Server, Access, SSAS, and Oracle.
· Raw File - Stores native SSIS data in a binary file type useful for data staging.
· XML - Allows raw data to be extracted from an XML file; requires an XML schema to define data associations.
Data flow destinations are a lot like sources, except that they are the endpoint of the ETL process instead of the starting point. Destinations also require a package connection to know where to send the data. They are configured in the same way that the source is configured. They include:
· ADO.NET - Allows insertion of data by using an ADO.NET managed provider.
· Data Mining Model Training - Lets you pass data from the data flow into an SSAS data mining model to train it.
· DataReader - Lets you put data in an ADO.NET recordset that can be programmatically referenced.
· Dimension Processing - Lets SSAS dimensions be processed directly from data flowing through the data flow.
· Excel - Used for inserting data into Excel, including Excel 2007.
· Flat File - Allows insertion of data to a flat file such as a comma-delimited or tab-delimited file.
· OLE DB - Uses the OLE DB provider to insert rows into a destination system that allows an OLE DB connection.
· Partition Processing - Allows SSAS partitions to be processed directly from data flowing through the data flow.
· Recordset - Takes the data flow and creates a recordset in a package variable of type object.
· SQL Server Compact - Lets you send data to a mobile device running SQL Server Compact.
· SQL Server - Provides a high-speed destination specific to SQL Server 2008, available when the package runs on the destination SQL Server itself.
Data flow transformations allow you to modify and manipulate data in the data flow. A transformation performs an operation either on one row of data at a time or on several rows at once. Again, these are used the same way as all of the other objects in BIDS: just drag the transformation from the toolbox to the Data Flow pane, connect the line (data path) from the previous transformation or source, and then double-click to configure. The green data paths are for rows that are successfully transformed, and the red paths are for rows that failed the transformation. There are a few different classifications of data flow transformations: logical row-level transformations, multi-input/multi-output transformations, multi-row transformations, and advanced data-preparation transformations.
Logical row-level transformations perform operations on each row without needing other rows from the source.
· Audit - Adds additional columns to each row based on system package variables such as ExecutionStartTime and PackageName.
· Cache Transform - Allows data that will be used in a Lookup Transformation to be cached and available for multiple Lookup components.
· Character Map - Performs common text operations, such as Uppercase, and allows advanced linguistic conversion operations.
· Copy Column - Duplicates column values in each row to new named columns.
· Data Conversion - Creates new columns in each row based on new data types converted from other columns - for example, converting text to numeric.
· Derived Column - Uses the SSIS Expression language to perform in-place calculations on existing values; alternatively, allows the addition of new columns based on expressions and calculations from other columns and variables (see the expression examples after this list).
· Export Column - Exports binary large object (BLOB) columns, one row at a time, to a file.
· Import Column - Loads binary files such as images into the pipeline; intended for a BLOB data type destination.
· Row Count - Tracks the number of rows that flow through the transformation and stores the number in a package variable after the final row.
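To illustrate the Derived Column transformation mentioned above, here are a few hedged examples of SSIS expressions; the column names (FirstName, LastName, Phone, OrderDate) are hypothetical. The first builds a full-name column from two trimmed columns, the second substitutes a placeholder for NULL phone numbers, and the third extracts the year from a date column:

```
LTRIM(RTRIM(FirstName)) + " " + LTRIM(RTRIM(LastName))
ISNULL(Phone) ? "N/A" : Phone
YEAR(OrderDate)
```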
Multi-input and multi-output transformations can combine or separate data from multiple sources to one destination or one source to multiple destinations.
· Conditional Split - Routes or filters data based on Boolean expressions to one or more outputs, where each row is sent down only one output path (see the expression example after this list).
· Lookup - Matches pipeline column values against an external database table; additional columns can be added to the data flow from the external table.
· Merge - Combines the rows of two similar sorted inputs, one on top of the other, based on a defined sort key.
· Merge Join - Joins the rows of two sorted inputs based on defined join columns, adding columns from each source.
· Multicast - Generates one or more identical outputs, where every row is sent down every output path.
· Union All - Combines one or more similar inputs, stacking rows one on top of another, based on matching columns.
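Conditional Split outputs are also defined with SSIS expressions, evaluated in order. As a hedged example with a hypothetical Country column, the following two expressions would route US and Canadian rows down their own paths, with everything else falling through to the default output:

```
Country == "US"
Country == "CA"
```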
Multi-row transformations perform work based on criteria from multiple input rows or generate multiple output rows from a single input row.
· Aggregate - Associates records based on defined groupings and generates aggregations such as SUM, MAX, MIN, and COUNT.
· Percent Sampling - Filters the input rows by allowing only a defined percent to be passed to the output path.
· Pivot - Takes multiple input rows and pivots them to generate an output with more columns based on the original row values.
· Row Sampling - Outputs a fixed number of rows, sampling the data from the entire input, no matter how much larger than the defined output the input is.
· Sort - Orders the input based on defined sort columns and sort direction and allows the removal of duplicates across the sort columns.
· Unpivot - Takes a single row and outputs multiple rows, moving column values to the new row based on defined columns.
The final group of transformations lets you perform advanced operations on rows in the data flow.
· OLE DB Command - Performs database operations such as updates and deletes, one row at a time, based on mapped parameters from input rows.
· Slowly Changing Dimension - Processes dimension changes, including tracking dimension history and updating dimension values. The Slowly Changing Dimension transformation handles these common dimension change types: Historical Attributes, Fixed Attributes, and Changing Attributes.
· Data Mining Query - Applies input rows against a data mining model for prediction.
· Fuzzy Grouping - Associates column values with a set of rows based on similarity, for data cleansing.
· Fuzzy Lookup - Joins a data flow input to a reference table based on column similarity. The Similarity Threshold setting specifies the closeness of allowed matches - a high setting means that matching values are closer in similarity.
· Script Component - Provides VB.NET or C# scripting capabilities against rows, columns, inputs, and outputs in the data flow pipeline (see the sketch after this list).
· Term Extraction - Analyzes text input columns for English nouns and noun phrases.
· Term Lookup - Analyzes text input columns against a user-defined set of words for association.
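To show what the Script Component looks like in practice, here is a hedged C# sketch of the per-row method that the Script Component designer generates; the base class (UserComponent), the buffer type (Input0Buffer), and the Email column are all designer-generated or hypothetical, so this compiles only inside the designer's project:

```csharp
public class ScriptMain : UserComponent
{
    // Called once for every row that flows through the component.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Normalize a hypothetical Email column to lowercase,
        // skipping rows where the column is NULL.
        if (!Row.Email_IsNull)
        {
            Row.Email = Row.Email.Trim().ToLower();
        }
    }
}
```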
These are just a handful of basics that one should have at least some knowledge of before developing an SSIS package. This is by far not the full potential of SSIS, but it's everything you need to know for a basic ETL package.
Please feel free to let me know if you have any questions or comments. Or if you think there's another basic feature that should be addressed in this article, be my guest. Your input is greatly appreciated.