The Aggregate Transform in SSIS can be very slow and cumbersome as it will cause a bottleneck in your packages. Granted it can do some cool things, like provide you with a SUM of all the sales in your data, or give you the maximum sales date. The downside, IT IS SLOW!! Luckily there is an alternative. Whenever possible do the aggregate in the source query. We will go through the aggregates in the SSIS transform and how they can be replaced in the source query for your data flow.
An aggregate is function that performs a calculation on a particular set of values or data. These will all return a single value. For instance, the COUNT function will return a single number as the result. This is because all of the data provided is used in the given function producing a single output. Many times the aggregate function will be used with, and sometimes require, a GROUP BY in your query. For instance, selecting the MAX date along with the customer number to find when the last order was placed for each customer would require a group by. Since the aggregate will produce only a single result, i.e. you can only have one MAX date, you have to tell SQL how to group the rest of the data around the MAX date. Grouping by the customer in this case will return the maximum date of order for EACH customer in the table. For all the aggregates discussed today a GROUP BY will be required if additional columns are specified.
For the examples presented a table called Aggregates will be used with the following data:
A set of sample queries will be provided after each example using the AdventureWorks (not 2008 or R2, just AdventureWorks) database so that the results can be compared. They will be in the form of SELECT AGGREGATE(Column) FROM Table > RESULT. The smaller dataset given above is here for demonstration purposes so we can see easily how each function works and interacts with NULL.
This article will briefly touch on 5 aggregates: average, count, maximum, minimum and sum.
Average
Average, AVG, is used to find the average of a group of numbers. NULL values are ignored by this function. By default all the values in a set will be used. This means that if we have a column of quantities and the number 5 appears twelve times, then it will be included in the average twelve times. Therefore the average of all the Numbers in the sample table is 35 10+20+30+40+50+50+50 = 250/7 = 35.71, which since our data is an integer the result is an integer producing the result of 35. Had our column been in decimal or float then we would have gotten the result of 35.71. However, DISTINCT can be specified and each number will only be counted once giving the following result: 10+20+30+40+50 = 150/5 = 30. This is divided by 5 because there are five distinct values in the Numbers column.
SELECT AVG(ListPrice) FROM Production.Product > 438.6662SELECT AVG(DISTINCT ListPrice) FROM Production.Product > 437.4042SELECT AVG(ListPrice) FROM Production.Product WHERE ListPrice > 230 > 1145.1201
Count
The count aggregate is unique in several ways. This can be applied to a single column, SELECT COUNT(Numbers) FROM Aggregates or to the entire dataset, SELECT COUNT(*) FROM Aggregates. Each situation has its own default behaviors. COUNT(*) will include NULL values. In the sample dataset, this would return 8 since there are 8 rows in the table. On the other hand, COUNT(Numbers) will give the result of 7 as it does not include NULL values in the count. Additionally, DISTINCT can be specified to only give a count of distinct non-NULL values. COUNT(DISTINCT Numbers) returns 5 since there are 5 unique numbers in the column. Keep in mind what is being counted when using a where statement. If we specify a column in the count where the column is null the result will always be zero; that is because COUNT(column) only includes non-NULL values. To accomplish that count you would need to combine the COUNT(*) with a WHERE Column IS NULL.
SELECT COUNT(*) FROM Production.Product > 504 rowsSELECT COUNT(Size) FROM Production.Product > 211 rowsSELECT COUNT(DISTINCT Size) FROM Production.Product > 18 rowsSELECT COUNT(*) FROM Production.Product WHERE ListPrice > 230 > 186 rows
Maximum and Minimum
We will group MAX and MIN together since they do essentially the opposite, but operate the same way. With these aggregates, as with AVG and you will see with SUM, NULLs are ignored. Also, in similar fashion all the values are used but DISTINCT can be specified although is ultimately useless. The max of the sample dataset above is 50 regardless of including all three records of 50 or not. Since NULL is eliminated from these functions the MAX of the set is 50 and the MIN is 10, NULL is not even considered. There are however, ways around that, a topic for another time. For character columns, MSDN states that the highest (or lowest for MIN) value in the collating sequence will be returned. In the record set a-z, the MAX of the set would return Z and the MIN of the set would return A. It is noteworthy that MAX and MIN cannot be used on BIT columns, only numeric, date time and character data types.
SELECT MAX(ListPrice) FROM Production.Product > 3578.27SELECT MIN(ListPrice) FROM Production.Product > 0.00SELECT MIN(ListPrice) FROM Production.Product WHERE Size = 'M' > 8.99
SUM
As previously stated NULL values are eliminated from this function automatically. Again, as with the average, all values are used by default but you may specify only distinct records be used. Therefore the SUM(Numbers) = 250, while the SUM(DISTINCT Numbers) = 150. This will only work with numeric data types.
SELECT SUM(ListPrice) FROM Production.Product > 221087.79SELECT SUM(DISTINCT ListPrice) FROM Production.Product > 45052.64SELECT SUM(ListPrice) FROM Production.Product WHERE Size = 'M' > 543.41
A key to remember is when NULLs are and are not included in your results, which are generally excluded except COUNT(*). The other common mix-up is the value that is returned. For most functions tiny, small and regular INT data fields will result in an INT which can cause rounding errors. A simple way around this is to cast the field as a float or domical before applying the aggregate function.