Business Intelligence Blogs

View blogs by industry experts on topics such as SSAS, SSIS, SSRS, Power BI, Performance Tuning, Azure, Big Data and much more! You can also sign up to post your own business intelligence blog.

Don’t get left behind: Learn to use LAG

  • 20 October 2012
  • Author: Kathi Kellenberger
  • Number of views: 10095
  • 0 Comments

Two of the most exciting developments in the T-SQL world are the enhancements to the OVER clause and the new analytic functions. These particular enhancements will enable us to write code to solve complex problems in a very efficient way. As they say in sales: This is a game changer. 

There are eight analytic functions: CUME_DIST, FIRST_VALUE, LAG, LAST_VALUE, LEAD, PERCENTILE_CONT, PERCENTILE_DISC, and PERCENT_RANK. In this post, I’m going to explain how to use the LAG function.

The LAG function allows you to access data from a previous row in your data. For example, in the AdventureWorks database, what if you wanted to produce a list of the customers and their orders and the date of their last order? 

To do this pre-2012, you might write a cursor or a query like this:

SELECT CustomerID, SalesOrderNumber, OrderDate,
       (SELECT TOP(1) OrderDate
        FROM Sales.SalesOrderHeader A
        WHERE A.CustomerID = B.CustomerID
        AND A.OrderDate < B.OrderDate
        ORDER BY OrderDate) AS LastOrderDate
FROM Sales.SalesOrderHeader B
ORDER BY CustomerID, OrderDate;

This query uses a correlated sub-query to find the last order.  Take a look at the WHERE clause to see how the inner query is filtered by the outer query on CustomerID and for rows less than the outer query’s OrderDate. 

Here is a sample of the data. Notice that Customer 11001 placed orders on 2005-07-18, 2007-07-20, and 2008-06-12. The row showing the first order returns NULL in the LastOrderDate column.  The next row shows 2005-07-18 as the LastOrderDate. Finally, the last row returns 2007-07-20. The calculated LastOrderDate starts over with the next customer.

While this query doesn’t perform too badly on a small number of rows, it won’t scale. As more data is added over time, the performance will continue to get worse. It is essentially one-half of a Cartesian product.  *To learn more about why this query is a bad idea see Jeff Moden’s article on SQL Server Central.

If I turn on STATISTICS IO and run the query, I see 126,811 logical reads! 

We can easily and efficiently solve this query using LAG. Here is the syntax for LAG from Books Online:

LAG (scalar_expression [,offset] [,default])
    OVER ( [ partition_by_clause ] order_by_clause )

The scaler expression is the column name for the value we want to return. In this case the OrderDate. The offset, 1 by default, is how many rows in the group or partition you want to go back. If I want to skip the last value and go back one more, then the offset should be 2.  For our example, we will accept the default.

Like many functions in SQL Server we will use the OVER clause to specify the sort and the partition. In this case the data should be sorted by the OrderDate and partitioned by the CustomerID. Here is the new query:

SELECT CustomerID, SalesOrderNumber, OrderDate,
      LAG(OrderDate) OVER(PARTITION BY CustomerID ORDER BY OrderDate)
FROM Sales.SalesOrderHeader
ORDER BY CustomerID, OrderDate

The results returned are identical, but the performance is fantastic. The logical reads for this query is just 686. That’s an amazing improvement. This query is not only easy to write, it performs great!


If you want to get the number of days since the last order instead of the last order date, just nest the LAG expression in the DATEDIFF function. Make sure to include the OVER clause. The query to find the number of days can be written like this:

SELECT CustomerID, SalesOrderNumber, OrderDate,
     DATEDIFF(d,LAG(OrderDate) OVER(PARTITION BY CustomerID ORDER BY OrderDate),
     OrderDate) AS DaysSinceLast
FROM Sales.SalesOrderHeader
ORDER BY CustomerID, OrderDate

Here is a sample of the results:

Since the first order for customer 11001 is the first order, a NULL is returned. The second rows returns 732 since it was that long before the customer placed another other. Finally, the query returns 328 for the last row.
While we are talking about LAG, why don’t we take a look at LEAD as well? LEAD works just like LAG except that it returns the next value instead of the last value. Just for run, let’s use both LAG and LEAD in the same query to find the number of days since the last order and the number of days until the next order.

SELECT CustomerID, SalesOrderNumber, OrderDate,
     DATEDIFF(d,LAG(OrderDate) OVER(PARTITION BY CustomerID ORDER BY OrderDate),
     OrderDate) AS DaysSinceLast,
     DATEDIFF(d,OrderDate,LEAD(OrderDate) OVER(PARTITION BY CustomerID ORDER BY OrderDate))
     AS DaysUntilNext
FROM Sales.SalesOrderHeader
ORDER BY CustomerID, OrderDate

Here is a sample of the results:


 
The new T-SQL functions, LAG and LEAD, are powerful and easy to use. Once you are running SQL Server 2012, you’ll have one less excuse to write a poorly performing triangular join or cursor.

 

 

 

Print
Categories: SQL Server
Tags:
Rate this article:
No rating

Please login or register to post comments.