Often DISTINCT and GROUP BY produce the same results even though they operate at different points in query processing. DISTINCT operates as the last part of the SELECT clause, after any operations are performed to remove duplicate rows from the results. And, of course, GROUP BY is processed right after the WHERE clause. It is possible for the optimizer to rearrange things depending on the query.
You may be wondering why I even care about this. Well, I ran into a situation where the results of queries using these techniques are very, very different.
Window functions were initially introduced with SQL Server 2005. Microsoft enhanced this functionality in 2012 and added analytic functions. The window functions in the SELECT clause operate before DISTINCT, so using DISTINCT along with window functions may give you results you don’t expect.
Probably the most popular window function is ROW_NUMBER. I am just basing this on the response I get when I speak at SQL Saturdays and user group events and ask who uses the different functions. ROW_NUMBER applies incrementing row numbers to the results of the query in a specified order. You can also start the numbers over based on a column or multiple columns by using the PARITION BY option. Here is an example of using ROW_NUMBER. NOTE: I am using the AdventureWorks database for these examples.
SELECT ROW_NUMBER() OVER (ORDER BY CustomerID) AS RowNbr,
The OVER clause defines how the row numbers are applied. In this case, the row numbers are applied in in the order of CustomerID. Say I wanted to get a distinct list of the CustomerID values from the Sales.SalesOrderHeader table. I could do this in two different ways:
SELECT DISTINCT CustomerID
GROUP BY CustomerID;
Each query returns 19,119 rows from my copy of the database. Identical results. Now what would happen if I add ROW_NUMBER to each query? Would I continue to get back identical results from each query? My intuition says yes, but intuition doesn’t count with SQL Server.
The two queries are below.
SELECT DISTINCT CustomerID,
ROW_NUMBER() OVER(ORDER BY CustomerID) AS RowNbr
SELECT CustomerID, ROW_NUMBER() OVER(ORDER BY CustomerID) AS RowNbr
GROUP BY CustomerID;
The first query returns over 31,465 rows and the second query returns 19,119 rows. Why the difference? In the first case, the database engine produces a list of the CustomerID values from the SalesOrderHeader table. Those results are going to contain some duplicate customers since a customer can have more than one order. Next the row numbers are applied. This now makes the results unique for each row. After the row numbers are added, DISTINCT has no duplicates to remove.
In the second query, the GROUP BY clause groups CustomerIDs before the SELECT clause. Once the processing gets to the SELECT clause and adds the row numbers, the CustomerIDs are already unique so we get a unique list of CustomerIDs with the row numbers.
Make sure that you test your results whenever you write a query. There is always more to learn with T-SQL!