Matching Current and Preceding Rows, Revisited
Fastest solutions use new APPLY operator and row numbers
September 20, 2006
Last month, I explored three solutions to a common T-SQL problem: matching rows based on their location in a sequence. The solutions addressed a classic example of such a problem, matching each customer's current and preceding orders and returning various calculations that you can use to compare measures and analyze trends. But if you use some of the new T-SQL features in SQL Server 2005, you can write simpler and faster solutions.
Last month, I compared the performance and I/O costs of a cursor solution and two set-based solutions. The cursor-based solution ran for 58 seconds, the scalar subquery using TOP with OR logic ran for 114 seconds, and the scalar subquery using TOP with AND logic ran for 17 seconds. (See "Matching Current and Preceding Rows," http://www.sqlmag.com, InstantDoc ID 92743, for details about those solutions.) This month, I present three additional solutions: one that uses the new APPLY operator and two that use the ROW_NUMBER function.
Refresher Course
To demonstrate this and last month's solutions, I use the SalesOrderHeader table in the AdventureWorks database.
The following query returns the columns of interest for our problem, as Table 1 shows in abbreviated form:
SELECT CustomerID AS CustID,
  CONVERT(CHAR(8), OrderDate, 112) AS OrderDate,
  SalesOrderID AS OrderID, TotalDue
FROM Sales.SalesOrderHeader
ORDER BY CustID, OrderDate, OrderID;
Your task is to write a query against the SalesOrderHeader table that performs the calculations needed to return the output that Table 2 shows (in abbreviated form). The table shows the current customer ID, current order date, current sales order ID, and current total due. It also includes the elapsed time in days between the order dates of the customer's preceding and current orders (the Elapsed column), the difference in total due between the current and preceding orders (the Diff column), and the percentage difference in total due between the current and preceding orders (the Pct column). Sorting by OrderDate, then OrderID, determines precedence among each customer's orders.
Because the SalesOrderHeader table in AdventureWorks is too small for performance testing, run the code in Web Listing 1 to create a 1-million-row version of SalesOrderHeader in tempdb. All the performance measures I present are based on that larger table in tempdb and don't include the time it takes to generate the output in SQL Server Management Studio (SSMS). To exclude SSMS output time, navigate to Tools, Options, Query Results, SQL Server, Results to Grid (or Results to Text), and enable the Discard results after execution option. Because of differences between your system and my test system, you might get slightly different runtimes than I did.
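Web Listing 1 isn't reproduced here. The following is only a rough sketch of one way to build such a test table; the row-multiplication technique, the column choices, and the index definitions are my assumptions, with the covering index's key and included columns inferred from the index name idx_cid_od_oid_i_td used later in the discussion:

USE tempdb;
GO
-- Sketch only (not the actual Web Listing 1): inflate AdventureWorks'
-- Sales.SalesOrderHeader (roughly 31,000 rows) about 32-fold to reach
-- approximately 1 million rows.
SELECT ROW_NUMBER() OVER(ORDER BY O.SalesOrderID, N.n) AS SalesOrderID,
       O.CustomerID,
       DATEADD(day, N.n, O.OrderDate) AS OrderDate,
       O.TotalDue
INTO dbo.SalesOrderHeader
FROM AdventureWorks.Sales.SalesOrderHeader AS O
  CROSS JOIN (SELECT TOP (32)
                ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS n
              FROM AdventureWorks.Sales.SalesOrderHeader) AS N;
GO
-- Clustered index on the surrogate order ID (the data rows that the
-- "clustered index seek" discussion refers to).
CREATE UNIQUE CLUSTERED INDEX idx_oid
  ON dbo.SalesOrderHeader(SalesOrderID);
-- Covering index; key and included columns are inferred from its name.
CREATE UNIQUE NONCLUSTERED INDEX idx_cid_od_oid_i_td
  ON dbo.SalesOrderHeader(CustomerID, OrderDate, SalesOrderID)
  INCLUDE(TotalDue);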
APPLY Solution
The fastest solution last month was based on a scalar subquery using TOP with AND logic. It ran for 17 seconds on my test system and incurred 5,992,708 logical reads. Listing 1 shows that solution's FROM clause, which matches current and preceding rows.
In this FROM clause, C represents the current row and P represents the row preceding the current row. The subquery in the FROM clause returns the OrderID for the order that precedes the one in instance C. The clause then matches each row in C with a row from P that has the OrderID returned by the scalar subquery.
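Listing 1 isn't reproduced here, but a sketch of the general TOP-with-AND-logic pattern looks like this (the exact code in the listing might differ):

FROM Sales.SalesOrderHeader AS C
  LEFT OUTER JOIN Sales.SalesOrderHeader AS P
    ON P.SalesOrderID =
       (SELECT TOP (1) O.SalesOrderID
        FROM Sales.SalesOrderHeader AS O
        WHERE O.CustomerID = C.CustomerID
          -- AND logic: the leading range predicate on OrderDate lets the
          -- optimizer seek an index on (CustomerID, OrderDate, SalesOrderID)
          AND O.OrderDate <= C.OrderDate
          AND (O.OrderDate < C.OrderDate OR O.SalesOrderID < C.SalesOrderID)
        ORDER BY O.OrderDate DESC, O.SalesOrderID DESC)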
The first solution this month follows similar logic to return, for each current order, the customer's preceding order.
However, instead of using a scalar subquery to return only the OrderID from the preceding order, I use the APPLY operator and a table subquery (aka derived table) to return all required attributes from the preceding order.
Listing 2 shows the solution's query. You specify the new APPLY relational operator in the FROM clause to invoke a table expression (a table-valued function or, as here, a derived table) for each row of an outer table. APPLY comes in two versions: CROSS APPLY, which doesn't return an outer row if the inner table expression returns an empty set for it, and OUTER APPLY, which does return such a row, with NULLs in the columns that would have come from the inner table expression. Listing 2 uses OUTER APPLY.
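Listing 2 itself isn't shown here; a sketch of an OUTER APPLY query along the lines described might look like this (the exact calculation expressions and the NULLIF guard against a zero preceding total are my assumptions):

SELECT C.CustomerID AS CustID,
       CONVERT(CHAR(8), C.OrderDate, 112) AS OrderDate,
       C.SalesOrderID AS OrderID, C.TotalDue,
       DATEDIFF(day, P.OrderDate, C.OrderDate) AS Elapsed,
       C.TotalDue - P.TotalDue AS Diff,
       (C.TotalDue - P.TotalDue) / NULLIF(P.TotalDue, 0) * 100.0 AS Pct
FROM Sales.SalesOrderHeader AS C
  OUTER APPLY
    -- For each current order C, return all required attributes
    -- of the immediately preceding order for the same customer.
    (SELECT TOP (1) O.OrderDate, O.TotalDue
     FROM Sales.SalesOrderHeader AS O
     WHERE O.CustomerID = C.CustomerID
       AND O.OrderDate <= C.OrderDate
       AND (O.OrderDate < C.OrderDate OR O.SalesOrderID < C.SalesOrderID)
     ORDER BY O.OrderDate DESC, O.SalesOrderID DESC) AS P;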
The execution plan produced for Listing 2's query is nearly identical to the one I showed last month for the scalar subquery solution, with one important difference: no clustered index seek operations are needed to look up the data rows. In last month's solution, the scalar subquery forced a seek operation on the idx_cid_od_oid_i_td index for each order to return the OrderID of the preceding order, but the solution still required a clustered index seek per order to look up the rest of the necessary attributes from the data row.
With the APPLY operator, the derived-table query still causes a seek operation on the index for each order, but that seek returns all the required attributes from the preceding order, so no additional clustered index seek is needed. As a result, this query ran for about 13 seconds on my test system and incurred 3,195,230 logical reads, about half the I/O required by the scalar subquery solution. (Note that to get the I/O information with the STATISTICS IO option, I ran the query without turning on the discard-results option. Then, to get the runtimes, I reran the query with results discarded.)
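If you want to reproduce the I/O and time measurements, you can wrap any of the solution queries like this:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the solution query here
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;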
ROW_NUMBER Solutions
SQL Server 2005's new ROW_NUMBER analytical ranking function is handy for solutions that rely on a certain order of the data. ROW_NUMBER lets you assign sequential row numbers to a set of rows and restart the numbering for each partition of rows by using the PARTITION BY clause. In our case, the current and preceding rows can be expressed as an offset of one between row numbers. Listing 3's code simply calculates row numbers for each customer's orders separately (PARTITION BY CustomerID), based on OrderDate, then SalesOrderID sorting.
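Listing 3 isn't reproduced here; the row-number calculation it describes follows this pattern (the OrdersCTE name comes from the text, but the exact code may differ):

WITH OrdersCTE AS
(
  SELECT CustomerID, OrderDate, SalesOrderID, TotalDue,
         -- Restart numbering for each customer, ordered by date, then order ID
         ROW_NUMBER() OVER(PARTITION BY CustomerID
                           ORDER BY OrderDate, SalesOrderID) AS RowNum
  FROM Sales.SalesOrderHeader
)
SELECT *
FROM OrdersCTE
ORDER BY CustomerID, RowNum;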
Table 3 shows Listing 3's results in abbreviated form. If you examine the row numbers, you can see how simple it is to match each customer's preceding row to the current row. You just need to join two instances of OrdersCTE (aliased as C and P), based on CustomerID match and an offset of one between the row numbers.
Listing 4 shows the solution that uses the row numbers generated in Listing 3. This solution's query ran for 12 seconds on my system. The upper plan in Figure 1 (shown in abbreviated form) shows that SQL Server scanned the index idx_cid_od_oid_i_td twice and used a hash join to join the two instances of OrdersCTE. The hash join tells you the optimizer chose to create a hash table to satisfy the join instead of using what would seem like a more efficient choice: a merge join based on two ordered scans of the covering index.
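As a sketch of the kind of query Listing 4 describes, joining two instances of OrdersCTE on a CustomerID match and a row-number offset of one (again, the calculation expressions are my assumptions):

WITH OrdersCTE AS
(
  SELECT CustomerID, OrderDate, SalesOrderID, TotalDue,
         ROW_NUMBER() OVER(PARTITION BY CustomerID
                           ORDER BY OrderDate, SalesOrderID) AS RowNum
  FROM Sales.SalesOrderHeader
)
SELECT C.CustomerID AS CustID,
       CONVERT(CHAR(8), C.OrderDate, 112) AS OrderDate,
       C.SalesOrderID AS OrderID, C.TotalDue,
       DATEDIFF(day, P.OrderDate, C.OrderDate) AS Elapsed,
       C.TotalDue - P.TotalDue AS Diff,
       (C.TotalDue - P.TotalDue) / NULLIF(P.TotalDue, 0) * 100.0 AS Pct
FROM OrdersCTE AS C
  LEFT OUTER JOIN OrdersCTE AS P
    -- Preceding row: same customer, row number exactly one less
    ON C.CustomerID = P.CustomerID
   AND C.RowNum = P.RowNum + 1;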
I suspect that the optimizer used the hash join because the code in Listing 4 didn't match current and preceding rows based on the row-number offset alone (which would have relied on the sorting of the covering index, lending itself to a merge join) but also matched rows based on CustomerID. To get a merge join, I revised the row number to be global across customers. That is, I calculated a row number with no PARTITION BY element, using only an ORDER BY element based on CustomerID, OrderDate, and SalesOrderID sorting. This lets the join condition match rows based solely on an offset of one between the row numbers. However, when the current customer ID differs from the preceding one, you need to return NULLs in the calculation columns that compare the current and preceding orders. You do this by using CASE expressions in the SELECT list.
Listing 5 shows this revised solution's query. The execution plan for this query, presented in abbreviated form at the bottom of Figure 1, shows the desired merge join. This query ran for only 6 seconds on my system and incurred a mere 7,466 logical reads.
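Again as a sketch rather than the original listing, the revised query with a global row number and CASE expressions might look like this:

WITH OrdersCTE AS
(
  SELECT CustomerID, OrderDate, SalesOrderID, TotalDue,
         -- Global row number: no PARTITION BY, sorted by customer, date, order ID
         ROW_NUMBER() OVER(ORDER BY CustomerID, OrderDate, SalesOrderID) AS RowNum
  FROM Sales.SalesOrderHeader
)
SELECT C.CustomerID AS CustID,
       CONVERT(CHAR(8), C.OrderDate, 112) AS OrderDate,
       C.SalesOrderID AS OrderID, C.TotalDue,
       -- Return NULLs when the "preceding" row belongs to a different customer
       CASE WHEN C.CustomerID = P.CustomerID
            THEN DATEDIFF(day, P.OrderDate, C.OrderDate) END AS Elapsed,
       CASE WHEN C.CustomerID = P.CustomerID
            THEN C.TotalDue - P.TotalDue END AS Diff,
       CASE WHEN C.CustomerID = P.CustomerID
            THEN (C.TotalDue - P.TotalDue) / NULLIF(P.TotalDue, 0) * 100.0 END AS Pct
FROM OrdersCTE AS C
  LEFT OUTER JOIN OrdersCTE AS P
    ON C.RowNum = P.RowNum + 1;

Because the single offset-based join condition lines up with the covering index's sort order, the optimizer can satisfy the join with the merge join described above.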
A Sweet Tune
In the past two articles, I presented six solutions to the problem of matching current and preceding rows. The slowest solution ran in about 2 minutes, and the fastest solution ran in about 6 seconds. I got from the slowest solution to the fastest by using a classic performance-tuning process. Besides using good index design, you can gain dramatic performance improvements by revising your queries to use the best solution for the task at hand. And as you saw in this case, SQL Server 2005's new APPLY and ROW_NUMBER features enable simpler and faster solutions than you can write in SQL Server 2000.