How To: Previous and Next with Condition
Window functions are the best thing that happened since sliced bread. And I’m not talking about just T-SQL features, but generally.
December 10, 2015
There’s a common T-SQL need that involves computing for each current row a value from a previous or next row. For this purpose, T-SQL has the LAG and LEAD window functions. Things get trickier, though, when you need to add a condition. For example, suppose you need to compute the last col1 value that was greater than x; or, based on col1 order, compute the last col2 value that was greater than x. In this article I’ll explain how you can achieve such tasks.
The Challenge
To demonstrate computing previous and next with a condition, I’ll use daily rainfall data in tables called Locations and Precipitation. The Locations table holds locations where rainfall data is collected, and the Precipitation table holds daily precipitation per location in millimeters. Use the code in Listing 1 to create the tables and fill them with a small set of sample data.
Listing 1: DDL and small set of sample data
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Precipitation', N'U') IS NOT NULL DROP TABLE dbo.Precipitation;IF OBJECT_ID(N'dbo.Locations', N'U') IS NOT NULL DROP TABLE dbo.Locations;GOCREATE TABLE dbo.Locations( locid INT NOT NULL CONSTRAINT PK_Locations PRIMARY KEY);CREATE TABLE dbo.Precipitation( locid INT NOT NULL, dt DATE NOT NULL, val INT NOT NULL, CONSTRAINT PK_Precipitation PRIMARY KEY(locid, dt, val), CONSTRAINT FK_Precipitation_Locations FOREIGN KEY(locid) REFERENCES dbo.Locations(locid));INSERT INTO dbo.Locations(locid) VALUES(1),(2);INSERT INTO dbo.Precipitation(locid, dt, val) VALUES (1, '20151214', 10), (1, '20151215', 0), (1, '20151216', 0), (1, '20151217', 100), (1, '20151218', 50), (1, '20151219', 20), (1, '20151220', 210), (1, '20151221', 20), (1, '20151222', 0), (1, '20151223', 0), (1, '20151224', 0), (1, '20151225', 40), (2, '20151214', 0), (2, '20151215', 140), (2, '20151216', 60), (2, '20151217', 40), (2, '20151218', 0), (2, '20151219', 20), (2, '20151220', 0), (2, '20151221', 0), (2, '20151222', 0), (2, '20151223', 45), (2, '20151224', 120), (2, '20151225', 130);
Suppose you get a task to compute, for each row in Precipitation, two things:
1. The number of days that passed since the last day the precipitation was more than 24 millimeters (not including today). Call the result column diffprev.
2. The number of days that will pass until the next day the precipitation will be more than 24 millimeters (not including today). Call the result column diffnext.
Table 1 has the desired result for the small set of sample data in Listing 1.
Table 1: Desired result
locid dt val diffprev diffnext----------- ---------- ----------- ----------- -----------1 2015-12-14 10 NULL 31 2015-12-15 0 NULL 21 2015-12-16 0 NULL 11 2015-12-17 100 NULL 11 2015-12-18 50 1 21 2015-12-19 20 1 11 2015-12-20 210 2 51 2015-12-21 20 1 41 2015-12-22 0 2 31 2015-12-23 0 3 21 2015-12-24 0 4 11 2015-12-25 40 5 NULL2 2015-12-14 0 NULL 12 2015-12-15 140 NULL 12 2015-12-16 60 1 12 2015-12-17 40 1 62 2015-12-18 0 1 52 2015-12-19 20 2 42 2015-12-20 0 3 32 2015-12-21 0 4 22 2015-12-22 0 5 12 2015-12-23 45 6 12 2015-12-24 120 1 12 2015-12-25 130 1 NULL
Try and figure out what’s the most efficient solution you can come up with for this challenge.
Use the small set of sample data from Listing 1 to check the validity of your solution. To test your solution’s performance, you’ll need much more data. For this purpose, use the code in Listing 2 to create a helper function called GetNums, which generates a sequence of integers in a requested range.
Listing 2: Helper function GetNums
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)), L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n FROM Nums ORDER BY rownum;GO
Use the code below to populate the tables with 10,000 locations with three years’ worth of daily measurements for each (about 10,000,000 measurements in total).
Large set of sample data
-- 10,000 locations x 3 years ~= 10,000,000 measurementsTRUNCATE TABLE dbo.Precipitation;DELETE FROM dbo.Locations;INSERT INTO dbo.Locations(locid) SELECT n FROM dbo.GetNums(1, 10000);INSERT INTO dbo.Precipitation(locid, dt, val) SELECT L.n AS locid, DATEADD(day, D.n, '20130101') AS dt, ABS(CHECKSUM(NEWID())) % 2 * ABS(CHECKSUM(NEWID())) % 100 AS val FROM dbo.GetNums(0, DATEDIFF(day, '20130101', '20151231')) AS D CROSS JOIN dbo.GetNums(1, 10000) AS L;
To work!
Solution with TOP
Probably the most obvious solution is to use a subquery with a TOP filter to get the desired previous or next value. For example, to get the previous date where the precipitation value was greater than 24 (call it prevdt), you’d use the following subquery (assuming you alias the outer instance of Precipitation P1):
( SELECT TOP (1) dt FROM dbo.Precipitation AS P2 WHERE P2.locid = P1.locid AND P2.dt < P1.dt AND val > 24 ORDER BY P2.dt DESC ) AS prevdt
Similarly, to get the next date where the precipitation value is greater than 24 (call it nextdt), you’d use the following subquery:
( SELECT TOP (1) dt FROM dbo.Precipitation AS P2 WHERE P2.locid = P1.locid AND P2.dt > P1.dt AND val > 24 ORDER BY P2.dt ) AS nextdt
Here the complete query computing both prevdt and nextdt:
SELECT locid, dt, val, ( SELECT TOP (1) dt FROM dbo.Precipitation AS P2 WHERE P2.locid = P1.locid AND P2.dt < P1.dt AND val > 24 ORDER BY P2.dt DESC ) AS prevdt, ( SELECT TOP (1) dt FROM dbo.Precipitation AS P2 WHERE P2.locid = P1.locid AND P2.dt > P1.dt AND val > 24 ORDER BY P2.dt ) AS nextdtFROM dbo.Precipitation AS P1/* ORDER BY locid, dt */ ; -- uncomment to present ordered
This query generates the output shown in Table 2 (presented sorted by location and date).
Table 2: Result of Step1
locid dt val prevdt nextdt----------- ---------- ----------- ---------- ----------1 2015-12-14 10 NULL 2015-12-171 2015-12-15 0 NULL 2015-12-171 2015-12-16 0 NULL 2015-12-171 2015-12-17 100 NULL 2015-12-181 2015-12-18 50 2015-12-17 2015-12-201 2015-12-19 20 2015-12-18 2015-12-201 2015-12-20 210 2015-12-18 2015-12-251 2015-12-21 20 2015-12-20 2015-12-251 2015-12-22 0 2015-12-20 2015-12-251 2015-12-23 0 2015-12-20 2015-12-251 2015-12-24 0 2015-12-20 2015-12-251 2015-12-25 40 2015-12-20 NULL2 2015-12-14 0 NULL 2015-12-152 2015-12-15 140 NULL 2015-12-162 2015-12-16 60 2015-12-15 2015-12-172 2015-12-17 40 2015-12-16 2015-12-232 2015-12-18 0 2015-12-17 2015-12-232 2015-12-19 20 2015-12-17 2015-12-232 2015-12-20 0 2015-12-17 2015-12-232 2015-12-21 0 2015-12-17 2015-12-232 2015-12-22 0 2015-12-17 2015-12-232 2015-12-23 45 2015-12-17 2015-12-242 2015-12-24 120 2015-12-23 2015-12-252 2015-12-25 130 2015-12-24 NULL
Remember that you need to compute the difference in terms of days between prevdt and dt (call it diffprev), and between dt and nextdt (call it diffnext). For this, you need to use the DATEDIFF function with inputs based on the current date and the results of the TOP subqueries. If you want to avoid providing the subqueries directly as inputs to the DATEDIFF function, you can use the CROSS APPLY operator and a derived table based on the VALUES clause to define the prevdt and nextdt aliases in the FROM clause. Since the FROM clause is logically evaluated before the SELECT clause, aliases that you define in the FROM clause are made available to computations in the SELECT clause. Here’s the completer solution query applying this approach:
SELECT locid, dt, val, DATEDIFF(day, prevdt, dt) AS diffprev, DATEDIFF(day, dt, nextdt) AS diffnextFROM dbo.Precipitation AS P1 CROSS APPLY ( VALUES( ( SELECT TOP (1) dt FROM dbo.Precipitation AS P2 WHERE P2.locid = P1.locid AND P2.dt < P1.dt AND val > 24 ORDER BY P2.dt DESC ), ( SELECT TOP (1) dt FROM dbo.Precipitation AS P2 WHERE P2.locid = P1.locid AND P2.dt > P1.dt AND val > 24 ORDER BY P2.dt ) ) ) AS A(prevdt, nextdt)/* ORDER BY locid, dt */ ; -- uncomment to present ordered
The query plan I got for this query is shown in Figure 1.
Figure 1: Plan for solution with TOP
Notice that for each row in Precipitation you get two index seek operations—one to compute prvdt and another to compute nextdt. This results in a very high number of logical reads. The execution of this query took 71 seconds to complete on my system, preforming 65,844,026 logical reads (reported by an Extended Events session). Note that this is the only query that I ran on my laptop. In reality, if this query runs in a system where lots of other queries are executed, there’re will be contention due to the excessive I/O operations it performs. It’s probably a good idea to look for a solution that performs fewer reads.
By the way: Often you only need to compute a previous or next value, but not both. When I ran this solution computing only diffprev, removing the computation of diffnext, it took 42 seconds to complete. Still, the number of reads is in the dozens of millions.
Solution with window functions
You can achieve the same task using window functions, resulting in a plan that performs significantly fewer reads. First, compute a value called gooddt representing the current date when the precipitation value is greater than 24 and NULL otherwise. Then, compute prevdt with the MAX window function applied to gooddt, with the frame ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING. Similarly, compute nextdt with the MIN window function applied to gooddt, with the frame ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING. The rest is the same as in the previous solution. Here’s the complete solution query:
WITH C AS( SELECT *, MAX(gooddt) OVER(PARTITION BY locid ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevdt, MIN(gooddt) OVER(PARTITION BY locid ORDER BY dt ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nextdt FROM dbo.Precipitation CROSS APPLY ( VALUES( CASE WHEN val > 24 THEN dt END ) ) AS A(gooddt))SELECT locid, dt, val, DATEDIFF(day, prevdt, dt) AS diffprev, DATEDIFF(day, dt, nextdt) AS diffnextFROM C/* ORDER BY locid, dt */ ; -- uncomment to present ordered
I got the plan shown in Figure 2 for this query.
Figure 2: Plan for solution with window functions
It took this query 51 seconds to complete on my system (and 33 seconds when computing only diffprev). The important thing, though, is that the number of reads dropped by three orders of magnitude to 64,769 (including for spilling the sort to tempdb). This means that this query will generate much less contention on I/O resources in an environment with multiple queries running concurrently.
You can achieve further optimization using Adam Machanic’s parallel CROSS APPLY trick where you query the Locations table, and with the CROSS APPLY operator, apply the solution’s logic to a single location. This trick tends to get better treatment of parallelism, and also breaks operations like sorting that scale in an extra linear fashion to multiple smaller operations that in total perform better.
To implement this technique first use the code in Listing 3 to encapsulate the solution’s logic for a single location in an inline table-valued function.
Listing 3: Encapsulate logic for single location in inline TVF
IF OBJECT_ID(N'dbo.GetDiff', N'IF') IS NOT NULL DROP FUNCTION dbo.GetDiff;GOCREATE FUNCTION dbo.GetDiff(@locid AS INT, @minprecip AS INT) RETURNS TABLEASRETURN WITH C AS ( SELECT dt, val, MAX(gooddt) OVER(ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevdt, MIN(gooddt) OVER(ORDER BY dt ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nextdt FROM dbo.Precipitation CROSS APPLY ( VALUES( CASE WHEN val > @minprecip THEN dt END ) ) AS A(gooddt) WHERE locid = @locid ) SELECT dt, val, DATEDIFF(day, prevdt, dt) AS diffprev, DATEDIFF(day, dt, nextdt) AS diffnext FROM C;GO
Use the following query to apply the function to each location:
SELECT L.locid, A.dt, A.val, A.diffprev, A.diffnextFROM dbo.Locations AS L CROSS APPLY dbo.GetDiff( L.locid, 24 ) AS A/* ORDER BY locid, dt */; -- uncomment to present ordered
I got the plan shown in Figure 3 for this query.
Figure 3: Plan for solution with APPLY
Notice also that the sort spill disappeared in this plan. It took this query 41 seconds to complete on my system (20 seconds when computing only diffprev, since there’s no sort required at all in such a case). This query performed 67,375 logical reads.
Return a different value than the ordering element
The last challenge involved returning the same value that is considered also the ordering element (the date in our example). But what if you need to return a different value than the ordering element? For example, what if the task was: Return the precipitation values of the previous and next days where the value was greater than 24? To achieve this, when you compute goodval, instead of holding just the date, have it hold a concatenated string made of the date plus the value, using the following expression which preserve correct ordering behavior:
CASE WHEN val > 24 THEN CONVERT(CHAR(8), dt, 112) + STR(val, 10) END
Use the MIN and MAX window functions like before (call the result columns this time prevgoodval and nextgoodval). Then, in the outer query, extract from each of the result columns the 10 rightmost characters, and convert them to integers. Here’s the complete solution query:
WITH C AS( SELECT *, MAX(goodval) OVER(PARTITION BY locid ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevgoodval, MIN(goodval) OVER(PARTITION BY locid ORDER BY dt ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nextgoodval FROM dbo.Precipitation CROSS APPLY ( VALUES( CASE WHEN val > 24 THEN CONVERT(CHAR(8), dt, 112) + STR(val, 10) END ) ) AS A(goodval))SELECT locid, dt, val, prevval, nextvalFROM C CROSS APPLY ( VALUES( CAST(RIGHT(prevgoodval, 10) AS INT), CAST(RIGHT(nextgoodval, 10) AS INT) ) ) AS A(prevval, nextval)/* ORDER BY locid, dt */ ; -- uncomment to present ordered
This query generates the output shown in Table 3 for the small set of sample data.
Table 3: Returning different value than ordering element
locid dt val prevval nextval----------- ---------- ----------- ----------- -----------1 2015-12-14 10 NULL 1001 2015-12-15 0 NULL 1001 2015-12-16 0 NULL 1001 2015-12-17 100 NULL 501 2015-12-18 50 100 2101 2015-12-19 20 50 2101 2015-12-20 210 50 401 2015-12-21 20 210 401 2015-12-22 0 210 401 2015-12-23 0 210 401 2015-12-24 0 210 401 2015-12-25 40 210 NULL2 2015-12-14 0 NULL 1402 2015-12-15 140 NULL 602 2015-12-16 60 140 402 2015-12-17 40 60 452 2015-12-18 0 40 452 2015-12-19 20 40 452 2015-12-20 0 40 452 2015-12-21 0 40 452 2015-12-22 0 40 452 2015-12-23 45 40 1202 2015-12-24 120 45 1302 2015-12-25 130 120 NULL
More Windowing Power
Window functions are the best thing that happened since sliced bread. And I’m not talking about just T-SQL features, but generally. Seriously--if you asked me which of the two I’d rather give up if I had to, I’d probably give up bread. It keeps amazing me what wide range of problems can be solved so elegantly and efficiently with window functions. There’s so much more in standard SQL involving window functions, like nested window functions, more powerful RANGE capabilities and other features. Hopefully, some folks from Microsoft are reading this and will continue the investment to add the significant functionality that is still missing.
About the Author
You May Also Like