New Solution to the Packing Intervals Problem
Packing intervals is a classic T-SQL problem that involves packing groups of intersecting intervals into their respective continuous intervals. I set a challenge to myself to try and find an elegant solution that can achieve the task by using only one supporting index and a single scan of the data, and I found one.
August 11, 2015
Packing intervals is a classic T-SQL problem that involves packing groups of intersecting intervals into their respective continuous intervals. I covered the problem and two solutions in the past in this article. The two older solutions are quite elegant and efficient but they do require you to create two supporting indexes and to perform two scans of the data. I set a challenge to myself to try and find an elegant solution that can achieve the task by using only one supporting index and a single scan of the data, and I found one. In this article I present the new solution and compare it to the two older ones.
Sample Data for the Packing Intervals Challenge
As sample data for the packing intervals challenge I’ll use a table called Accounts that holds account information, and a table called Sessions that holds session information against some service. Use the code in Listing 1 to create the Accounts and Sessions tables and fill them with small sets of sample data so that you will be able to verify the correctness of the solutions.
Listing 1: DDL and sample data (small set)
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID('dbo.Sessions') IS NOT NULL DROP TABLE dbo.Sessions;IF OBJECT_ID('dbo.Accounts') IS NOT NULL DROP TABLE dbo.Accounts;CREATE TABLE dbo.Accounts( actid INT NOT NULL, CONSTRAINT PK_Accounts PRIMARY KEY(actid));GOINSERT INTO dbo.Accounts(actid) VALUES(1), (2), (3);CREATE TABLE dbo.Sessions( sessionid INT NOT NULL IDENTITY(1, 1), actid INT NOT NULL, starttime DATETIME2(0) NOT NULL, endtime DATETIME2(0) NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(sessionid), CONSTRAINT CHK_endtime_gteq_starttime CHECK (endtime >= starttime));GOINSERT INTO dbo.Sessions(actid, starttime, endtime) VALUES (1, '20151231 08:00:00', '20151231 08:30:00'), (1, '20151231 08:30:00', '20151231 09:00:00'), (1, '20151231 09:00:00', '20151231 09:30:00'), (1, '20151231 10:00:00', '20151231 11:00:00'), (1, '20151231 10:30:00', '20151231 12:00:00'), (1, '20151231 11:30:00', '20151231 12:30:00'), (2, '20151231 08:00:00', '20151231 10:30:00'), (2, '20151231 08:30:00', '20151231 10:00:00'), (2, '20151231 09:00:00', '20151231 09:30:00'), (2, '20151231 11:00:00', '20151231 11:30:00'), (2, '20151231 11:32:00', '20151231 12:00:00'), (2, '20151231 12:04:00', '20151231 12:30:00'), (3, '20151231 08:00:00', '20151231 09:00:00'), (3, '20151231 08:00:00', '20151231 08:30:00'), (3, '20151231 08:30:00', '20151231 09:00:00'), (3, '20151231 09:30:00', '20151231 09:30:00');GO
As you can see, each session was active during a temporal interval with the start and end times stored in the starttime and endtime columns, respectively. Your task is to pack groups of intersecting intervals for each account. This means that for each group of intervals that intersect you are supposed to return one continuous interval; as the start and end times of the packed interval you are supposed to use the minimum start time in the group and the maximum end time in the group, respectively. For the sample data in Listing 1 the desired result is shown later in this article in Table 4.
To test the performance of the solutions you will need larger sets of sample data. Use the code in Listing 2 to fill the Accounts table with 5,000 accounts and the Sessions table with 200 sessions per account (one million sessions in total). Feel free to change the parameters to test the solutions with different numbers of rows as you see fit.
Listing 2: Large set of sample data
-- 1,000,000 intervalsDECLARE @num_accounts AS INT = 5000, @sessions_per_account AS INT = 200, @start_period AS DATETIME2(3) = '20160101', @end_period AS DATETIME2(3) = '20160201', @max_duration_in_seconds AS INT = 86400; TRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Accounts;INSERT INTO dbo.Accounts(actid) SELECT A.n AS actid FROM TSQLV3.dbo.GetNums(1, @num_accounts) AS A;WITH C AS( SELECT A.n AS actid, DATEADD(second, ABS(CHECKSUM(NEWID())) % (DATEDIFF(s, @start_period, @end_period) - @max_duration_in_seconds), @start_period) AS starttime FROM TSQLV3.dbo.GetNums(1, @num_accounts) AS A CROSS JOIN TSQLV3.dbo.GetNums(1, @sessions_per_account) AS I)INSERT INTO dbo.Sessions WITH (TABLOCK) (actid, starttime, endtime) SELECT actid, starttime, DATEADD(second, ABS(CHECKSUM(NEWID())) % (@max_duration_in_seconds + 1), starttime) AS endtime FROM C;GO
New Solution using Window Aggregates
In order for the new solution to perform well you will want to create a single supporting index that orders the data by the account ID, the session start time, the session end time, and finally the session ID as a tiebreaker (in case you have two sessions with the same account that started and ended at the same time). Use the following code to create the recommended index:
CREATE UNIQUE INDEX idx_start_end ON dbo.Sessions(actid, starttime, endtime, sessionid);
The following code implements the first step in the solution:
SELECT *, MAX(endtime) OVER(PARTITION BY actid ORDER BY starttime, endtime, sessionid ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prvendFROM dbo.Sessions;
For each session, the code uses a running maximum calculation that computes the maximum end time until the previous session by the same account (call the result prvend). In previous session I rely on the order: start time, end time and session ID. The output of step 1 for the small sets of sample data is shown in Table 1.
Table 1: Output of Step 1
---------- ------ -------------------- -------------------- --------------------1 1 2015-12-31 08:00:00 2015-12-31 08:30:00 NULL2 1 2015-12-31 08:30:00 2015-12-31 09:00:00 2015-12-31 08:30:003 1 2015-12-31 09:00:00 2015-12-31 09:30:00 2015-12-31 09:00:004 1 2015-12-31 10:00:00 2015-12-31 11:00:00 2015-12-31 09:30:005 1 2015-12-31 10:30:00 2015-12-31 12:00:00 2015-12-31 11:00:006 1 2015-12-31 11:30:00 2015-12-31 12:30:00 2015-12-31 12:00:007 2 2015-12-31 08:00:00 2015-12-31 10:30:00 NULL8 2 2015-12-31 08:30:00 2015-12-31 10:00:00 2015-12-31 10:30:009 2 2015-12-31 09:00:00 2015-12-31 09:30:00 2015-12-31 10:30:0010 2 2015-12-31 11:00:00 2015-12-31 11:30:00 2015-12-31 10:30:0011 2 2015-12-31 11:32:00 2015-12-31 12:00:00 2015-12-31 11:30:0012 2 2015-12-31 12:04:00 2015-12-31 12:30:00 2015-12-31 12:00:0014 3 2015-12-31 08:00:00 2015-12-31 08:30:00 NULL13 3 2015-12-31 08:00:00 2015-12-31 09:00:00 2015-12-31 08:30:0015 3 2015-12-31 08:30:00 2015-12-31 09:00:00 2015-12-31 09:00:0016 3 2015-12-31 09:30:00 2015-12-31 09:30:00 2015-12-31 09:00:00
Naturally the first session by an account doesn’t have a corresponding previous session, hence prvend for the first session is NULL.
The purpose of the second step in the solution is to figure out whether the current session starts a new packed interval or not. The principal part of the solution is implemented in this step, so make sure you understand the logic well. To know whether intervals I1 and I2 intersect two conditions must be true: I2.endtime >= I1.starttime AND I2.starttime <= I1.endtime. Clearly, in the sessions up to the one before the current, the last packed interval will have the prvend value as the end time. By virtue of the window order clause in the computation of prvend, by definition, the current session’s end time is already known to be greater than or equal to the last packed interval’s start time. That’s because the current session’s start time is greater than or equal to any previous session’s start time, and the current session’s end time is greater than or equal to the current session’s start time. In other words, one of the conditions for intersection between the current session and the last packed interval is always implied. So, what remains to check explicitly is only the other condition. Namely, if the current session’s start time is less than or equal to the prvend value it doesn’t start a new packed interval, otherwise it does. Based on this logic, the following code computes a flag called isstart that is set to NULL if the current session doesn’t start a new packed interval and to 1 if it does:
WITH C1 AS( SELECT *, MAX(endtime) OVER(PARTITION BY actid ORDER BY starttime, endtime, sessionid ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prvend FROM dbo.Sessions)SELECT *FROM C1 CROSS APPLY ( VALUES(CASE WHEN starttime <= prvend THEN NULL ELSE 1 END) ) AS A(isstart);
Table 2 has the output of the step 2.
Table 2: Output of step 2
sessionid actid starttime endtime prvend isstart---------- ------ -------------------- -------------------- -------------------- --------1 1 2015-12-31 08:00:00 2015-12-31 08:30:00 NULL 12 1 2015-12-31 08:30:00 2015-12-31 09:00:00 2015-12-31 08:30:00 NULL3 1 2015-12-31 09:00:00 2015-12-31 09:30:00 2015-12-31 09:00:00 NULL4 1 2015-12-31 10:00:00 2015-12-31 11:00:00 2015-12-31 09:30:00 15 1 2015-12-31 10:30:00 2015-12-31 12:00:00 2015-12-31 11:00:00 NULL6 1 2015-12-31 11:30:00 2015-12-31 12:30:00 2015-12-31 12:00:00 NULL7 2 2015-12-31 08:00:00 2015-12-31 10:30:00 NULL 18 2 2015-12-31 08:30:00 2015-12-31 10:00:00 2015-12-31 10:30:00 NULL9 2 2015-12-31 09:00:00 2015-12-31 09:30:00 2015-12-31 10:30:00 NULL10 2 2015-12-31 11:00:00 2015-12-31 11:30:00 2015-12-31 10:30:00 111 2 2015-12-31 11:32:00 2015-12-31 12:00:00 2015-12-31 11:30:00 112 2 2015-12-31 12:04:00 2015-12-31 12:30:00 2015-12-31 12:00:00 114 3 2015-12-31 08:00:00 2015-12-31 08:30:00 NULL 113 3 2015-12-31 08:00:00 2015-12-31 09:00:00 2015-12-31 08:30:00 NULL15 3 2015-12-31 08:30:00 2015-12-31 09:00:00 2015-12-31 09:00:00 NULL16 3 2015-12-31 09:30:00 2015-12-31 09:30:00 2015-12-31 09:00:00 1
The third step in the solution is to generate a packed interval group identifier (call it grp) by computing a simple running total of the isstart flag from the beginning of the partition and until the current row. Here’s the code that implements this step:
WITH C1 AS( SELECT *, MAX(endtime) OVER(PARTITION BY actid ORDER BY starttime, endtime, sessionid ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prvend FROM dbo.Sessions)SELECT *, SUM(isstart) OVER(PARTITION BY actid ORDER BY starttime, endtime, sessionid ROWS UNBOUNDED PRECEDING) AS grpFROM C1 CROSS APPLY ( VALUES(CASE WHEN starttime <= prvend THEN NULL ELSE 1 END) ) AS A(isstart);
Table 3 has the output of step 3.
Table 3: Output of step 3
---------- ------ -------------------- -------------------- -------------------- -------- ----1 1 2015-12-31 08:00:00 2015-12-31 08:30:00 NULL 1 12 1 2015-12-31 08:30:00 2015-12-31 09:00:00 2015-12-31 08:30:00 NULL 13 1 2015-12-31 09:00:00 2015-12-31 09:30:00 2015-12-31 09:00:00 NULL 14 1 2015-12-31 10:00:00 2015-12-31 11:00:00 2015-12-31 09:30:00 1 25 1 2015-12-31 10:30:00 2015-12-31 12:00:00 2015-12-31 11:00:00 NULL 26 1 2015-12-31 11:30:00 2015-12-31 12:30:00 2015-12-31 12:00:00 NULL 27 2 2015-12-31 08:00:00 2015-12-31 10:30:00 NULL 1 18 2 2015-12-31 08:30:00 2015-12-31 10:00:00 2015-12-31 10:30:00 NULL 19 2 2015-12-31 09:00:00 2015-12-31 09:30:00 2015-12-31 10:30:00 NULL 110 2 2015-12-31 11:00:00 2015-12-31 11:30:00 2015-12-31 10:30:00 1 211 2 2015-12-31 11:32:00 2015-12-31 12:00:00 2015-12-31 11:30:00 1 312 2 2015-12-31 12:04:00 2015-12-31 12:30:00 2015-12-31 12:00:00 1 414 3 2015-12-31 08:00:00 2015-12-31 08:30:00 NULL 1 113 3 2015-12-31 08:00:00 2015-12-31 09:00:00 2015-12-31 08:30:00 NULL 115 3 2015-12-31 08:30:00 2015-12-31 09:00:00 2015-12-31 09:00:00 NULL 116 3 2015-12-31 09:30:00 2015-12-31 09:30:00 2015-12-31 09:00:00 1 2
Finally, to produce the packed intervals, the fourth and last step groups the rows from the result of step 3 by the account ID and the group identifier, and returns the minimum start time as the start time of the packed interval and the maximum end time as the end time of the packed interval. The code in Listing 3 has the complete solution.
Listing 3: Complete new solution
WITH C1 AS( SELECT *, MAX(endtime) OVER(PARTITION BY actid ORDER BY starttime, endtime, sessionid ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prvend FROM dbo.Sessions),C2 AS( SELECT *, SUM(isstart) OVER(PARTITION BY actid ORDER BY starttime, endtime, sessionid ROWS UNBOUNDED PRECEDING) AS grp FROM C1 CROSS APPLY ( VALUES(CASE WHEN starttime <= prvend THEN NULL ELSE 1 END) ) AS A(isstart))SELECT actid, MIN(starttime) AS starttime, MAX(endtime) AS endtimeFROM C2GROUP BY actid, grp;
The output of the solution is shown in Table 4.
Table 4: Output of complete solution
actid starttime endtime------ -------------------- --------------------1 2015-12-31 08:00:00 2015-12-31 09:30:002 2015-12-31 08:00:00 2015-12-31 10:30:003 2015-12-31 08:00:00 2015-12-31 09:00:001 2015-12-31 10:00:00 2015-12-31 12:30:002 2015-12-31 11:00:00 2015-12-31 11:30:003 2015-12-31 09:30:00 2015-12-31 09:30:002 2015-12-31 11:32:00 2015-12-31 12:00:002 2015-12-31 12:04:00 2015-12-31 12:30:00
The execution plan for this solution is very efficient. Figure 1 has the plan that I got (using SQL Sentry’s plan explorer) for the large sets of sample data.
Figure 1 - Plan for new solution using window aggregates
Click to see larger image
As you can see, there’s only one ordered scan of the supporting index. Both window functions rely on index order rather than on explicit sorting. I got the following performance statistics for this query on my system: CPU time = 12,545 ms, elapsed time = 4,277 ms, logical reads = 3,300. Four seconds run time isn’t too bad for the amount of data involved.
Comparison with Old Solutions
I covered the old solutions for the interval packing problem in the following article. Here I’m just going to present them and their performance so that you can compare them with the new solution. Remember that the new solution requires only one index for optimal performance. The old solutions split start and end events to separate result rows and organize them in chronological order. Therefore, they require two supportive indexes for optimal performance: one for start events and another for end events. Use the following code to create the recommended indexes:
CREATE UNIQUE INDEX idx_start ON dbo.Sessions(actid, starttime, sessionid);CREATE UNIQUE INDEX idx_end ON dbo.Sessions(actid, endtime, sessionid);
Listing 4 has one of the old solutions (call it old solution 1). This solution is based on a window aggregate that computes a running total and row numbers.
Listing 4: Old solution using window aggregate and row numbers
WITH C1 AS( SELECT sessionid, actid, starttime AS ts, +1 AS type FROM dbo.Sessions UNION ALL SELECT sessionid, actid, endtime AS ts, -1 AS type FROM dbo.Sessions),C2 AS( SELECT *, SUM(type) OVER(PARTITION BY actid ORDER BY ts, type DESC ROWS UNBOUNDED PRECEDING) - CASE WHEN type = 1 THEN 1 ELSE 0 END AS cnt FROM C1),C3 AS( SELECT *, FLOOR((ROW_NUMBER() OVER(PARTITION BY actid ORDER BY ts) + 1) / 2) AS p FROM C2 WHERE cnt = 0)SELECT actid, MIN(ts) AS starttime, max(ts) AS endtimeFROM C3GROUP BY actid, p;
The plan for old solution 1 is shown in Figure 2.
Figure 2 - Plan for old solution using window aggregate and row numbers
Click to see larger image
Here are the performance statistics that I got for old solution 1: CPU time = 6,047 ms, elapsed time = 6,549 ms, logical reads = 4,974. The new solution uses fewer reads and runs faster, although it does use more net CPU time.
Listing 5 has another old solution (call it old solution 2). This solution is based only on row numbers.
Listing 5: Old solution using only row numbers
WITH C1 AS( SELECT sessionid, actid, starttime AS ts, +1 AS type, ROW_NUMBER() OVER(PARTITION BY actid ORDER BY starttime, sessionid) AS s, NULL AS e FROM dbo.Sessions UNION ALL SELECT sessionid, actid, endtime AS ts, -1 AS type, NULL AS s, ROW_NUMBER() OVER(PARTITION BY actid ORDER BY endtime, sessionid) AS e FROM dbo.Sessions),C2 AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY actid ORDER BY ts, type DESC, sessionid) AS se FROM C1),C3 AS( SELECT *, FLOOR((ROW_NUMBER() OVER(PARTITION BY actid ORDER BY ts) + 1) / 2) AS p FROM C2 CROSS APPLY ( VALUES(s - (se - s) - 1, (se - e) - e) ) AS A(cs, ce) WHERE cs = 0 OR ce = 0)SELECT actid, MIN(ts) AS starttime, MAX(ts) AS endtimeFROM C3GROUP BY actid, p;
The plan for old solution 2 is shown in Figure 3.
Figure 3 - Plan for old solution using only row numbers
Click to see larger image
Here are the performance statistics that I got for old solution 2: CPU time = 2,719 ms, elapsed time = 3,147 ms, logical reads = 4,974. The new solution runs a bit slower and requires more net CPU time, but performs fewer reads.
Conclusion
For a long time I’ve been looking for an elegant solution to the packing intervals problem that requires only one supportive index and that performs only a single pass over the data. Finally, I’ve found one. The comparison with the older solutions is interesting. With only one index versus two, the new solution has a lower negative impact on write performance and requires fewer reads. In terms of run time, it is faster than old solution 1 and a bit slower than old solution 2. This experience is proof that it’s a good idea to keep revisiting classic challenges and never consider yourself to have finished. Often there’s room for new solutions that improve some aspects of the old ones.
About the Author
You May Also Like