Packing Intervals with Priorities
In this article I provide my solution to a packing intervals problem posed by a reader, but as always, I do urge you to try solving it yourself before you look at mine. I should warn you: It took me a few days to figure out an efficient solution.
November 10, 2015
Packing intervals is a classic T-SQL task that involves packing groups of intervals that intersect into single continuous intervals. I covered the topic in detail in this SolidQ article. The specific example I used in that article involved a table called Users and a table called Sessions. The task was to identify, for each user, the continuous periods with active sessions. This meant that for each user you had to pack the intersecting session intervals into single continuous intervals. I provided two possible solutions—one based on the ROW_NUMBER function, which is supported in SQL Server 2005 and later, and another using a window aggregate with a frame, which is supported in SQL Server 2012 and later. I also explained how you can further optimize your solution by encapsulating the logic for a single group (user, in our example) in an inline table-valued function (inline TVF), and then use the CROSS APPLY operator against the Users table to apply the function to each user.
A reader named Ralf posted the following question concerning an interesting variation of the task:
“I have a similar problem, but each interval has priorities.
For example, consider the interval from User 2 from 08.00 – 10.30.
suppose that from
08.00 – 10.30 has priority 3
08.30 – 10.00 has priority 2
09.00 – 09.30 has priority 1
with prio 1 is the highest / most important.
If I consolidate this, the output should be:
08.00 – 08.30, prio 3
08.30 – 09.00, prio 2
09.00 – 09.30, prio 1
09.30 – 10.00, prio 2
10.00 – 13.30, prio 3
I tried a lot with no success. Do you have a tip?
Best regards
Ralf”
As it turned out, the task was pretty challenging, but also very interesting. In this article I provide my solution, but as always, I do urge you to try solving it yourself before you look at mine. I should warn you, though, that it took me a few days to figure out an efficient solution, with many hours of staring at a graphical depiction of the problem and the desired result, but it was so worth it!
Sample data and desired result
Use the code in Listing 1 to create the Users and Sessions tables with supporting indexes, and to fill them with a small set of sample data.
Listing 1: DDL and small set of sample data
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;IF OBJECT_ID(N'dbo.Users', N'U') IS NOT NULL DROP TABLE dbo.Users;GOCREATE TABLE dbo.Users( username VARCHAR(14) NOT NULL, CONSTRAINT PK_Users PRIMARY KEY(username));CREATE TABLE dbo.Sessions( id INT NOT NULL IDENTITY, username VARCHAR(14) NOT NULL, starttime DATETIME2(3) NOT NULL, endtime DATETIME2(3) NOT NULL, pty INT NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(id), CONSTRAINT CHK_endtime_gt_starttime CHECK (endtime > starttime), CONSTRAINT CHK_priority_range CHECK (pty BETWEEN 1 AND 31));-- Indexes to support solutionsCREATE UNIQUE INDEX idx_user_start ON dbo.Sessions(username, pty, starttime, id);CREATE UNIQUE INDEX idx_user_end ON dbo.Sessions(username, pty, endtime, id);-- Sample data (small)TRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Users;INSERT INTO dbo.Users(username) VALUES('User1'), ('User2'), ('User3');INSERT INTO dbo.Sessions(username, starttime, endtime, pty) VALUES ('User1', '20160101 08:00:00.000', '20160101 08:30:00.000', 1), ('User1', '20160101 08:05:00.000', '20160101 08:35:00.000', 2), ('User1', '20160101 08:00:00.000', '20160101 08:30:00.000', 3), ('User2', '20160101 08:00:00.000', '20160101 10:30:00.000', 3), ('User2', '20160101 08:30:00.000', '20160101 10:00:00.000', 2), ('User2', '20160101 09:00:00.000', '20160101 09:30:00.000', 1), ('User2', '20160101 11:00:00.000', '20160101 12:00:00.000', 3), ('User2', '20160101 11:30:00.000', '20160101 12:30:00.000', 2), ('User2', '20160101 11:40:00.000', '20160101 12:40:00.000', 3), ('User2', '20160101 12:00:00.000', '20160101 13:00:00.000', 2), ('User3', '20160101 08:00:00.000', '20160101 09:00:00.000', 1), ('User3', '20160101 08:00:00.000', '20160101 08:30:00.000', 2), ('User3', '20160101 08:30:00.000', '20160101 09:00:00.000', 3), ('User3', '20160101 09:30:00.000', '20160101 09:31:00.000', 1);
The intervals in the Sessions table are closed-open intervals, meaning that the starttime value is included and the endtime value is excluded. The formal representation of such intervals is: [starttime, endtime). A square bracket represents a closed delimiter (inclusive), and an angle bracket represents an open delimiter (exclusive).
Figure 1 has a graphical depiction of the input and desired output for User2 as an example.
The intervals in the Sessions table are closed-open intervals, meaning that the starttime value is included and the endtime value is excluded. The formal representation of such intervals is: [starttime, endtime). A square bracket represents a closed delimiter (inclusive), and an angle bracket represents an open delimiter (exclusive).
Figure 1 has a graphical depiction of the input and desired output for User2 as an example.
For now, ignore the bluish ptybitmap values; they are part of the solution and I’ll explain them later. As you can see in the middle part of the figure, you are supposed to pack groups of intersecting intervals with the same priority. For example, notice that the four input intervals that were active during the period 11:00 – 13:00 translate to two packed intervals. As you can see in the bottom part of the figure, if you have overlapping segments of packed intervals with different priorities you should return the segment with the highest priority (lowest integer). Just like in Ralf’s example, observe that the input contains three intervals during the period 8:00 – 10:30:
starttime endtime pty---------- -------- ----08:00 10:30 3 08:30 10:00 209:00 09:30 1These input intervals translate to the following output intervals:starttime endtime pty---------- -------- ----08:00 08:30 3 08:30 09:00 209:00 09:30 109:30 10:00 210:00 10:30 3
Table 1 has the complete desired result for the small set of sample data.
Table 1: Desired result for small set of sample datausername pty starttime endtime--------- ---- -------------------- --------------------User1 1 2016-01-01 08:00:00 2016-01-01 08:30:00User1 2 2016-01-01 08:30:00 2016-01-01 08:35:00User2 1 2016-01-01 09:00:00 2016-01-01 09:30:00User2 2 2016-01-01 08:30:00 2016-01-01 09:00:00User2 2 2016-01-01 09:30:00 2016-01-01 10:00:00User2 2 2016-01-01 11:30:00 2016-01-01 13:00:00User2 3 2016-01-01 08:00:00 2016-01-01 08:30:00User2 3 2016-01-01 10:00:00 2016-01-01 10:30:00User2 3 2016-01-01 11:00:00 2016-01-01 11:30:00User3 1 2016-01-01 08:00:00 2016-01-01 09:00:00User3 1 2016-01-01 09:30:00 2016-01-01 09:31:00
Once you come up with a working solution, you will want to verify its correctness by comparing your result to the desired one in Table 1. To test the efficiency of your solution you will need a larger set of sample data. Run the code in Listing 2 to create the helper function GetNums, and the code in Listing 3 to populate the tables with a large set of sample data.
Listing 2: Helper function GetNums
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)), L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n FROM Nums ORDER BY rownum;GO
Listing 3: Large set of sample data
-- 2,000 users, 5,000,000 intervalsDECLARE @num_users AS INT = 2000, @intervals_per_user AS INT = 2500, @start_period AS DATETIME2(3) = '20160101', @end_period AS DATETIME2(3) = '20160107', @max_duration_in_ms AS INT = 3600000; -- 60 nimutes TRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Users;INSERT INTO dbo.Users(username) SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username FROM dbo.GetNums(1, @num_users) AS U;WITH C AS( SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username, DATEADD(ms, ABS(CHECKSUM(NEWID())) % 86400000, DATEADD(day, ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @start_period, @end_period), @start_period)) AS starttime, ABS(CHECKSUM(NEWID())) % 3 + 1 AS pty FROM dbo.GetNums(1, @num_users) AS U CROSS JOIN dbo.GetNums(1, @intervals_per_user) AS I)INSERT INTO dbo.Sessions WITH (TABLOCK) (username, starttime, endtime, pty) SELECT username, starttime, DATEADD(ms, ABS(CHECKSUM(NEWID())) % @max_duration_in_ms + 1, starttime) AS endtime, pty FROM C;
You should have everything you need to start working on the challenge. Good luck!
The solution, Step 1
I’ll describe my solution in four steps. Listing 4 has the code that implements step 1 and Table 2 has the result of this step.
Listing 4: Step 1
WITH C1 AS( SELECT id, username, starttime AS ts, pty, +1 AS type, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY starttime, id) AS s, NULL AS e FROM dbo.Sessions UNION ALL SELECT id, username, endtime AS ts, pty, -1 AS type, NULL AS s, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY endtime, id) AS e FROM dbo.Sessions),C2 AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY ts, type DESC, id) AS se FROM C1)SELECT *, SUM(ptybitval) OVER(PARTITION BY username ORDER BY ts, ptybitval ROWS UNBOUNDED PRECEDING) AS ptybitmapFROM C2 CROSS APPLY ( VALUES( s - (se - s) - 1, (se - e) - e, type * POWER(2, pty - 1) ) ) AS A(cs, ce, ptybitval)WHERE cs = 0 OR ce = 0;
Table 2: Result of step 1
id username ts pty type s e se cs ce ptybitval ptybitmap--- --------- -------------------- ---- ----- ----- ----- --- ----- ----- ---------- ----------1 User1 2016-01-01 08:00:00 1 1 1 NULL 1 0 NULL 1 1 (001)3 User1 2016-01-01 08:00:00 3 1 1 NULL 1 0 NULL 4 5 (101)2 User1 2016-01-01 08:05:00 2 1 1 NULL 1 0 NULL 2 7 (111)3 User1 2016-01-01 08:30:00 3 -1 NULL 1 2 NULL 0 -4 3 (011)1 User1 2016-01-01 08:30:00 1 -1 NULL 1 2 NULL 0 -1 2 (010)2 User1 2016-01-01 08:35:00 2 -1 NULL 1 2 NULL 0 -2 0 (000)4 User2 2016-01-01 08:00:00 3 1 1 NULL 1 0 NULL 4 4 (100)5 User2 2016-01-01 08:30:00 2 1 1 NULL 1 0 NULL 2 6 (110)6 User2 2016-01-01 09:00:00 1 1 1 NULL 1 0 NULL 1 7 (111)6 User2 2016-01-01 09:30:00 1 -1 NULL 1 2 NULL 0 -1 6 (110)5 User2 2016-01-01 10:00:00 2 -1 NULL 1 2 NULL 0 -2 4 (100)4 User2 2016-01-01 10:30:00 3 -1 NULL 1 2 NULL 0 -4 0 (000)7 User2 2016-01-01 11:00:00 3 1 2 NULL 3 0 NULL 4 4 (100)8 User2 2016-01-01 11:30:00 2 1 2 NULL 3 0 NULL 2 6 (110)9 User2 2016-01-01 12:40:00 3 -1 NULL 3 6 NULL 0 -4 2 (010)10 User2 2016-01-01 13:00:00 2 -1 NULL 3 6 NULL 0 -2 0 (000)11 User3 2016-01-01 08:00:00 1 1 1 NULL 1 0 NULL 1 1 (001)12 User3 2016-01-01 08:00:00 2 1 1 NULL 1 0 NULL 2 3 (011)12 User3 2016-01-01 08:30:00 2 -1 NULL 1 2 NULL 0 -2 1 (001)13 User3 2016-01-01 08:30:00 3 1 1 NULL 1 0 NULL 4 5 (101)13 User3 2016-01-01 09:00:00 3 -1 NULL 1 2 NULL 0 -4 1 (001)11 User3 2016-01-01 09:00:00 1 -1 NULL 1 2 NULL 0 -1 0 (000)14 User3 2016-01-01 09:30:00 1 1 2 NULL 3 0 NULL 1 1 (001)14 User3 2016-01-01 09:31:00 1 -1 NULL 2 4 NULL 0 -1 0 (000)
Besides the expressions that compute ptybitval and ptybitmap, which I’ll get to shortly, the rest of the code in this step implements the packing technique using the ROW_NUMBER function, which I described in Packing Intervals. If you’re not familiar with this packing technique, it would be best to read about it first before you continue.
This step packs intersecting intervals with the same user and priority, returning the start and end points of those packed intervals in separate rows. You can identify the output of this step for User2 graphically in the middle section in Figure 1.
As a reminder, the code in the CTE C1 returns start and end events of the input intervals in separate rows. The code marks start events with a the value +1 as the event type (column named type), meaning that this event increases the count of active intervals, and end events with -1, meaning that this event decreases the count. The code uses row numbers to compute a start event counter (column called s) and an end event counter (column called e). These counters count how many events of that type happened until that current event. NULLs are used as place holders in s for end events and in e for start events. The code in C1 unifies start and end events to generate one sequence of events.
The code in the CTE C2 uses row numbers to compute a combined event counter in chronological order (column called se). Think of se as an indication of how many events of both kinds (start and end) happened until the current event.
The outer query uses the CROSS APPLY operator to compute a number of result columns. The result column cs represents the count of active intervals right before the current start event, and the column ce represents the count of active intervals right after the current end event. The column ce is computed with the expression: s - (se - s) - 1. The logic behind this expression is that the combined event counter (se) minus the start event counter (s) tells you how many intervals ended before the current start event. If you subtract this value from how many intervals started until the current start event, you get how many intervals are active at this point, inclusive of the current start event. To know how many intervals were active just before the current start event you simply subtract 1. In a very similar way, to compute ce you use the expression: (se - e) - e.
The outer query filters only the events that mark either a start of a packed interval (where cs is 0) or an end of a packed interval (where ce is 0).
The APPLY operator computes another column called ptybitval that converts the priority to a bit value in an integer bitmap. SQL Server uses the two’s complement representation for integers, where each bit in a positive integer (other than the leftmost bit) represent the bit value: POWER(2, - 1). So, with the expression POWER(2, pty - 1) you convert the priority to a corresponding bit value. You than multiply that value by the event type to get a positive bit value for start events and a negative value for end events.
The outer query in the SELECT list then computes a column called ptybitmap as the running total of the bit values of the events per user in chronological order. So at any given point, the bits that are turned on within the bitmap mark the active priorities. A start event turns the bit representing the event’s priority on and an end event turns it off. Since the solution packs the intervals with the same user and priority before applying this running total calculation, after packing, you can’t have two events of the same kind at the same time. So the running total calculation acts as an aggregate bitwise or operation. The rightmost set bit in the bitmap represents the highest priority (remember, lowest number) among the active ones. We’ll get to how to isolate this bit in the next step. You realize that this bitmap-based scheme means that your solution is limited to as many priorities as the number of available bits in the datatype that you’re using (minus 1 for the unusable leftmost bit). In my table I use the INT type and therefore I’m limited to 31 priorities. If you need more, using BIGINT you can support up to 63 priorities.
Step 2
The code implementing step 2 is shown in Listing 5 and the output of this step is shown in Table 3.
Listing 5: Step 2
WITH C1 AS( SELECT id, username, starttime AS ts, pty, +1 AS type, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY starttime, id) AS s, NULL AS e FROM dbo.Sessions UNION ALL SELECT id, username, endtime AS ts, pty, -1 AS type, NULL AS s, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY endtime, id) AS e FROM dbo.Sessions),C2 AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY ts, type DESC, id) AS se FROM C1),C3 AS( SELECT *, SUM(ptybitval) OVER(PARTITION BY username ORDER BY ts, ptybitval ROWS UNBOUNDED PRECEDING) AS ptybitmap FROM C2 CROSS APPLY ( VALUES( s - (se - s) - 1, (se - e) - e, type * POWER(2, pty - 1) ) ) AS A(cs, ce, ptybitval) WHERE cs = 0 OR ce = 0)SELECT id, username, ts, pty, type, ptybitval, ptybitmap, rsb, prvrsb, startpty, endptyFROM C3 CROSS APPLY ( VALUES( ptybitmap & -ptybitmap, (ptybitmap - ptybitval) & -(ptybitmap - ptybitval) ) ) AS A1(rsb, prvrsb) CROSS APPLY ( VALUES( CASE WHEN type = 1 AND rsb = ptybitval THEN pty WHEN type = -1 AND rsb > -ptybitval THEN LOG(rsb, 2) + 1 END, CASE WHEN type = -1 AND (rsb = 0 OR rsb > -ptybitval) THEN pty WHEN type = 1 AND prvrsb > ptybitval THEN LOG(prvrsb, 2) + 1 END ) ) AS A2(startpty, endpty);
Table 3: Result of Step 2
id username ts pty type ptybitval ptybitmap rsb prvrsb startpty endpty--- --------- -------------------- ---- ----- ---------- ---------- ---- ------- --------- -------1 User1 2016-01-01 08:00:00 1 1 1 1 1 0 1 NULL3 User1 2016-01-01 08:00:00 3 1 4 5 1 1 NULL NULL2 User1 2016-01-01 08:05:00 2 1 2 7 1 1 NULL NULL3 User1 2016-01-01 08:30:00 3 -1 -4 3 1 1 NULL NULL1 User1 2016-01-01 08:30:00 1 -1 -1 2 2 1 2 12 User1 2016-01-01 08:35:00 2 -1 -2 0 0 2 NULL 24 User2 2016-01-01 08:00:00 3 1 4 4 4 0 3 NULL5 User2 2016-01-01 08:30:00 2 1 2 6 2 4 2 36 User2 2016-01-01 09:00:00 1 1 1 7 1 2 1 26 User2 2016-01-01 09:30:00 1 -1 -1 6 2 1 2 15 User2 2016-01-01 10:00:00 2 -1 -2 4 4 2 3 24 User2 2016-01-01 10:30:00 3 -1 -4 0 0 4 NULL 37 User2 2016-01-01 11:00:00 3 1 4 4 4 0 3 NULL8 User2 2016-01-01 11:30:00 2 1 2 6 2 4 2 39 User2 2016-01-01 12:40:00 3 -1 -4 2 2 2 NULL NULL10 User2 2016-01-01 13:00:00 2 -1 -2 0 0 2 NULL 211 User3 2016-01-01 08:00:00 1 1 1 1 1 0 1 NULL12 User3 2016-01-01 08:00:00 2 1 2 3 1 1 NULL NULL12 User3 2016-01-01 08:30:00 2 -1 -2 1 1 1 NULL NULL13 User3 2016-01-01 08:30:00 3 1 4 5 1 1 NULL NULL13 User3 2016-01-01 09:00:00 3 -1 -4 1 1 1 NULL NULL11 User3 2016-01-01 09:00:00 1 -1 -1 0 0 1 NULL 114 User3 2016-01-01 09:30:00 1 1 1 1 1 0 1 NULL14 User3 2016-01-01 09:31:00 1 -1 -1 0 0 1 NULL 1
Note to reader: the next part requires a strong coffee and a geeky mood.
The code defines a CTE called C3 based on the outer query from step 1. The outer query in step 2 is issued against C3 and uses two CROSS APPLY operators to compute a number of result columns. The first CROSS APPLY operator computes the pair of columns rsb (for rightmost set bit) and prvrsb (for previous rightmost set bit). As mentioned, the rightmost set bit represents the highest active priority (lowest number). With the two’s complement representation of an integer N, the rightmost set bit is computed as N & -N, where & is T-SQL’s bitwise AND operator. The column rsb is then computed as: ptybitmap & -ptybitmap. To know what the bitmap was before the current event was applied you simply subtract ptybitval from ptybitmap. So prvrsb is computed as: (ptybitmap - ptybitval) & -(ptybitmap - ptybitval).
The second APPLY operator computes two additional result columns called startpty and endpty. Those are based on rsb and prvrsb and therefore require a separate CROSS APPLY operator to compute them.
The startpty column:
If the current start or end event starts a new result interval, the startpty column returns the priority of that interval, otherwise a NULL.
One condition that makes the current event a start of a result interval is when the event is a start event (type = 1), and the highest active priority is the current event’s priority (rsb = ptybitval). In such a case, startpty is set to pty (the current event’s priority). Examine Figure 1 and Table 3 and see if you can identify cases that fall under this category. One such example is the start event of the interval for User2 at 8:00 with priority 3. Observe that this start event is both a start of a packed interval and a start of an output interval.
Another condition that makes the current event a start of a result interval is when the event is an end event (type = -1), and the highest remaining active priority is lower than the priority of the interval that has just ended (number is greater: rsb > -ptybitval). In such a case, startpty is set to the priority that rsb represents (computed as: LOG(rsb, 2) + 1). For example, at 9:30 the packed interval for User2 with priority 1 ends. This marks the start of an output interval with priority 2.
The endpty column:
If the current start or end event ends a result interval, the endpty column returns the priority of that interval, otherwise a NULL.
One condition that makes the current event an end of a result interval is when the event is an end event (type = -1), and either there are no remaining active intervals (rsb = 0) or the highest remaining active priority is lower than the priority of the interval that has just ended (number is greater: rsb > -ptybitval). In such a case, endpty is set to pty (the current event’s priority). Examples for such events are the end event for User2 at 10:30 of the interval with priority 3 and the end event for User2 at 9:30 of the interval with priority 1.
Another condition that makes the current event an end of a result interval is when the event is a start event (type = 1), and the highest active priority before the current event was applied is lower than the current event’s priority (number is greater: prvrsb > ptybitval). For example, at 9;00 the packed interval for User2 with priority 1 starts. This marks the end of an output interval with priority 2.
If you’ve made it this far and understood everything in this section, you can give yourself a pat on the shoulder, and perhaps switch from coffee to Lagavulin now.
Step 3
Step 3 is pretty simple. Looking at the result of step 2 in Table 3 you need to do things:
1. Unpivot the startpty and endpty values into separate rows.
2. Remove all rows where startpty and endpty are NULL.
You can achieve both things by applying the UNPIVOT operator to the result of the previous step, like so:
UNPIVOT (newpty FOR newtype IN (startpty, endpty)) AS U
You then need to create a result interval identifier (call it grp) for each consecutive pair of events with the same user and priority. For this, you use a row number (call it rn) in the following formula: (rn - 1) / 2 + 1. Here’s the T-SQL expression:
(ROW_NUMBER() OVER(PARTITION BY username, newpty ORDER BY ts) - 1) / 2 + 1 AS grp
Listing 6 has the complete code implementing step 3 and Table 4 has the output of this step.
Listing 6: Step 3
WITH C1 AS( SELECT id, username, starttime AS ts, pty, +1 AS type, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY starttime, id) AS s, NULL AS e FROM dbo.Sessions UNION ALL SELECT id, username, endtime AS ts, pty, -1 AS type, NULL AS s, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY endtime, id) AS e FROM dbo.Sessions),C2 AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY ts, type DESC, id) AS se FROM C1),C3 AS( SELECT *, SUM(ptybitval) OVER(PARTITION BY username ORDER BY ts, ptybitval ROWS UNBOUNDED PRECEDING) AS ptybitmap FROM C2 CROSS APPLY ( VALUES( s - (se - s) - 1, (se - e) - e, type * POWER(2, pty - 1) ) ) AS A(cs, ce, ptybitval) WHERE cs = 0 OR ce = 0)SELECT id, username, ts, newpty, newtype, (ROW_NUMBER() OVER(PARTITION BY username, newpty ORDER BY ts) - 1) / 2 + 1 AS grpFROM C3 CROSS APPLY ( VALUES( ptybitmap & -ptybitmap, (ptybitmap - ptybitval) & -(ptybitmap - ptybitval) ) ) AS A1(rsb, prvrsb) CROSS APPLY ( VALUES( CASE WHEN type = 1 AND rsb = ptybitval THEN pty WHEN type = -1 AND rsb > -ptybitval THEN LOG(rsb, 2) + 1 END, CASE WHEN type = -1 AND (rsb = 0 OR rsb > -ptybitval) THEN pty WHEN type = 1 AND prvrsb > ptybitval THEN LOG(prvrsb, 2) + 1 END ) ) AS A2(startpty, endpty) UNPIVOT (newpty FOR newtype IN (startpty, endpty)) AS U;
Table 4: Result of Step 3
Table 4: Result of Step 3id username ts newpty newtype grp--- --------- -------------------- ------- --------- ----1 User1 2016-01-01 08:00:00 1 startpty 11 User1 2016-01-01 08:30:00 1 endpty 11 User1 2016-01-01 08:30:00 2 startpty 12 User1 2016-01-01 08:35:00 2 endpty 16 User2 2016-01-01 09:00:00 1 startpty 16 User2 2016-01-01 09:30:00 1 endpty 15 User2 2016-01-01 08:30:00 2 startpty 16 User2 2016-01-01 09:00:00 2 endpty 16 User2 2016-01-01 09:30:00 2 startpty 25 User2 2016-01-01 10:00:00 2 endpty 28 User2 2016-01-01 11:30:00 2 startpty 310 User2 2016-01-01 13:00:00 2 endpty 34 User2 2016-01-01 08:00:00 3 startpty 15 User2 2016-01-01 08:30:00 3 endpty 15 User2 2016-01-01 10:00:00 3 startpty 24 User2 2016-01-01 10:30:00 3 endpty 27 User2 2016-01-01 11:00:00 3 startpty 38 User2 2016-01-01 11:30:00 3 endpty 311 User3 2016-01-01 08:00:00 1 startpty 111 User3 2016-01-01 09:00:00 1 endpty 114 User3 2016-01-01 09:30:00 1 startpty 214 User3 2016-01-01 09:31:00 1 endpty 2
Step 4 and Complete Solution
The fourth and last step defines a CTE called C4 (aptly named) based on the outer query from the previous step (aka, the explosive query). The final outer query groups the rows from C4 by username, newpty and grp, and for each group returns the minimum and maximum event times as the start and end times of the result intervals. Here’s the outer query:
SELECT username, newpty AS pty, MIN(ts) AS starttime, MAX(ts) AS endtimeFROM C4GROUP BY username, newpty, grp;
Listing 7 has the complete solution code. I presented the output earlier in Table 1.
Listing 7: Complete solution
WITH C1 AS( SELECT id, username, starttime AS ts, pty, +1 AS type, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY starttime, id) AS s, NULL AS e FROM dbo.Sessions UNION ALL SELECT id, username, endtime AS ts, pty, -1 AS type, NULL AS s, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY endtime, id) AS e FROM dbo.Sessions),C2 AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY ts, type DESC, id) AS se FROM C1),C3 AS( SELECT *, SUM(ptybitval) OVER(PARTITION BY username ORDER BY ts, ptybitval ROWS UNBOUNDED PRECEDING) AS ptybitmap FROM C2 CROSS APPLY ( VALUES( s - (se - s) - 1, (se - e) - e, type * POWER(2, pty - 1) ) ) AS A(cs, ce, ptybitval) WHERE cs = 0 OR ce = 0),C4 AS( SELECT *, (ROW_NUMBER() OVER(PARTITION BY username, newpty ORDER BY ts) - 1) / 2 + 1 AS grp FROM C3 CROSS APPLY ( VALUES( ptybitmap & -ptybitmap, (ptybitmap - ptybitval) & -(ptybitmap - ptybitval) ) ) AS A1(rsb, prvrsb) CROSS APPLY ( VALUES( CASE WHEN type = 1 AND rsb = ptybitval THEN pty WHEN type = -1 AND rsb > -ptybitval THEN LOG(rsb, 2) + 1 END, CASE WHEN type = -1 AND (rsb = 0 OR rsb > -ptybitval) THEN pty WHEN type = 1 AND prvrsb > ptybitval THEN LOG(prvrsb, 2) + 1 END ) ) AS A2(startpty, endpty) UNPIVOT (newpty FOR newtype IN (startpty, endpty)) AS U)SELECT username, newpty AS pty, MIN(ts) AS starttime, MAX(ts) AS endtimeFROM C4GROUP BY username, newpty, grp;
When running this solution against the large set of sample data with the 5,000,000 rows, it took 45 seconds to complete on my 5-year old laptop. That’s not too bad, but there’s still room for improvement.
Optimizing the solution with APPLY
Even though the code in Listing 1 creates supporting indexes for our solution, since there are a few window functions with different partitioning and ordering specifications, you can’t escape a couple of explicit sort operators in the plan. A sort operator scales in an extra linear manner (N Log N). Therefore, when there’s a partitioning element involved in the window functions, you can often achieve better performance by using a technique initially introduced by Adam Machanic. You use a CROSS APPLY operator against the table holding the partitions (Users in our case) and apply a query (or an inline TVF encapsulating the query) per partition. This way you split one large operation to multiple smaller operations and in the aforementioned scenario get better performance. The code in Listing 8 encapsulates our solution for a single user in an inline TVF called PackedIntervals.
Listing 8: Encapsulate logic for a single user in an inline TVF
IF OBJECT_ID(N'dbo.PackedIntervals', N'IF') IS NOT NULL DROP FUNCTION dbo.PackedIntervals;GOCREATE FUNCTION dbo.PackedIntervals(@username AS VARCHAR(14)) RETURNS TABLEASRETURN WITH C1 AS ( SELECT id, starttime AS ts, pty, +1 AS type, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY starttime, id) AS s, NULL AS e FROM dbo.Sessions WHERE username = @usernameUNION ALLSELECT id, endtime AS ts, pty, -1 AS type, NULL AS s, ROW_NUMBER() OVER(PARTITION BY username, pty ORDER BY endtime, id) AS e FROM dbo.Sessions WHERE username = @username ), C2 AS ( SELECT *, ROW_NUMBER() OVER(PARTITION BY pty ORDER BY ts, type DESC, id) AS se FROM C1 ), C3 AS ( SELECT *, SUM(ptybitval) OVER(ORDER BY ts, ptybitval ROWS UNBOUNDED PRECEDING) AS ptybitmap FROM C2 CROSS APPLY ( VALUES( s - (se - s) - 1, (se - e) - e, type * POWER(2, pty - 1) ) ) AS A(cs, ce, ptybitval) WHERE cs = 0 OR ce = 0 ), C4 AS ( SELECT *, (ROW_NUMBER() OVER(PARTITION BY newpty ORDER BY ts) - 1) / 2 + 1 AS grp FROM C3 CROSS APPLY ( VALUES( ptybitmap & -ptybitmap, (ptybitmap - ptybitval) & -(ptybitmap - ptybitval))) AS A1(rsb, prvrsb) CROSS APPLY ( VALUES( CASE WHEN type = 1 AND rsb = ptybitval THEN pty WHEN type = -1 AND rsb > -ptybitval THEN LOG(rsb, 2) + 1 END, CASE WHEN type = -1 AND (rsb = 0 OR rsb > -ptybitval) THEN pty WHEN type = 1 AND prvrsb > ptybitval THEN LOG(prvrsb, 2) + 1 END ) ) AS A2(startpty, endpty) UNPIVOT (newpty FOR newtype IN (startpty, endpty)) AS U ) SELECT newpty AS pty, MIN(ts) AS starttime, MAX(ts) AS endtime FROM C4 GROUP BY newpty, grp;GO
You then use the following query to apply the function to each user, like so:
SELECT U.username, S.pty, S.starttime, S.endtimeFROM dbo.Users AS U CROSS APPLY dbo.PackedIntervals(U.username) AS SOPTION(QUERYTRACEON 8649);
Note that on my laptop without the (undocumented and unsupported) trace flag 8649 SQL Server generates a serial plan that takes only 15 seconds to complete. That’s just a third of the run time of the original solution and already very fast. The APPLY method works even better when parallelized, so, for illustration purposes I forced a parallel plan with the trace flag. With parallelism, this query completed in 8 seconds on my laptop, producing the plan shown in Figure 2.
Conclusion
In this article I covered a variation of the classic interval packing problem where intervals have priorities and in overlapping cases you’re supposed to return the segment with the highest priority. I’d like to thank Ralf for sending this challenge. The solution I used relied mainly on window functions, including row numbers and running totals calculations. The handling of the priorities was achieved by using a bitmap representation of the priorities. Finally, using Adam Machanic’s CROSS APPLY method I further improved the solution by splitting one large task that scales extra linearly to multiple smaller tasks.
About the Author
You May Also Like