Intervals and Counts, Part 1
Identify discrete intervals and counts of overlapping source intervals
July 16, 2013
When using SQL Server, you frequently need to work with data that represents intervals of time. For example, consider intervals representing sessions, contracts, projects, and so on. I covered a few aspects of manipulation of intervals in the past, but there are still many interesting challenges to cover. Tasks related to interval manipulation are typically quite intriguing, especially because coming up with efficient solutions isn't easy.
I recently had the chance to work on a few challenges involving intervals and counts. This article is the first in a series covering those challenges and their solutions.
Related: Solving Gaps and Islands with Enhanced Window Functions
I'd like to thank Adam Machanic, who sent me the challenge that's the focus of this article. I'd also like to thank Ben Flanaghan, who created one of the fundamental techniques that I rely on in one of the solutions that I present in this article.
Sample Data
I'll use the same sample data for all of the challenges that I'll discuss in this series. The data involves a table called Sessions that holds sessions against different applications. The column app represents the application (call it the partitioning element), and the columns starttime and endtime represent the start and end points of the session's interval, respectively. The starttime and endtime columns use the DATETIME2(0) type, which has a precision of one second.
Related: Interval Queries in SQL Server
Run the code in Listing 1 to create the Sessions table and two supporting indexes, which will be used by the different solutions that I'll present in the series.
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;IF OBJECT_ID(N'dbo.Apps', N'U') IS NOT NULL DROP TABLE dbo.Apps;CREATE TABLE dbo.Apps( app VARCHAR(10) NOT NULL, CONSTRAINT PK_Apps PRIMARY KEY(app));CREATE TABLE dbo.Sessions( keycol INT NOT NULL, app VARCHAR(10) NOT NULL, starttime DATETIME2(0) NOT NULL, endtime DATETIME2(0) NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(keycol), CONSTRAINT CHK_Sessios_et_st CHECK(endtime > starttime));CREATE UNIQUE INDEX idx_start ON dbo.Sessions(app, starttime, keycol);CREATE UNIQUE INDEX idx_end ON dbo.Sessions(app, endtime, keycol);
The index called idx_start is defined on the key list (app, starttime, keycol), and the index called idx_end is defined on the key list (app, endtime, keycol).
Use the code in Listing 2 to populate the Sessions table with a small set of sample data.
TRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Apps;INSERT INTO dbo.Apps(app) VALUES('app1'),('app2'),('app3');INSERT INTO dbo.Sessions(keycol, app, starttime, endtime) VALUES (2, 'app1', '20130212 08:30:00', '20130212 10:30:00'), (3, 'app1', '20130212 08:30:00', '20130212 08:45:00'), (5, 'app1', '20130212 09:00:00', '20130212 09:30:00'), (7, 'app1', '20130212 09:15:00', '20130212 10:30:00'), (11, 'app1', '20130212 09:15:00', '20130212 09:30:00'), (13, 'app1', '20130212 10:30:00', '20130212 14:30:00'), (17, 'app1', '20130212 10:45:00', '20130212 11:30:00'), (19, 'app1', '20130212 11:00:00', '20130212 12:30:00'), (23, 'app2', '20130212 08:30:00', '20130212 08:45:00'), (29, 'app2', '20130212 09:00:00', '20130212 09:30:00'), (31, 'app2', '20130212 11:45:00', '20130212 12:00:00'), (37, 'app2', '20130212 12:30:00', '20130212 14:00:00'), (41, 'app2', '20130212 12:45:00', '20130212 13:30:00'), (43, 'app2', '20130212 13:00:00', '20130212 14:00:00'), (47, 'app2', '20130212 14:00:00', '20130212 16:30:00'), (53, 'app2', '20130212 15:30:00', '20130212 17:00:00'), (61, 'app3', '20130212 08:00:00', '20130212 08:30:00'), (62, 'app3', '20130212 08:00:00', '20130212 09:00:00'), (63, 'app3', '20130212 09:00:00', '20130212 09:30:00'), (64, 'app3', '20130212 09:30:00', '20130212 10:00:00');
You can use the sample data from Listing 2 to verify the correctness of your solutions. Figure 1 provides a graphic representation of the small set of sample data.
Figure 1: Intervals
To generate a large set of sample data, you'll need the helper function GetNums, which accepts input parameters called @low and @high and returns a sequence of integers in the requested range. Use the code in Listing 3 to create the GetNums function.
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)), L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n FROM Nums ORDER BY rownum;GO
Use the code in Listing 4 to populate the Sessions table with a large set of sample data.
TRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Apps;DECLARE @numrows AS INT = 2000000, -- total number of rows @numapps AS INT = 100; -- number of applicationsINSERT INTO dbo.Apps WITH(TABLOCK) (app) SELECT 'app' + CAST(n AS VARCHAR(10)) AS app FROM dbo.GetNums(1, @numapps) AS Nums;INSERT INTO dbo.Sessions WITH(TABLOCK) (keycol, app, starttime, endtime) SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol, D.*, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (20*60), starttime) AS endtime FROM ( SELECT 'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % @numapps AS VARCHAR(10)) AS app, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60), '20130101') AS starttime FROM dbo.GetNums(1, @numrows) AS Nums ) AS D;
You can control the number of applications and the total number of rows in the sample data by setting the variables @numapps and @numrows, respectively. The code in Listing 4 fills the Apps table with 100 applications and the Sessions table with 2,000,000 rows.
The Challenge: Computing Counts During Discrete Intervals
The challenge that's the focus of this article is to identify the discrete intervals that exist in the data and determine how many source intervals each discrete interval overlaps with. Figure 2 provides a graphic representation of the desired result against the small set of sample data.
Figure 2: Counts During Discrete Intervals
Figure 3 shows the desired output of your query.
app starttime endtime cnt----- -------------------- -------------------- ----app1 2013-02-12 08:30:00 2013-02-12 08:45:00 2app1 2013-02-12 08:45:00 2013-02-12 09:00:00 1app1 2013-02-12 09:00:00 2013-02-12 09:15:00 2app1 2013-02-12 09:15:00 2013-02-12 09:30:00 4app1 2013-02-12 09:30:00 2013-02-12 10:30:00 2app1 2013-02-12 10:30:00 2013-02-12 10:45:00 1app1 2013-02-12 10:45:00 2013-02-12 11:00:00 2app1 2013-02-12 11:00:00 2013-02-12 11:30:00 3app1 2013-02-12 11:30:00 2013-02-12 12:30:00 2app1 2013-02-12 12:30:00 2013-02-12 14:30:00 1app2 2013-02-12 08:30:00 2013-02-12 08:45:00 1app2 2013-02-12 09:00:00 2013-02-12 09:30:00 1app2 2013-02-12 11:45:00 2013-02-12 12:00:00 1app2 2013-02-12 12:30:00 2013-02-12 12:45:00 1app2 2013-02-12 12:45:00 2013-02-12 13:00:00 2app2 2013-02-12 13:00:00 2013-02-12 13:30:00 3app2 2013-02-12 13:30:00 2013-02-12 14:00:00 2app2 2013-02-12 14:00:00 2013-02-12 15:30:00 1app2 2013-02-12 15:30:00 2013-02-12 16:30:00 2app2 2013-02-12 16:30:00 2013-02-12 17:00:00 1app3 2013-02-12 08:00:00 2013-02-12 08:30:00 2app3 2013-02-12 08:30:00 2013-02-12 09:00:00 1app3 2013-02-12 09:00:00 2013-02-12 09:30:00 1app3 2013-02-12 09:30:00 2013-02-12 10:00:00 1
The desired output in Figure 3 assumes that you're not interested in returning intervals with a count of zero. For example, there were zero active sessions connected to app2 between 8:45 and 9:00. You might need to return such intervals for some purposes, in which case your solution will need to accommodate this need.
Solution Based on PIVOT, LEAD, and SUM Window Aggregates
The first step in the solution is to generate one sequence of start and end events, marking start events with an event type of +1 (causes the count of active sessions to increase) and end events with -1 (count decreases). To generate such a sequence of events, you unify the results of two queries—one returning start events and the other end events, like so:
SELECT app, starttime AS ts, +1 AS typeFROM dbo.SessionsUNION ALLSELECT app, endtime AS ts, -1 AS typeFROM dbo.SessionsORDER BY app, ts;
Figure 4 shows the output of this query.
app ts type----- ------------------- -----------app1 2012-02-12 08:30:00 1app1 2012-02-12 08:30:00 1app1 2012-02-12 08:45:00 -1app1 2012-02-12 09:00:00 1app1 2012-02-12 09:15:00 1app1 2012-02-12 09:15:00 1app1 2012-02-12 09:30:00 -1app1 2012-02-12 09:30:00 -1app1 2012-02-12 10:30:00 -1app1 2012-02-12 10:30:00 -1app1 2012-02-12 10:30:00 1app1 2012-02-12 10:45:00 1app1 2012-02-12 11:00:00 1app1 2012-02-12 11:30:00 -1app1 2012-02-12 12:30:00 -1app1 2012-02-12 14:30:00 -1...
Next ,you need to group the sequence by app and ts, pivoting the counts of events by the type (remember 1 for start and -1 for end). You get one row for each distinct application and timestamp with a column called [1] for the count of start events and [-1] for the count of end events at that point. That timestamp marks when the discrete interval starts. You then use the LEAD function to retrieve the next distinct ts value, marking when the discrete interval ends. The number of overlapping sessions with the current interval is then the running total sum of the counts of start events minus counts of end events. After defining a CTE called C1 based on the previous query, the outer query looks like this:
SELECT app, ts AS starttime, LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime, SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cntFROM C1 PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P;
The code in Listing 5 combines the queries.
WITH C1 AS( SELECT app, starttime AS ts, +1 AS type FROM dbo.Sessions UNION ALL SELECT app, endtime AS ts, -1 AS type FROM dbo.Sessions)SELECT app, ts AS starttime, LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime, SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cntFROM C1 PIVOT(COUNT(type) FOR type IN([1],[-1])) AS PORDER BY app, starttime;
Figure 5 shows the output of the code in Listing 5.
app starttime endtime cnt----- -------------------- -------------------- ----app1 2013-02-12 08:30:00 2013-02-12 08:45:00 2app1 2013-02-12 08:45:00 2013-02-12 09:00:00 1app1 2013-02-12 09:00:00 2013-02-12 09:15:00 2app1 2013-02-12 09:15:00 2013-02-12 09:30:00 4app1 2013-02-12 09:30:00 2013-02-12 10:30:00 2app1 2013-02-12 10:30:00 2013-02-12 10:45:00 1app1 2013-02-12 10:45:00 2013-02-12 11:00:00 2app1 2013-02-12 11:00:00 2013-02-12 11:30:00 3app1 2013-02-12 11:30:00 2013-02-12 12:30:00 2app1 2013-02-12 12:30:00 2013-02-12 14:30:00 1app1 2013-02-12 14:30:00 NULL 0app2 2013-02-12 08:30:00 2013-02-12 08:45:00 1app2 2013-02-12 08:45:00 2013-02-12 09:00:00 0app2 2013-02-12 09:00:00 2013-02-12 09:30:00 1app2 2013-02-12 09:30:00 2013-02-12 11:45:00 0app2 2013-02-12 11:45:00 2013-02-12 12:00:00 1app2 2013-02-12 12:00:00 2013-02-12 12:30:00 0app2 2013-02-12 12:30:00 2013-02-12 12:45:00 1app2 2013-02-12 12:45:00 2013-02-12 13:00:00 2app2 2013-02-12 13:00:00 2013-02-12 13:30:00 3app2 2013-02-12 13:30:00 2013-02-12 14:00:00 2app2 2013-02-12 14:00:00 2013-02-12 15:30:00 1app2 2013-02-12 15:30:00 2013-02-12 16:30:00 2app2 2013-02-12 16:30:00 2013-02-12 17:00:00 1app2 2013-02-12 17:00:00 NULL 0app3 2013-02-12 08:00:00 2013-02-12 08:30:00 2app3 2013-02-12 08:30:00 2013-02-12 09:00:00 1app3 2013-02-12 09:00:00 2013-02-12 09:30:00 1app3 2013-02-12 09:30:00 2013-02-12 10:00:00 1app3 2013-02-12 10:00:00 NULL 0
Notice that the result includes intervals with a count of zero, as well as an interval with a starttime value at the last end event and a NULL in the endtime column. Naturally you'll want to remove the rows with the NULL in the endtime column. To achieve this, you can define a CTE based on the outer query (call it C2) and apply the filter endtime IS NOT NULL in the query against C2. If you also want to remove the intervals with the count of 0, use the filter cnt > 0, which will also remove the rows with the NULL in the endtime column because the count there is always 0. Listing 6 contains the complete solution, assuming you want to keep only intervals that have overlapping sessions.
WITH C1 AS( SELECT app, starttime AS ts, +1 AS type FROM dbo.Sessions UNION ALL SELECT app, endtime AS ts, -1 AS type FROM dbo.Sessions),C2 AS( SELECT app, ts AS starttime, LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime, SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cnt FROM C1 PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P)SELECT *FROM C2WHERE cnt > 0ORDER BY app, starttime;
This solution generated the plan shown in Figure 6 on my system against the large set of sample data and ran for 37 seconds.
Figure 6: Plan for Query in Listing 6
What's most interesting about this solution is that the two supportive indexes idx_start and idx_end are scanned in order, and throughout the plan, there's not even one explicit sort operator. The grouping, pivoting, LEAD, and SUM window functions, and even the presentation ORDER BY clause, all rely on the order preservation of the data.
Solution Based on LEAD and ROW_NUMBER Window Aggregates
The next solution also generates a chronological sequence of start and end events, plus it computes three ordinals with the ROW_NUMBER function:
An ordinal for start events (call it s) against the starts query
An ordinal for end events (call it e) against the ends query
An ordinal for start or end events (call it se) against the combined starts and ends
Here's the query that generates the sequence of events and the ordinals:
WITH C1 AS( SELECT app, starttime AS ts, +1 AS type, keycol, ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime, keycol) AS s, -- start ordinal NULL AS e FROM dbo.Sessions UNION ALL SELECT app, endtime, -1 AS type, keycol, NULL AS s, ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime, keycol) AS e -- end ordinal FROM dbo.Sessions)SELECT *, ROW_NUMBER() OVER(PARTITION BY app ORDER BY ts, type, keycol) AS se -- start or end ordinalFROM C1;
Figure 7 shows the output of this query.
app ts type keycol s e se----- ------------------- ----- ------- ----- ----- ---app1 2012-02-12 08:30:00 1 2 1 NULL 1app1 2012-02-12 08:30:00 1 3 2 NULL 2app1 2012-02-12 08:45:00 -1 3 NULL 1 3app1 2012-02-12 09:00:00 1 5 3 NULL 4app1 2012-02-12 09:15:00 1 7 4 NULL 5app1 2012-02-12 09:15:00 1 11 5 NULL 6app1 2012-02-12 09:30:00 -1 5 NULL 2 7app1 2012-02-12 09:30:00 -1 11 NULL 3 8app1 2012-02-12 10:30:00 -1 2 NULL 4 9app1 2012-02-12 10:30:00 -1 7 NULL 5 10app1 2012-02-12 10:30:00 1 13 6 NULL 11app1 2012-02-12 10:45:00 1 17 7 NULL 12app1 2012-02-12 11:00:00 1 19 8 NULL 13app1 2012-02-12 11:30:00 -1 17 NULL 6 14app1 2012-02-12 12:30:00 -1 19 NULL 7 15app1 2012-02-12 14:30:00 -1 13 NULL 8 16...
Next, the solution computes the count of active sessions after each event. Notice in the output in Figure 7 that for start events you have the start ordinal (s) and the start or end ordinal (se) available. You can compute how many end events happened until that point as e = se - s. Then you can compute the count of active sessions as cnt = s - e, or more fully as cnt = s - (se - s). In a similar way you can compute the count after end events using se and e as cnt = (se - e) - e. To get the count after both start and end events in the same target column, use cnt = COALESCE(s - (se - s), s - (se - s)). You also need to return with each event the timestamp of the next event. This can be achieved with the LEAD function. Assuming you define a CTE called C2 based on the outer query in the previous code sample, you use the following query against C2 to compute both the count and the next timestamp:
SELECT *, (COALESCE( s - (se - s), -- count of active intervals after start event (se - e) - e) -- count of active intervals after end event ) AS cnt, LEAD(ts) OVER(PARTITION BY app ORDER BY ts, type, keycol) AS nexttsFROM C2
Notice the ORDER BY list of the LEAD function. It ensures that start events are positioned after end events, and it uses keycol as a tiebreaker for determinism. The points where ts is different than nextts represent discrete intervals, and the counts associated with those points are inclusive of all events at those points, meaning that they represent the counts of overlapping sessions with those intervals. The final step is to define a CTE based on the last query (call it C3) and have the outer query filter the events where ts <> nextts—and if you're not interested in intervals with zero counts, remove those as well. Here's the outer query against C3:
SELECT app, ts AS starttime, nextts AS endtime, cntFROM C3WHERE ts <> nextts AND cnt > 0;
Listing 7 contains the complete solution.
WITH C1 AS( SELECT app, starttime AS ts, +1 AS type, keycol, ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime, keycol) AS s, -- start ordinal NULL AS e FROM dbo.Sessions UNION ALL SELECT app, endtime, -1 AS type, keycol, NULL AS s, ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime, keycol) AS e -- end ordinal FROM dbo.Sessions),C2 AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY app ORDER BY ts, type, keycol) AS se -- start or end ordinal FROM C1),C3 AS( SELECT *, (COALESCE( s - (se - s), -- count of active intervals after start event (se - e) - e) -- count of active intervals after end event ) AS cnt, LEAD(ts) OVER(PARTITION BY app ORDER BY ts, type, keycol) AS nextts FROM C2)SELECT app, ts AS starttime, nextts AS endtime, cntFROM C3WHERE ts <> nextts AND cnt > 0ORDER BY app, starttime;
Figure 8 shows the plan for this solution.
Figure 8: Plan for Query in Listing 7
This plan is more efficient than the one for the previous solution in a couple of ways. It doesn't involve grouping, and it reduces the number of Window Spool operators. This query ran for 22 seconds on my system—almost half the run time of the first solution.
Related: SQL Server 2012's Window Functions, Part 1
To Be Continued
This article is the first in a series covering challenges related to manipulation of intervals involving computation of various counts. In this article, I covered how to identify discrete intervals and counts of overlapping source intervals. I showed one solution that uses PIVOT, LEAD, and SUM window aggregates, as well as a faster solution using the ROW_NUMBER and LEAD functions. Next month I'll continue the coverage of this challenge, showing further performance improvements, as well as a variation of the task.
About the Author
You May Also Like