Intervals and Counts, Part 4
Compute the count of active intervals at the beginning of every fixed interval
November 7, 2013
This article is the fourth in a series about intervals and counts. The series focuses on temporal intervals representing entities such as sessions, phone calls, and chats; it describes how to handle tasks related to such intervals. Last month I discussed how to compute the maximum count during every fixed interval of time. This month's focus is computing the count of active intervals at the beginning of every fixed interval.
Related: Intervals and Counts, Part 1; and Intervals and Counts, Part 2; and Intervals and Counts, Part 3
In this article, I use the same sample data that I used in the previous articles. Use the code in Listing 1 to create the Sessions table and fill it with a small set of sample data to validate the correctness of the solutions.
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;IF OBJECT_ID(N'dbo.Apps', N'U') IS NOT NULL DROP TABLE dbo.Apps;CREATE TABLE dbo.Apps( app VARCHAR(10) NOT NULL, CONSTRAINT PK_Apps PRIMARY KEY(app));CREATE TABLE dbo.Sessions( keycol INT NOT NULL, app VARCHAR(10) NOT NULL, starttime DATETIME2(0) NOT NULL, endtime DATETIME2(0) NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(keycol), CONSTRAINT CHK_Sessios_et_st CHECK(endtime > starttime));CREATE UNIQUE INDEX idx_start ON dbo.Sessions(app, starttime, keycol) INCLUDE(endtime);CREATE UNIQUE INDEX idx_end ON dbo.Sessions(app, endtime, keycol) INCLUDE(starttime);-- Code to fill Sessions with small set of sample dataTRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Apps;INSERT INTO dbo.Apps(app) VALUES('app1'),('app2'),('app3');INSERT INTO dbo.Sessions(keycol, app, starttime, endtime) VALUES (2, 'app1', '20130212 08:30:00', '20130212 10:30:00'), (3, 'app1', '20130212 08:30:00', '20130212 08:45:00'), (5, 'app1', '20130212 09:00:00', '20130212 09:30:00'), (7, 'app1', '20130212 09:15:00', '20130212 10:30:00'), (11, 'app1', '20130212 09:15:00', '20130212 09:30:00'), (13, 'app1', '20130212 10:30:00', '20130212 14:30:00'), (17, 'app1', '20130212 10:45:00', '20130212 11:30:00'), (19, 'app1', '20130212 11:00:00', '20130212 12:30:00'), (23, 'app2', '20130212 08:30:00', '20130212 08:45:00'), (29, 'app2', '20130212 09:00:00', '20130212 09:30:00'), (31, 'app2', '20130212 11:45:00', '20130212 12:00:00'), (37, 'app2', '20130212 12:30:00', '20130212 14:00:00'), (41, 'app2', '20130212 12:45:00', '20130212 13:30:00'), (43, 'app2', '20130212 13:00:00', '20130212 14:00:00'), (47, 'app2', '20130212 14:00:00', '20130212 16:30:00'), (53, 'app2', '20130212 15:30:00', '20130212 17:00:00'), (61, 'app3', '20130212 08:00:00', '20130212 08:30:00'), (62, 'app3', '20130212 08:00:00', '20130212 09:00:00'), (63, 'app3', '20130212 09:00:00', '20130212 09:30:00'), (64, 'app3', '20130212 09:30:00', '20130212 10:00:00');
As in "Intervals and Counts, Part 3," the indexes idx_start and idx_end are slightly different than in "Intervals and Counts, Part 1" and "Intervals and Counts, Part 2." Idx_start is defined on the keylist (app, starttime, keycol), and include list (endtime), and the index idx_end is defined on the keylist (app, endtime, keycol) and the include list (starttime). In "Intervals and Counts, Part 1" and "Intervals and Counts, Part 2," the indexes had the same keylists but no include list. The solutions in both this article and the previous article ("Intervals and Counts, Part 3") need the include lists for coverage.
Run the code in Listing 2 to fill the Sessions table with a large set of sample data to check the performance of the solutions. Listing 2 also creates a helper function called GetNums, which accepts low and high values as inputs and returns a sequence of integers in the input range.
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)), L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n FROM Nums ORDER BY rownum;GO-- Code to fill Sessions with large set of sample dataTRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Apps;DECLARE @numrows AS INT = 2000000, -- total number of rows @numapps AS INT = 100; -- number of applicationsINSERT INTO dbo.Apps WITH(TABLOCK) (app) SELECT 'app' + CAST(n AS VARCHAR(10)) AS app FROM dbo.GetNums(1, @numapps) AS Nums;INSERT INTO dbo.Sessions WITH(TABLOCK) (keycol, app, starttime, endtime) SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol, D.*, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (20*60), starttime) AS endtime FROM ( SELECT 'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % @numapps AS VARCHAR(10)) AS app, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60), '20130101') AS starttime FROM dbo.GetNums(1, @numrows) AS Nums ) AS D;
The Challenge
The challenge that's the focus of this article is to compute the count of active intervals at the beginning of every fixed interval in a given period. I got this request from a customer who needed to compute the number of chats that were active at the beginning of every fixed interval of time. In our case, you need to compute the count every hour, on the hour. For our purposes, to avoid any confusion, the count should take into consideration all interval end and start events that happened on the hour in question. Figure 1 depicts graphically the small set of sample data and the desired result for the input period starting with '20130212 08:00:00' and ending before '20130212 18:00:00'.
Figure 1: Graph Showing Counts at the Beginnings of Fixed Intervals
Figure 2 shows the desired result from your solution query.
apptscnt------------------------------------------------app12013-02-12 08:00:000app12013-02-12 09:00:002app12013-02-12 10:00:002app12013-02-12 11:00:003app12013-02-12 12:00:002app12013-02-12 13:00:001app12013-02-12 14:00:001app12013-02-12 15:00:000app12013-02-12 16:00:000app12013-02-12 17:00:000app22013-02-12 08:00:000app22013-02-12 09:00:001app22013-02-12 10:00:000app22013-02-12 11:00:000app22013-02-12 12:00:000app22013-02-12 13:00:003app22013-02-12 14:00:001app22013-02-12 15:00:001app22013-02-12 16:00:002app22013-02-12 17:00:000app32013-02-12 08:00:002app32013-02-12 09:00:001app32013-02-12 10:00:000app32013-02-12 11:00:000app32013-02-12 12:00:000app32013-02-12 13:00:000app32013-02-12 14:00:000app32013-02-12 15:00:000app32013-02-12 16:00:000app32013-02-12 17:00:000
I'll describe two solutions: one using a join and another using window functions. Both solutions use a helper table called TimeStamps like the one I used in "Intervals and Counts, Part 3." Use the code in Listing 3 to create the TimeStamps table and populate it with a row for every hour in the year 2013.
-- DDL for TimeStamps tableIF OBJECT_ID(N'dbo.TimeStamps', N'U') IS NOT NULL DROP TABLE dbo.TimeStamps;CREATE TABLE dbo.TimeStamps( ts DATETIME2(0) NOT NULL CONSTRAINT PK_TimeStamps PRIMARY KEY);GO-- Populate TimeStamps tableDECLARE @s AS DATETIME2(0) = '20130101', -- inclusive @e AS DATETIME2(0) = '20140101'; -- exclusiveINSERT INTO dbo.TimeStamps WITH (TABLOCK) (ts) SELECT DATEADD(hour, n-1, @s) AS ts FROM dbo.GetNums(1, DATEDIFF(hour, @s, @e)) AS Nums;GO
Solution Using Joins
The first solution I'll present is probably the most straightforward way to solve the task, although it isn't the most efficient one. Listing 4 shows the solution code (assuming you run it against the small set of sample data with the input period starting on '20130212 08:00:00' and ending before '20130212 18:00:00').
DECLARE @app AS VARCHAR(10) = 'app1', @starttime AS DATETIME2(0) = '20130212 08:00:00', -- inclusive @endtime AS DATETIME2(0) = '20130212 18:00:00'; -- exclusiveSELECT A.app, T.ts, COUNT(S.keycol) AS cntFROM dbo.Apps AS A CROSS JOIN dbo.TimeStamps AS T LEFT OUTER JOIN dbo.Sessions AS S ON A.app = S.app AND T.ts >= S.starttime AND T.ts < S.endtimeWHERE ts >= @starttime AND ts < @endtimeGROUP BY A.app, T.tsORDER BY A.app, T.ts; -- for presentation purposes
The solution's code performs a cross join between the Apps table (call it A) and the Timestamps table (call it T) to generate a row for every application and start-of-hour. The code then performs a left outer join with the Sessions table (call it S) based on the predicate A.app = S.app AND T.ts >= S.starttime AND T.ts < S.endtime. Each left row representing a distinct combination of application and hour is matched by all sessions with the same applicationthat was active during that start-of-hour. By "active" I mean that the start-of-hour happened on or after the session's start time and before the session's end time.
The code then filters only the rows in the input period, groups the remaining rows by application and timestamp, and finally counts the number of active sessions in each group. The code in Listing 4, run against the small set of sample data, generates the desired result set shown earlier in Figure 2.
To test the performance of the solution, use the code in Listing 5 against the large set of sample data. The input period now starts on January 1, 2013, and ends before February 1, 2013.
DECLARE @app AS VARCHAR(10) = 'app1', @starttime AS DATETIME2(0) = '20130101', @endtime AS DATETIME2(0) = '20130201';SELECT A.app, T.ts, COUNT(S.keycol) AS cntFROM dbo.Apps AS A CROSS JOIN dbo.TimeStamps AS T LEFT OUTER JOIN dbo.Sessions AS S ON A.app = S.app AND T.ts >= S.starttime AND T.ts < S.endtimeWHERE ts >= @starttime AND ts < @endtimeGROUP BY A.app, T.ts;
Figure 3 shows the plan for this query (using SQL Sentry's Plan Explorer). This plan is very expensive. The bulk of the cost is in the Hash Match join implementing the left outer join and in the Hash Match aggregate implementing the local aggregate per thread. The reason the join is so expensive is because there are two range predicates involved against two different columns. From the perspective of the Sessions table, the predicates are S.starttime <= T.ts AND S.endtime > T.ts.
Figure 3: Plan for Solution in Listing 5
The code in Listing 5 took four and a half minutes to complete on my machine.
Solution Using Window Functions
The second solution I'll discuss uses window functions and is significantly more efficient than the first solution. Listing 6 shows the solution's code against the small set of sample data, for just one application (app1 in this case), for the input period starting on '20130212 08:00:00' and ending before '20130212 18:00:00'.
DECLARE @app AS VARCHAR(10) = 'app1', @starttime AS DATETIME2(0) = '20130212 08:00:00', -- inclusive @endtime AS DATETIME2(0) = '20130212 17:00:00'; -- exclusiveWITH C1 AS( SELECT endtime AS ts, -1 AS increment, 1 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT starttime AS ts, 1 AS increment, 2 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT ts, 0 AS increment, 3 AS ord FROM dbo.TimeStamps WHERE ts >= @starttime AND ts < @endtime),C2 AS( SELECT ts, increment, ord, SUM(increment) OVER(ORDER BY ts, ord ROWS UNBOUNDED PRECEDING) AS cnt FROM C1),C3 AS( SELECT *, CASE WHEN increment = 0 THEN LAG(cnt, 1, 0) OVER(ORDER BY ts, ord) END AS prv FROM C2)SELECT ts, prv AS cntFROM C3WHERE increment = 0ORDER BY ts; -- for presentation purposes
If you read "Intervals and Counts, Part 3," the CTEs C1 and C2 should look familiar to you. These two CTEs are the same as the ones I used in last month's solution. As a reminder, the code in the CTE C1 unifies three sets of events:
End events marked with a -1 increment and 1 as the ordering position (ord).
Start events marked with a +1 increment and 2 as the ordering position.
The start-of-hour timestamps in the input period marked with a 0 (neutral) increment and 3 as the ordering position, ensuring the neutral events are considered after both end and start events that happened at the same time.
The code in the CTE C2 then computes the count of active intervals after each event as the running total sum of the column increment, ordered by the event timestamp (ts) and ordering position (ord). The code in the CTE C3 uses a CASE expression and the LAG function to compute for the neutral events (increment = 0) the count of active intervals. The applicable count on a neutral event is the count after the previous event if one exists and 0 if one doesn't exist. Finally, the outer query filters only neutral events and returns the previous counts as the current applicable counts.
Just as in the previous articles in the series, you can improve parallelism handling by encapsulating the solution in an inline table-valued function and using the APPLY operator against the Apps table. Run the code in Listing 7 to create the function IntervalCounts. The function encapsulates the logic from Listing 6 for an input application and period.
IF OBJECT_ID(N'dbo.IntervalCounts', N'IF') IS NOT NULL DROP FUNCTION dbo.IntervalCounts;GOCREATE FUNCTION dbo.IntervalCounts( @app AS VARCHAR(10), @starttime AS DATETIME2(0), @endtime AS DATETIME2(0)) RETURNS TABLEASRETURNWITH C1 AS( SELECT endtime AS ts, -1 AS increment, 1 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT starttime AS ts, 1 AS increment, 2 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT ts, 0 AS increment, 3 AS ord FROM dbo.TimeStamps WHERE ts >= @starttime AND ts < @endtime),C2 AS( SELECT ts, increment, ord, SUM(increment) OVER(ORDER BY ts, ord ROWS UNBOUNDED PRECEDING) AS cnt FROM C1),C3 AS( SELECT *, CASE WHEN increment = 0 THEN LAG(cnt, 1, 0) OVER(ORDER BY ts, ord) END AS prv FROM C2)SELECT ts, prv AS cntFROM C3WHERE increment = 0;GO
Use the following query to test the solution for correctness against the small set of sample data:
SELECT A.app, IC.*FROM dbo.Apps AS A CROSS APPLY dbo.IntervalCounts(A.app, '20130212 08:00:00', '20130212 18:00:00') AS IC;
You'll get the desired result shown earlier in Figure 2, although not necessarily in the same presentation order.
Use the following code to test the solution's performance against the large set of sample data:
SELECT A.app, IC.*FROM dbo.Apps AS A CROSS APPLY dbo.IntervalCounts(A.app, '20130101', '20130201') AS IC;
Figure 4 shows the plan for this query.
Figure 4: Plan for Solution in Listing 7
As you can see, this plan is highly efficient. The plan scans the rows from the Apps clustered index. For each application, the plan performs an index seek and an ordered range scan in each of the indexes: idx_start, idx_end, and PK_TimeStamps, scanning only the relevant range for the current application. Based on the ordered range scans, the plan merges the rows and computes all window functions (ROW_NUMBER and LAG functions) without even one explicit sort, which is amazing! This solution took only 11 seconds to complete on my system against the large set of sample data—compared with four and a half minutes for the solution based on joins.
Window Functions to the Rescue Yet Again
In this article, I presented yet another task involving temporal intervals: computing counts of active intervals at the beginnings of fixed intervals in an input period. I covered two solutions: one based on joins and another based on window functions. The solution based on joins is straightforward and would be the likely intuitive attempt for most people; however, it's very inefficient. The solution based on window functions is highly efficient. Time and again, window functions manage to surprise me with their elegance and efficiency. I suspect we haven't seen the last of them, and that people will keep finding new, creative, efficient, and exciting ways to use them in future solutions.
About the Author
You May Also Like