Intervals and Counts, Part 3
Compute the maximum count of concurrent sessions during fixed intervals
October 2, 2013
There are many types of tasks that involve handling date and time intervals and computing all sorts of counts related to those intervals. This article is the third in a series on the subject. This month's focus is computing the maximum count of concurrent sessions during every fixed interval (e.g., hourly) within a certain input period. (I'll explain what this actually means shortly.)
Related: Intervals and Counts, Part 1 and Intervals and Counts, Part 2
I'll use input data similar to what I used in the previous articles in the series. Listing 1 contains the code to create the Sessions table and populate it with a small set of sample data to check the validity of the solution.
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;IF OBJECT_ID(N'dbo.Apps', N'U') IS NOT NULL DROP TABLE dbo.Apps;CREATE TABLE dbo.Apps( app VARCHAR(10) NOT NULL, CONSTRAINT PK_Apps PRIMARY KEY(app));CREATE TABLE dbo.Sessions( keycol INT NOT NULL, app VARCHAR(10) NOT NULL, starttime DATETIME2(0) NOT NULL, endtime DATETIME2(0) NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(keycol), CONSTRAINT CHK_Sessios_et_st CHECK(endtime > starttime));CREATE UNIQUE INDEX idx_start ON dbo.Sessions(app, starttime, keycol) INCLUDE(endtime);CREATE UNIQUE INDEX idx_end ON dbo.Sessions(app, endtime, keycol) INCLUDE(starttime);-- Code to fill Sessions table with small set of sample dataTRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Apps;INSERT INTO dbo.Apps(app) VALUES('app1'),('app2'),('app3');INSERT INTO dbo.Sessions(keycol, app, starttime, endtime) VALUES (2, 'app1', '20130212 08:30:00', '20130212 10:30:00'), (3, 'app1', '20130212 08:30:00', '20130212 08:45:00'), (5, 'app1', '20130212 09:00:00', '20130212 09:30:00'), (7, 'app1', '20130212 09:15:00', '20130212 10:30:00'), (11, 'app1', '20130212 09:15:00', '20130212 09:30:00'), (13, 'app1', '20130212 10:30:00', '20130212 14:30:00'), (17, 'app1', '20130212 10:45:00', '20130212 11:30:00'), (19, 'app1', '20130212 11:00:00', '20130212 12:30:00'), (23, 'app2', '20130212 08:30:00', '20130212 08:45:00'), (29, 'app2', '20130212 09:00:00', '20130212 09:30:00'), (31, 'app2', '20130212 11:45:00', '20130212 12:00:00'), (37, 'app2', '20130212 12:30:00', '20130212 14:00:00'), (41, 'app2', '20130212 12:45:00', '20130212 13:30:00'), (43, 'app2', '20130212 13:00:00', '20130212 14:00:00'), (47, 'app2', '20130212 14:00:00', '20130212 16:30:00'), (53, 'app2', '20130212 15:30:00', '20130212 17:00:00'), (61, 'app3', '20130212 08:00:00', '20130212 08:30:00'), (62, 'app3', '20130212 08:00:00', '20130212 09:00:00'), (63, 'app3', '20130212 09:00:00', '20130212 09:30:00'), (64, 'app3', '20130212 09:30:00', '20130212 10:00:00');
What's a bit different in this article is the definition of the indexes idx_start and idx_end. The keylists are the same as in the previous articles in this series, but this time the indexes have an INCLUDE clause. The index idx_start includes the endtime column, and the index idx_end includes the starttime column. The new indexes work better for the solutions covered in this article.
Listing 2 contains the code to create the helper function GetNums, which generates a sequence of integers in the requested range. Listing 2 also contains the code to fill the Sessions table with a large set of sample data for performance testing purposes.
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)), L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n FROM Nums ORDER BY rownum;GO-- Code to fill Sessions table with large set of sample dataTRUNCATE TABLE dbo.Sessions;TRUNCATE TABLE dbo.Apps;DECLARE @numrows AS INT = 2000000, -- total number of rows @numapps AS INT = 100; -- number of applicationsINSERT INTO dbo.Apps WITH(TABLOCK) (app) SELECT 'app' + CAST(n AS VARCHAR(10)) AS app FROM dbo.GetNums(1, @numapps) AS Nums;INSERT INTO dbo.Sessions WITH(TABLOCK) (keycol, app, starttime, endtime) SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol, D.*, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (20*60), starttime) AS endtime FROM ( SELECT 'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % @numapps AS VARCHAR(10)) AS app, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60), '20130101') AS starttime FROM dbo.GetNums(1, @numrows) AS Nums ) AS D;
The Challenge
The challenge this month is to compute the maximum count of concurrent sessions during every fixed interval within an input period, for each application separately. For example, suppose that the input period starts at 8:00 a.m. February 12, 2013, and ends at 5:00 p.m. February 12, 2013. For each application and hour within the input period, you need to compute the maximum count of concurrent sessions. For the sake of this article, if one session ends exactly when another starts, the two aren't considered concurrent. Also suppose that a session starts at 3:00 p.m. and ends at 4:00 p.m. This session is considered active during the fixed hourly interval starting at 3:00 p.m. and ending at 4:00 p.m. but not active during the fixed interval starting at 4:00 p.m. and ending at 5:00 p.m. Figure 1 depicts graphically what you need to compute for the small set of sample data generated by the code in Listing 1.
Figure 1: Graph Showing Counts During Fixed Intervals
Figure 2 presents the result, sorted by app and startime for clarity—but you can assume that there's no presentation ordering requirement in the task. It's fine if your solution returns the result sorted differently, as long as you return the same result set.
appstarttimeendtimecnt-------------------------------------------------app12013-02-12 08:00:002013-02-12 09:00:002app12013-02-12 09:00:002013-02-12 10:00:004app12013-02-12 10:00:002013-02-12 11:00:002app12013-02-12 11:00:002013-02-12 12:00:003app12013-02-12 12:00:002013-02-12 13:00:002app12013-02-12 13:00:002013-02-12 14:00:001app12013-02-12 14:00:002013-02-12 15:00:001app12013-02-12 15:00:002013-02-12 16:00:000app12013-02-12 16:00:002013-02-12 17:00:000app22013-02-12 08:00:002013-02-12 09:00:001app22013-02-12 09:00:002013-02-12 10:00:001app22013-02-12 10:00:002013-02-12 11:00:000app22013-02-12 11:00:002013-02-12 12:00:001app22013-02-12 12:00:002013-02-12 13:00:002app22013-02-12 13:00:002013-02-12 14:00:003app22013-02-12 14:00:002013-02-12 15:00:001app22013-02-12 15:00:002013-02-12 16:00:002app22013-02-12 16:00:002013-02-12 17:00:002app32013-02-12 08:00:002013-02-12 09:00:002app32013-02-12 09:00:002013-02-12 10:00:001app32013-02-12 10:00:002013-02-12 11:00:000app32013-02-12 11:00:002013-02-12 12:00:000app32013-02-12 12:00:002013-02-12 13:00:000app32013-02-12 13:00:002013-02-12 14:00:000app32013-02-12 14:00:002013-02-12 15:00:000app32013-02-12 15:00:002013-02-12 16:00:000app32013-02-12 16:00:002013-02-12 17:00:000
The Solution
The solution I describe in this article involves using a helper table called TimeStamps. You populate the table with the start times of all fixed intervals within the period that you need to support. The code in Listing 3 creates the TimeStamps table and populates it with the start times of all fixed hourly intervals within the year 2013, assuming that you need to support the period spanning the year 2013.
-- DDL for TimeStamps tableIF OBJECT_ID(N'dbo.TimeStamps', N'U') IS NOT NULL DROP TABLE dbo.TimeStamps;CREATE TABLE dbo.TimeStamps( ts DATETIME2(0) NOT NULL CONSTRAINT PK_TimeStamps PRIMARY KEY);GO-- Populate TimeStamps tableDECLARE @s AS DATETIME2(0) = '20130101', -- inclusive @e AS DATETIME2(0) = '20140101'; -- exclusiveINSERT INTO dbo.TimeStamps WITH (TABLOCK) (ts) SELECT DATEADD(hour, n-1, @s) AS ts FROM dbo.GetNums(1, DATEDIFF(hour, @s, @e)) AS Nums;GO
Listing 4 contains the solution code provided just for the application app3, for simplicity. After I explain the solution for a single application, I'll explain how to apply it to all applications. I'll explain the steps in the solution one CTE at a time, starting with the CTE C1.
DECLARE @app AS VARCHAR(10) = 'app1', @starttime AS DATETIME2(0) = '20130212 08:00:00', -- inclusive @endtime AS DATETIME2(0) = '20130212 17:00:00'; -- exclusiveWITH C1 AS( SELECT endtime AS ts, -1 AS increment, 1 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT starttime AS ts, 1 AS increment, 2 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT ts, 0 AS increment, 3 AS ord FROM dbo.TimeStamps WHERE ts >= @starttime AND ts < @endtime),C2 AS( SELECT ts, increment, SUM(increment) OVER(ORDER BY ts, ord ROWS UNBOUNDED PRECEDING) AS cnt FROM C1),C3 AS( SELECT DATEADD( hour, DATEDIFF(hour, @starttime, ts), @starttime ) AS starttime, cnt FROM C2 WHERE increment <> -1)SELECT starttime, DATEADD(hour, 1, starttime) AS endtime, MAX(cnt) AS mxFROM C3GROUP BY starttime;
Similar to the techniques I described in "Intervals and Counts, Part 1" and "Intervals and Counts, Part 2," the solution starts by generating a chronological sequence of events. This happens in the body of the CTE C1. Previously, the chronological sequence of events was constructed by unifying only start and end events. But this time there's a twist—you need to return information about every fixed hourly interval even if no events (start or end) took place during the interval. Recall that you mark start events with a +1 increment, because such events increase the count of active sessions, and end events with a -1 increment, because such events decrease the count. To address the twist in our task, the solution adds dummy entries with the start times of all fixed hourly intervals, marked with a 0 increment so that the event won't affect the count. This way, the solution guarantees that there will be at least one event during every fixed interval.
To correctly handle cases in which different types of events happen at the same point in time, the queries in C1 assign a different ordering value (call it ord) to each type of event. End events are assigned with the ord value 1 because they need to be considered first, start events with 2, and dummy events with 3.
Remember that you're supposed to consider only intervals that fall in the specified input period; therefore, the first two queries in the body of C1 include the filter starttime < @endtime AND endtime >= @starttime (inclusive of @starttime and exclusive of @endtime), and the third query includes the filter ts >= @starttime AND ts < @endtime.
The second step in the solution is implemented by the CTE C2. The query in the body of the CTE queries C1 and computes a running total of the increment column (remember, 1s, -1s, and 0s) based on the order of ts and ord, naming the resulting column cnt:
SELECT ts, increment, SUM(increment) OVER(ORDER BY ts, ord ROWS UNBOUNDED PRECEDING) AS cntFROM C1
For all events—including dummy events—the query computes the current count after the event. By assigning a lower ord value for end events compared to start events, you ensure that if a session ends exactly when another starts, you don't consider them as concurrent.
The next step is implemented by the CTE C3. Here's the query in the body of the CTE:
SELECT DATEADD( hour, DATEDIFF(hour, @starttime, ts), @starttime ) AS starttime, cntFROM C2WHERE increment <> -1
Because the maximum count during each hour will necessarily fall either after a start event or after a dummy event (in case no start event happens during the hour), the query filters only events that aren't end events. The query also computes for each timestamp the respective start of the hour using the expression DATEADD( hour, DATEDIFF(hour, @starttime, ts), @starttime ), naming the resulting column starttime.
Finally, the outer query groups the rows from C3 by startime (the start of the hour), returning the start of the hour, end of the hour, and maximum count per group, like so:
SELECT starttime, DATEADD(hour, 1, starttime) AS endtime, MAX(cnt) AS mxFROM C3GROUP BY starttime;
The solution in Listing 4 is applied to a single input application. Next, you encapsulate this logic in an inline table function that accepts the application and period as inputs. Listing 5 provides the definition of such a function, called IntervalCounts.
IF OBJECT_ID(N'dbo.IntervalCounts', N'IF') IS NOT NULL DROP FUNCTION dbo.IntervalCounts;GOCREATE FUNCTION dbo.IntervalCounts( @app AS VARCHAR(10), @starttime AS DATETIME2(0), @endtime AS DATETIME2(0)) RETURNS TABLEASRETURNWITH C1 AS( SELECT endtime AS ts, -1 AS increment, 1 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT starttime AS ts, 1 AS increment, 2 AS ord FROM dbo.Sessions WHERE app = @app AND starttime < @endtime AND endtime >= @starttime UNION ALL SELECT ts, 0 AS increment, 3 AS ord FROM dbo.TimeStamps WHERE ts >= @starttime AND ts < @endtime),C2 AS( SELECT ts, increment, SUM(increment) OVER(ORDER BY ts, ord ROWS UNBOUNDED PRECEDING) AS cnt FROM C1),C3 AS( SELECT DATEADD( hour, DATEDIFF(hour, @starttime, ts), @starttime ) AS starttime, cnt FROM C2 WHERE increment <> -1)SELECT starttime, DATEADD(hour, 1, starttime) AS endtime, MAX(cnt) AS mxFROM C3GROUP BY starttime;GO
To apply the function to all applications from the Apps table, you use the APPLY operator. Here's an example using the small set of sample data, with an input period that starts at 8:00 a.m. February 12, 2013, and ends at 5:00 p.m.February 12, 2013:
SELECT A.app, IC.*FROM dbo.Apps AS A CROSS APPLY dbo.IntervalCounts(A.app, '20130212 08:00:00', '20130212 17:00:00') AS IC;
This code generates the desired result set shown earlier in Figure 2.
After the tables are populated with the large set of sample data, here's an example for using the function with an input period that starts January 1, 2013 (inclusive), and ends February 1, 2013 (exclusive):
SELECT A.app, IC.*FROM dbo.Apps AS A CROSS APPLY dbo.IntervalCounts(A.app, '20130101', '20130201') AS IC;
Figure 3 shows the plan for this query (using SQL Sentry's Plan Explorer).
Figure 3: Serial Plan for Solution in Listing 5
This query took 13 seconds to complete on my machine. This plan isn't bad, but there's room for improvement in a couple of areas. First, as you can see, SQL Server decided to use a serial plan. Without a doubt, parallelism can help here to shorten the run time. Second, because the grouping is based on a computation, the optimizer doesn't rely on index order to compute the aggregate; instead, it uses a hash match aggregate. Hashing requires a memory grant for the query execution (as does sorting, when the optimizer sorts before using a stream aggregate).
In "Intervals and Counts, Part 2," I showed a trick that causes the optimizer to use a parallel plan. You add an artificial cross join to the query, like so:
DECLARE @n AS BIGINT = 1;SELECT A.app, IC.*FROM dbo.Apps AS A CROSS APPLY dbo.IntervalCounts(A.app, '20130101', '20130201') AS IC CROSS JOIN (SELECT TOP (@n) * FROM dbo.Apps) AS BOPTION (OPTIMIZE FOR (@n = 100));
This time I got the parallel query plan that Figure 4 shows.
Figure 4: Parallel Plan for Solution in Listing 5
With the parallel plan, the query finished in 7 seconds on my system. As for avoiding sorting and hashing, there's a way to achieve this as well—which I discuss in the next section.
Avoiding Sorting and Hashing
To avoid sorting and hashing for the aggregate computation, you need three things:
Computed columns called fstartime and fendtime holding the floored (to the beginning of the hour) starttime and endtime values, respectively:
ALTER TABLE dbo.Sessions ADD fstarttime AS DATEADD( hour, DATEDIFF(hour, CONVERT(DATETIME2(0), '19000101', 112), starttime), CONVERT(DATETIME2(0), '19000101', 112) ), fendtime AS DATEADD( hour, DATEDIFF(hour, CONVERT(DATETIME2(0), '19000101', 112), endtime), CONVERT(DATETIME2(0), '19000101', 112) );
Indexes similar to idx_start and idx_end, but with the column holding the floored time preceding the column holding the original time. Namely, fstarttime before startime, and fendtime before endtime:
CREATE UNIQUE INDEX idx_fstart ON dbo.Sessions(app, fstarttime, starttime, keycol) INCLUDE(endtime);CREATE UNIQUE INDEX idx_fend ON dbo.Sessions(app, fendtime, endtime, keycol) INCLUDE(starttime);
Adjust the implementation of the IntervalCounts function using the new fstarttime and fendtime columns, as Listing 6 shows.
IF OBJECT_ID(N'dbo.IntervalCounts', N'IF') IS NOT NULL DROP FUNCTION dbo.IntervalCounts;GOCREATE FUNCTION dbo.IntervalCounts( @app AS VARCHAR(10), @starttime AS DATETIME2(0), @endtime AS DATETIME2(0)) RETURNS TABLEASRETURNWITH C1 AS( SELECT fendtime AS fts, endtime AS ts, -1 AS increment, 1 AS ord FROM dbo.Sessions WHERE app = @app AND fstarttime < @endtime AND fendtime >= @starttime UNION ALL SELECT fstarttime AS fts, starttime AS ts, 1 AS increment, 2 AS ord FROM dbo.Sessions WHERE app = @app AND fstarttime < @endtime AND fendtime >= @starttime UNION ALL SELECT ts AS fts, ts, 0 AS increment, 3 AS ord FROM dbo.TimeStamps WHERE ts >= @starttime AND ts < @endtime),C2 AS( SELECT fts, increment, SUM(increment) OVER(ORDER BY fts, ts, ord ROWS UNBOUNDED PRECEDING) AS cnt FROM C1)SELECT fts AS starttime, DATEADD(hour, 1, fts) AS endtime, MAX(cnt) AS mxFROM C2WHERE increment <> -1GROUP BY fts;GO
You add the floored columns to the SELECT lists of the first two queries in the body of C1, naming the resulting column fts (for floored timestamp). As for the third query, which returns the dummy events by querying the TimeStamps table, the ts column is already the beginning of the hour, so you simply add it as the fts column.
Then, you entirely skip the step that computes the floored timestamps in the previous solution in Listing 5, because the fts column in the new solution in Listing 6 already contains the floored timestamps. The CTE C2 in the new solution handles the computation of the running total increment as the current count, only this time preceding ts by fts in the window order clause. This allows the optimizer to perform ordered scans of the indexes idx_fstart and idx_fend and rely on that order in the last step implemented by the outer query computing the maximum count for each hourly group.
Run the following code to test the new solution against the large set of sample data:
SELECT A.app, IC.*FROM dbo.Apps AS A CROSS APPLY dbo.IntervalCounts(A.app, '20130101', '20130201') AS IC;
I got the plan that Figure 5 shows.
Figure 5: Plan for Solution in Listing 6
There are two interesting things to observe about this plan. First, there's no sorting or hashing used to compute the aggregate; instead, the optimizer uses a stream aggregate based on the existing order of the data from the indexes. Second, SQL Server chose a parallel plan without you needing to resort to any awkward tricks. The run time I got for this solution was 7 seconds on my system—slightly disappointing, since it's the same run time that I got for the previous solution after applying the trick that resulted in a parallel plan. However, this plan still has advantages: You don't need any tricks to obtain a parallel plan, and the query doesn't need a special memory grant for sorting and hashing.
Part 4 Still To Come
In this article, I continued my coverage of querying tasks involving intervals and counts. This time, the task was to compute the maximum count of concurrent sessions within each fixed hourly interval. I demonstrated a solution that generates a chronological sequence of events, including dummy events ensuring that the result will contain a slot for every hour in the input period. Then, I computed the counts using a window function that applies a running total aggregate. The first solution generated a serial plan. Using a trick that applies an artificial cross join, I managed to obtain a parallel plan. Finally, I showed a solution in which you add computed columns holding floored timestamps, indexes that include those columns, and a revision of the solution query to involve those columns. This allowed for a parallel plan without resorting to any awkward tricks and without any sorting and hashing to compute the maximum count aggregate for each hourly group.
Next month, I'll continue my coverage of tasks involving intervals and counts. Stay tuned for Part 4!
About the Author
You May Also Like