Calculating Concurrent Sessions, Part 1
Discover a set-based and a cursor-based solution
October 14, 2009
When you’re faced with a T-SQL querying problem, finding a set-based solution (as opposed to a cursor-based solution) is recommended for a number of reasons. First, set-based solutions are in accord with the relational model. Also, they typically require less code and less maintenance than cursor-based solutions. In addition, set-based solutions tend to perform better.
For many years, I couldn’t find a good-performing set-based solution for a particular querying problem that involves maximum concurrent sessions. I had an adequate set-based solution in terms of its logic, but it didn’t perform well. The only good-performing solution I had was cursor-based. I recently used this problem as an exercise in a class I was teaching, and one of my students—Darryl Page from the UK—had an idea that ultimately led to a good-performing set-based solution. In this article, I cover my original set-based solution, the cursor-based solution, and the performance and scaling ramifications of both solutions. Next month I’ll present the improved set-based solution based on Darryl’s insight, as well as a set-based solution that relies on standard language elements that aren’t yet implemented in SQL Server (as of SQL Server 2008) but hopefully will be supported in a future version.
The Problem
The task at hand involves a table called Sessions that keeps track of user sessions against an application. Suppose that every month you need to produce a report for the previous month’s activity, including the maximum number of concurrent sessions for each application. That is, for each application, you must calculate the maximum number of sessions that were active at the same time during the period of interest. You might need such information, for example, for billing purposes for per-user license types. Use the code in Listing 1 to create the Sessions table in the tempdb database (for test purposes) and populate it with sample data.
SET NOCOUNT ON;USE tempdb; IF OBJECT_ID('dbo.Sessions', 'U') IS NOT NULL DROP TABLE dbo.Sessions; CREATE TABLE dbo.Sessions( keycol INT NOT NULL, app VARCHAR(10) NOT NULL, usr VARCHAR(10) NOT NULL, host VARCHAR(10) NOT NULL, starttime DATETIME NOT NULL, endtime DATETIME NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(keycol), CHECK(endtime > starttime));GO CREATE INDEX idx_nc_app_st_et ON dbo.Sessions(app, starttime, endtime);CREATE INDEX idx_nc_app_et_st ON dbo.Sessions(app, endtime, starttime); INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(2, 'app1', 'user1', 'host1', '20090212 08:30', '20090212 10:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(3, 'app1', 'user2', 'host1', '20090212 08:30', '20090212 08:45');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(5, 'app1', 'user3', 'host2', '20090212 09:00', '20090212 09:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(7, 'app1', 'user4', 'host2', '20090212 09:15', '20090212 10:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(11, 'app1', 'user5', 'host3', '20090212 09:15', '20090212 09:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(13, 'app1', 'user6', 'host3', '20090212 10:30', '20090212 14:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(17, 'app1', 'user7', 'host4', '20090212 10:45', '20090212 11:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(19, 'app1', 'user8', 'host4', '20090212 11:00', '20090212 12:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(23, 'app2', 'user8', 'host1', '20090212 08:30', '20090212 08:45');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(29, 'app2', 'user7', 'host1', '20090212 09:00', '20090212 09:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(31, 'app2', 'user6', 'host2', '20090212 11:45', '20090212 12:00');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(37, 'app2', 'user5', 'host2', '20090212 12:30', '20090212 14:00');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(41, 'app2', 'user4', 'host3', '20090212 12:45', '20090212 13:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(43, 'app2', 'user3', 'host3', '20090212 13:00', '20090212 14:00');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(47, 'app2', 'user2', 'host4', '20090212 14:00', '20090212 16:30');INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime) VALUES(53, 'app2', 'user1', 'host4', '20090212 15:30', '20090212 17:00');GO
For simplicity, assume that the Sessions table contains only sessions during the period of interest, so you don’t need to filter a specific date range. Note that you should explicitly define the term concurrent for your customers. For example, if a session ends at the exact point in time that another session starts, are the sessions considered concurrent? For the purposes of our task, let’s assume that two such sessions are not considered concurrent. In this case, Table 1 shows the desired output for your solution, given the sample data in Listing 1.
Table 1: Desired Output from Solution
app | mx |
---|---|
app1 | 4 |
app2 | 3 |
The indexing strategy that would help speed up both a set-based and a cursor-based solution is quite straightforward. Create one index on the key-list (app, starttime, endtime) and another on (app, endtime, starttime). Note that you must populate the Sessions table with a larger volume of sample data if you want to run performance tests. Use the code in Listing 2 to create a helper table function called GetNums that returns a sequence of integers of a requested size.
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5) SELECT TOP(@n) n FROM Nums ORDER BY n;GO
Use the code in Listing 3 to populate the Sessions table with a requested number of rows by querying the GetNums helper function. The code in Listing 3 populates the Sessions table with 10,000 rows, but you can adjust the number according to your needs by assigning your desired value to the variable @numrows. The sessions are associated with 10 different applications using randomization (1 + ABS(CHECKSUM(NEWID())) % 10), so about one tenth of the rows are associated with each application. I didn’t make the number of partitions (applications) variable, because adding partitions has a linear effect on all solutions (linear complexity), as I will show. A more interesting topic for discussion is how changing the partition size affects the different solutions’ performance. Changing the value of @numrows changes the partition size, because the number of partitions remains constant (10).
SET NOCOUNT ON;USE tempdb; TRUNCATE TABLE dbo.Sessions; DECLARE @numrows AS INT;SET @numrows = 10000; -- Change this value according to your needs INSERT INTO dbo.Sessions WITH(TABLOCK) (keycol, app, usr, host, starttime, endtime) SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol, D.*, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (20*60), starttime) AS endtime FROM ( SELECT 'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % 10 AS VARCHAR(10)) AS app, 'user1' AS usr, 'host1' AS host, DATEADD( second, 1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60), '20090101') AS starttime FROM dbo.GetNums(@numrows) AS Nums WHERE n <= @numrows ) AS D;GO
Note that the code in Listing 3 uses randomization to generate sessions that start during the month of January 2009 and that last up to 20 minutes. It’s important that you not produce sample data by merely copying a small set of rows many times, because the optimizer sometimes uses methods that reuse previously calculated results when the value for which the calculation is performed doesn’t change. Therefore, a solution’s performance that is calculated with such sample data might not adequately reflect your actual system’s performance.
Original Set-Based Solution
Listing 4 shows the original set-based solution that I used to address the problem. The main idea behind this solution is that the number of concurrent sessions changes only when a session either starts or ends. Between such events, the number of concurrent sessions doesn’t change. Moreover, you can consider the start of a session as an event that increases the count of concurrent sessions and the end of a session as an event that decreases the count. Therefore, the maximum number of concurrent sessions must take place at a point when one of the sessions starts.
WITH AppTimestamps AS( SELECT app, starttime AS ts FROM dbo.Sessions),Counts AS( SELECT app, (SELECT COUNT(*) FROM dbo.Sessions AS S WHERE T.app = S.app AND T.ts >= S.starttime AND T.ts < S.endtime) AS cnt FROM AppTimestamps AS T)SELECT app, MAX(cnt) AS mxFROM CountsGROUP BY app;
The solution in Listing 4 starts by defining a common table expression (CTE) called AppTimestamps that returns the app and starttime attributes from all sessions. Then the code defines a CTE called Counts based on a query against AppTimestamps. The query against AppTimestamps invokes a scalar subquery expression aliased as cnt that for each outer row counts the number of concurrent sessions for the current application, at the current timestamp. Remember that in this case we’re not considering sessions that end at the given timestamp. Therefore, the predicate in the subquery isn’t T.ts BETWEEN S.starttime AND S.endtime; instead, it’s T.ts >= S.starttime AND T.ts < S.endtime. Finally, the outer query groups the rows from the CTE called Counts by application, then returns the maximum count for each application.
This solution is logically sound, but it’s quite slow when you run it with large partitions. In addition, rather than scaling linearly, it scales in a quadratic manner (N2). Examining the execution plan for this solution, which Figure 1 shows, reveals the performance problem.
I’ll use the letters p and r to represent the number of partitions (applications) involved and the number of rows per partition, respectively. The plan starts with a full scan of the index on (app, endtime, starttime) to retrieve the application and starttime values from all rows (for the CTE AppTimestamps). The total number of rows scanned is about p × r. Next, for each row returned from the Index Scan operator, the plan invokes a seek and partial scan of the index on (app, starttime, endtime). This work is represented by the Index Seek operator.
Observe the Seek Predicates (aka SEEK:() predicate) and Predicate (aka WHERE:() predicate) in the properties of the Index Seek operator. SQL Server Books Online explains the difference between the two: “The storage engine uses the index to process only those rows that satisfy the SEEK:() predicate. It optionally may include a WHERE:() predicate, which the storage engine will evaluate against all rows that satisfy the SEEK:() predicate (it does not use the indexes to do this).”
This means that all rows satisfying the Seek Predicates will be scanned, and SQL Server will then examine each row, evaluating the Predicate to determine whether or not to return the row in the operator’s output. So in practice, for each row returned from the Index Scan operator, the Index Seek operator will end up scanning all rows where the app value is the same as in the outer row, and the starttime value is smaller than or equal to the timestamp value in the outer row. This means that for each outer row, the Index Seek operator will scan at the leaf about r/2 rows on average. So the total number of rows scanned by both the Index Scan and Index Seek operators is: p × r + p × r × (r/2), which is equal to p × r + p × r2/2. So, if the number of rows per partition increases by a factor of f, the total number of rows processed increases by close to a factor of f2. The rest of the work in the execution plan involves calculating the aggregates (count, and later the maximum count).
I provide actual benchmark numbers in a later section, but you can see that this solution is quite inefficient. I wouldn’t use it unless I were dealing with very small partitions.
Cursor-Based Solution
Listing 5 shows my cursor-based solution.
DECLARE @app AS VARCHAR(10), @prevapp AS VARCHAR (10), @ts AS datetime, @event_type AS INT, @cnt AS INT, @mx AS INT; DECLARE @Result AS TABLE(app VARCHAR(10), mx INT); DECLARE C CURSOR FAST_FORWARD FOR SELECT app, starttime AS ts, 1 AS event_type FROM dbo.Sessions UNION ALL SELECT app, endtime, -1 FROM dbo.Sessions ORDER BY app, ts, event_type; OPEN C; FETCH NEXT FROM C INTO @app, @ts, @event_type;SELECT @prevapp = @app, @cnt = 0, @mx = 0; WHILE @@FETCH_STATUS = 0BEGIN IF @app <> @prevapp BEGIN INSERT INTO @Result VALUES(@prevapp, @mx); SELECT @prevapp = @app, @cnt = 0, @mx = 0; END SET @cnt = @cnt + @event_type; IF @cnt > @mx SET @mx = @cnt; FETCH NEXT FROM C INTO @app, @ts, @event_type;END IF @prevapp IS NOT NULL INSERT INTO @Result VALUES(@prevapp, @mx); CLOSE C; DEALLOCATE C; SELECT * FROM @Result;GO
This solution also relies on the idea that the start of a session increases the count of concurrent sessions, whereas the end of a session decreases the count. The most important part of this solution is the query that the cursor is based on:
SELECT app, starttime AS ts, 1 AS event_typeFROM dbo.SessionsUNION ALLSELECT app, endtime, -1FROM dbo.SessionsORDER BY app, ts, event_type;
The query returns all start and end events, with a +1 value representing a start event type, and a -1 value representing an end event type. The query orders the rows by application, timestamp, and event type. It’s obvious why you would want to order by application and timestamp. The reason for adding the event type (-1 or +1) to the ORDER BY list is to ensure that if one session ends when another starts you won’t consider them as concurrent (-1 sorts before +1).
The rest of the code simply scans the cursor rows in order, calculating a running total of the event type within the current application, which is the number of concurrent sessions at that point. The code keeps the maximum count in a variable, and when it’s done processing an application, the code inserts a row with the application and maximum count into a table variable. Ultimately, the code queries the table variable to return the results.
The execution plan in Figure 2 shows the query that the cursor-based solution is based on. Remember that you created two indexes to support your solutions—one on (app, starttime, endtime), and another on (app, endtime, starttime). Notice in the plan that both indexes are scanned in order, then the plan uses a Merge Join (Concatenation) operator to unify the rows returned from both Index Scan operators and return them in the requested order. This Merge Join operator relies on index ordering and therefore avoids the need to sort the data. I was surprised when I realized that the optimizer could achieve this without adding a Sort operator to the plan. This means that for p partitions with r rows each, you get a total of 2 × p × r rows scanned, and no extra cost due to sorting. Therefore, if the number of rows per partition increases by a factor of f, the number of rows processed also increases by a factor of f. Thus, the cursor-based solution has linear scaling. Keep in mind that processing each row with a cursor-based solution is more expensive than with a set-based solution. However, in our case the set-based solution has quadratic complexity, whereas the cursor-base solution has linear complexity.
Benchmark Results
I ran benchmarks to test my set-based and cursor-based solutions. I used sample data with 10 partitions and a total number of rows ranging from 10,000 (1,000 per partition) to 100,000. Figure 3 shows the results of these benchmark tests.
You can clearly see the linear complexity of the cursor-based solution and the quadratic complexity of the set-based solution. This means that this particular set-based solution is efficient only compared with the cursor-based solution, for very small partitions (about a few hundred rows per partition). The cursor-based solution is more efficient for large partitions, because it scales better.
The Best Is Yet To Come
For years, I pondered a better set-based solution to the maximum concurrent sessions problem. My original set-based solution performs inefficiently, especially with large partitions. My cursor-based solution performs reasonably well and scales linearly; however, it still has the basic cursor-based solution disadvantages (i.e., it isn’t in accord with the relational model, and it has problematic readability and maintenance). Next month I’ll cover a new and better-performing set-based solution with linear scaling. In addition, I’ll discuss a standard solution based on windowing aggregates that isn’t yet supported in SQL Server but hopefully will be in the future. In the meantime, let me know if you come up with a good-performing set-based solution to the problem.
About the Author
You May Also Like