Grouping Time Intervals
Try this set-based trick to solve a common problem
January 23, 2002
Some T-SQL problems seem to require cursor-based solutions but in fact have quite a few set-based solutions, as I've shown in recent columns. This month, let's look at a more complex example that programmers usually try to solve with cursors. You might have even had to handle this problem in one variation or another because it represents a common need: grouping time intervals. I'd like to thank Isaac Blank for sharing a beautiful solution, which I present in this article along with some other solutions.
To understand what grouping time intervals means, let's jump into a puzzle. After you've read the puzzle details and before you read the solutions, try to solve the puzzle by using the techniques I discussed in my previous two columns, "Matching Transactions," December 2001, InstantDoc ID 22956, and "Sequential to Set-Based," November 2001, InstantDoc ID 22431. Doing so will give you an appreciation of the difficulties of this type of problem.
Here's the puzzle. A table called Sessions stores starting and ending times of sessions that users open against a certain application—in this case, SQL Server. For each session, the Sessions table stores the username, session start time, and session end time. Note that those three attributes aren't necessarily unique; a user can open or close more than one session at the same time. To identify a session uniquely, I added to the table a surrogate key that gets its values from the identity property. To create and populate the Sessions table, run the script that Web Listing 1 shows.
Each row keeps track of one session. All individual sessions that together make up a consecutive time period with no breaks belong to the same session group. Your task is to write a query that returns the username, start time, and end time of each session group. You might notice a big warning sign at the beginning of the road that leads to the puzzle solution: "Danger! The road ahead is dizzying!" To keep the facts straight in your mind, consult Figure 1, which graphically shows each session as a green bar and each session group as a red arrow.
The First Solution
The first step in solving the problem is to retrieve the starting times of each session group. A certain row has the starting time of a session group if the session started after the maximum closing time among all the sessions that the same user opened earlier. This definition translates to a subquery in the WHERE clause. Listing 1 shows the T-SQL query, which uses the DISTINCT clause because several sessions can have the same start time and I don't want to return duplicates. Table 1 shows the output of Listing 1's query.
Note that three rows are missing from the output. The missing rows contain the starting times of each user's first session, which also happen to be the starting times of their session groups. Remember that the query in Listing 1 looks for rows that have a start time greater than the end time of a session opened earlier by the same user. A user's first session by definition has no earlier session; therefore, the subquery returns NULL. The logical expression NULL < starttime evaluates to unknown, and SQL Server doesn't return that row.
Now let's look at a simple trick you can use to make sure that the rows representing a user's first session are also returned. Listing 2 shows the query that returns the starting times of all the session groups. The trick is to use the ISNULL() function. If the subquery returns NULL, the query replaces the NULL with starttime -1, which is less than starttime. Cross-check the starting times in Listing 2's output, which Table 2 shows, with those in Figure 1.
The second and final step in solving this puzzle is to attach the appropriate end times to each row in the previous output, as the query in Web Listing 2 shows. Of all the user's session-group end times, the one you need to attach to a certain start time is the earliest one that ended at or after that start time. You can achieve this goal by using a subquery in the SELECT list.
To find the session-group end times, you need to nest another level inside the subquery in the SELECT list and use the technique I used in Listing 2 to obtain the start times. The innermost subquery makes sure that the end time is earlier than the minimum start time of the sessions the user opened and closed after that end time. I used the ISNULL() function here for a similar reason I used it earlier; here, it ensures that the end time of the last session the user closed is also included as a session group end time. Table 3 shows the output of Web Listing 2's query. Again, you can cross-check this output with Figure 1.
The Second Solution
Blank took a similar complex problem and made it fairly simple by an elegant use of views. The solution I present here is a slight revision of Blank's original solution. The first view, V_Start, returns the usernames and session group start times. You can create the V_Start view by running the script that Listing 3 shows.
The V_Start view returns all session start times in which no session by the same user started earlier and ended at or after that start time. In short, it returns all session-group start times. If you issue a SELECT * against the V_Start view, you'll get the output that Table 2 shows. The second view—V_End—returns usernames and session-group end times. The V_End view is similar to V_Start in that it returns all session end times, in which no session by the same user ended later and started at or before that start time. You can run the script that Listing 4 shows to create the V_End view. Issuing a SELECT * against the V_End view produces the output that Table 4 shows.
Now you just need to run the query in Listing 5 to get the desired results. The query selects all rows from the V_Start view, then uses a subquery in the SELECT list to retrieve the matching session-group end times. From V_End, the subquery returns the earliest session-group end time with the same user as in the outer query, along with an end time that's later than the start time in the outer query.
This solution differs from the first in two main respects. First, it uses a modular approach in which the initial problem is split into three subproblems: first, retrieving each session group's start time; second, retrieving each session group's end time; and third, matching session group start times to end times. The second difference is that this solution uses an existence check to track start and end times, whereas the previous solution uses subqueries that calculate aggregates. Queries that use the EXISTS() predicate are usually very efficient and easier to understand than their alternatives. In this case, however, both queries perform almost the same.
Ignoring Spaces
Some grouping-time-intervals problems have slightly more complex requirements than the ones I've discussed so far. For example, you might want to ignore spaces of up to a certain amount of time in which a user had no active session. In the session groups example, you could decide that if a certain session started no later than 5 minutes after another session ended, both sessions belong to the same session group. This rule requires minor revisions to the first solution I presented in Web Listing 2. In the subquery in the WHERE clause, you need to add 5 minutes to MAX(endtime) by using the DATEADD() function before you check whether the MAX(endtime) is earlier than the current start time. If it's earlier, the current start time is a session-group start time. Similarly, in the subquery in the SELECT list, you subtract 5 minutes from MIN(starttime) before you check whether that time is later than the current end time. Web Listing 3 shows the revised solution, with the revisions I made to Web Listing 2 highlighted in red. Note that in the output that Table 5 shows, the last three sessions User2 made are now grouped together in the same session group.
Similarly, you'd need to revise the V_Start and V_End views to implement the new requirements. You can use the DATEADD() function similarly to the way I used it in the first solution to add or subtract 5 minutes where appropriate. You can revise the views by running the script in Web Listing 4. You don't need to change the query that retrieves the data from the V_Start and V_End views. To check whether the revised views meet the new requirements, simply run the query that Listing 5 shows.
Wrap Up
I compared the performance of the two solutions for grouping time intervals and found that the first solution performs slightly better. However, the second solution is undoubtedly simpler to understand. With only a slight difference in performance, I'd choose the second solution over the first in a production environment. Next time you face a problem in T-SQL and you're thinking about using cursors, first try the techniques I demonstrated in my past three columns. And when you're programming in T-SQL, always think "set based."
About the Author
You May Also Like