Temporal Histograms
Dealing with temporal data can be tricky. Create a temporal histogram to show the distribution of events over time by first relaxing some of the requirements to solve a specific case, then enhancing the solution to make it more generic.
April 28, 2008
Recently I received a request from a customer to come up with a solution that produces temporal histograms—histograms showing the distribution of events over time periods. The problem was an interesting challenge and seemed like a generic need, so I decided to cover it in my column. First, I’ll explain the problem in terms of inputs and desired output. Next, I’ll provide a solution that handles only a specific case of the problem. Then, I’ll show you how to enhance the solution to make it more generic.
The Challenge
Suppose that you have a table called Events in your database, containing information about events in time. These events can be appointments, sessions, or anything that has start and end points in time. The Events table has three columns: event_id is the primary key, event_start is the start point in time of the event, and event_end is the end point. Run the code in Web Listing 1 to create the Events table in the tempdb database and populate it with sample data. You need to write a table function that accepts the following inputs:
@from_dt —start point of a datetime range @to_dt—end point of a datetime range @date_part—a datetime part from the enumeration: 'minute', 'hour', 'day', 'week', 'month', 'quarter', 'year' @num_parts—the number of datetime parts to be covered in each step
The table function should produce a histogram showing the number of active events during each fixed interval of time within the requested datetime range. The intervals of time, or steps, are based on the input @date_part and @num_parts.
The problem is best explained through an example. Given the inputs @from_dt = ‘20080501 00:00’, @ to_dt = ‘20080511 00:00’, @date_part = ‘day’, @ num_parts = 1, you’re supposed to produce the output in Web Table 1. Each row in the output represents a different fixed interval of time (from_dt - to_dt) within the requested datetime range. Because the requested date part is ‘day’, and the number of parts is 1, each step in the histogram represents one day. The fourth column in the output (num_events) holds the count of events from the Events table that were active during the current interval.
Regarding the histogram step boundary points, one of the requirements from my customer was to produce round points in time (round in respect to the input @ datepart), except for the extreme boundary points that must be the ones provided by the user. For example, given the inputs @from_dt = ‘20080501 12:30’, @to_dt = ‘20080510 10:00’, @date_part = ‘day’, @num_parts = 1, the step boundary points should be those in Web Table 2. Notice that the first step’s low boundary point is 2008-05-01 12:30:00.000 and the last step’s high boundary point is 2008-05-10 10:00:00.000, whereas all the other step boundary points represent whole days (in terms of day units).
Producing the Histogram Steps
You can start by creating a table function (call it fn_HistSteps) that returns the histogram steps table based on the previously mentioned input parameters. The function will return a row for each step with the step boundary points. Once defined, you can join the function with the Events table to match steps and events, group the result of the join by step, and count the number of events in each step.
So that you don’t have to deal with too many aspects of the problem at once, you can first relax some of the requirements. For example, take the @date_part and @num_parts inputs out of the equation, and solve the task for a specific interval—say, one day. After you manage to solve the problem for a specific interval, you can add the logic required to handle the requested @ date_part and @num_parts inputs.
In my solution I used an auxiliary table of numbers that you create and populate by running the code in Web Listing 2. This code creates a table called Nums with a single column called n, and populates the table with integers in the range 1 through 1,000,000.
To create the first version of the fn_HistSteps function, run the code in Web Listing 3. The function is an inline table-valued function based on a single query with multiple common table expressions (CTEs). The first CTE defined by the function’s code is called C0 and it has two columns: floor_from_dt and diff. The former is a floor of the input @from_dt value in terms of day units; that is, midnight of the input @from_dt value. The latter is the number of days in the range @from_dt - @to_dt.
The second CTE is called C1; it’s in charge of producing steps with round boundary points. This task is achieved by joining Nums and Steps, and returning all n values that are smaller than or equal to diff (number of days in the input range). The starting point of each step (from_dt) is calculated by adding n-1 days to floor_from_dt, and the ending point of each step (to_ dt) is calculated by adding n days to floor_from_dt.
The third CTE is called C2; it’s in charge of adjusting the extreme boundary points (start point of first step and end point of last step) if they need adjustment. Remember that the previous CTE (C1) produced round boundary points, although the requirement was that the extreme boundary points would be those provided by the user as the input datetime range boundary points. Note that as a result of adjusting the extreme boundary points, C2 might end up with rows representing irrational steps where to_dt isn’t greater than from_dt. Those rows will be eliminated by the outer query. The outer query simply returns all rows from C2 representing the histogram steps, excluding the irrational steps produced by a previous CTE.
To test the function, run the following code:
SELECT * FROM dbo.fn_HistSteps('20080501 00:00', '20080511 00:00') AS S ORDER BY n;
You should get the steps shown in Web Table 1, without the num_events column.
To test the function with nonround range boundary points, query it with the following inputs:
SELECT * FROM dbo.fn_HistSteps('20080501 12:30', '20080510 10:00') AS S ORDER BY n;
You should get the steps shown in Web Table 2; again, excluding the num_events column.
Now that your function works for one-day intervals, you can add logic to support a requested date part (@date_part) and number of parts (@num_parts). The revision to the function isn’t complicated. You need to substitute all expressions that currently use the date part day with a CASE expression that uses the requested date part. Also, when calculating diff, you’ll to need divide the value by @num_parts as part of the expression. To create the revised fn_HistSteps function, run the code in Web Listing 4. Note that if the function is invoked with an unrecognized part, the CASE expressions will default to ELSE NULL, the query filter will filter out all rows, and the function will return an empty set.
Now you can specify the date part and the number of parts as inputs. For example, to get a steps table for the range ‘20080501 00:00’ - ‘20080502 00:00’, with four-hour intervals, you’d query the function as follows:
SELECT * FROM dbo.fn_HistSteps('20080501 00:00', '20080502 00:00', 'hour', 4) AS S ORDER BY n;
Producing the Actual Histogram
Most of the work is now behind you; what’s left is to join the fn_HistSteps function with the Events table to match steps and events, group the results by step, and return the count of active events in each step. To check whether an event (starting at event_start and ending at event_end) overlaps with a step (starting at from_dt and ending at to_dt), you can use the following predicate:
event_start < to_dt AND event_end > from_dt
Note that you can use <= instead of <, and >= instead of > depending on how you want to treat the boundary point itself (inclusive versus exclusive). Here’s the full query that would give you a daily histogram for the range ‘20080501 00:00’ - ‘20080511 00:00’, producing the output in Web Table 1:
SELECT n, from_dt, to_dt, COUNT(event_id) AS num_events FROM dbo.fn_HistSteps('20080501 00:00', '20080511 00:00', 'day', 1) AS S LEFT OUTER JOIN dbo.Events AS E ON E.event_start < S.to_dt AND E.event_end > S.from_dt GROUP BY n, from_dt, to_dt ORDER BY n;
Notice that an OUTER JOIN is used here instead of an INNER JOIN in order to return empty steps (steps/ intervals with a num_events value of 0) as well.
The following query returns a daily histogram for the range ‘20080501 12:30’ - ‘20080510 10:00’, producing the output in Web Table 2:
SELECT from_dt, to_dt, COUNT(event_ id) AS num_events FROM dbo.fn_HistSteps('20080501 12:30', '20080510 10:00', 'day', 1) AS S LEFT OUTER JOIN dbo.Events AS E ON E.event_start < S.to_dt AND E.event_end > S.from_dt GROUP BY from_dt, to_dt ORDER BY from_dt;
And finally, the following query returns a fourhour step histogram for the range ‘20080501 00:00’ - ‘20080502 00:00’, producing the output in Web Table 3:
SELECT n, from_dt, to_dt, COUNT(event_id) AS num_events FROM dbo.fn_HistSteps('20080501 00:00', '20080502 00:00', 'hour', 4) AS S LEFT OUTER JOIN dbo.Events AS E ON E.event_start < S.to_dt AND E.event_end > S.from_dt GROUP BY n, from_dt, to_dtORDER BY n;
One Step at a Time
Dealing with temporal data can be quite tricky. In fact, I’ve written extensively about datetime manipulation (see the Learning Path). When you face such challenges in which the solution isn’t trivial, it’s important to break the problem into steps as I’ve done in this article. In addition, when handling all of a task’s requirements at once is too complicated, a useful approach is to relax some of the requirements, solve a simpler form of the problem, then reintroduce the complexity layers.
About the Author
You May Also Like