Packing Date Intervals

Itzik describes a technique to pack overlapping date intervals.

ITPro Today

April 30, 2010

18 Min Read
ITPro Today logo

SQL Server does not support a native temporal interval data type, nor does it support native relational operations on intervals, e.g., packing intervals. Packing of intervals means merging all intervals that overlap into one contiguous interval. As an example for a practical need for packing, consider the following Projects table and its sample data:

CREATE TABLE dbo.Projects

(

  projectid  INT          NOT NULL,

  title      VARCHAR(100) NOT NULL,

  start_date DATE         NOT NULL,

  end_date   DATE         NOT NULL

);

 

INSERT INTO dbo.Projects(projectid, title, start_date, end_date) VALUES

  (1, 'Project 1', '20100212', '20100220'),

  (2, 'Project 2', '20100214', '20100312'),

  (3, 'Project 3', '20100124', '20100201');

 

The task at hand is: given the start and end dates of an input period, return a row for each contiguous period where there was a project running. For example, given January 1st, 2010 as the start of the input period and December 31st as the end, the desired result is:

start_period end_period

------------ ----------

2010-01-24   2010-02-01

2010-02-12   2010-03-12

As usual with the T-SQL challenges that I provide, I urge you to first try and come up with your own solution before looking at mine. That’s the best way to develop your T-SQL skills.

As for my solution, it makes use of a table function called GetNums that accepts an integer input, and returns a virtual auxiliary table of numbers starting with 1 and ending with the input number. Here’s the code to create the function:

IF OBJECT_ID('dbo.GetNums') IS NOT NULL DROP FUNCTION dbo.GetNums;

GO

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE

AS

RETURN

  WITH

  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),

  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)

  SELECT TOP (@n) n FROM Nums ORDER BY n;

GO

 

The first step in the solution is to write a query that uses the GetNums function to generate a sequence of dates within the input period:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT dt

FROM Dates;

 

Output:

dt

----------

2010-01-01

2010-01-02

2010-01-03

...

2010-12-29

2010-12-30

2010-12-31
(365 row(s) affected)

 

The second step is to join the Projects and Dates tables, unpacking each project interval to the individual dates within it, like so:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT D.dt

FROM dbo.Projects AS P

  JOIN Dates AS D

    ON D.dt BETWEEN P.start_date AND P.end_date;

 

Output:

dt

----------

2010-01-24

2010-01-25

2010-01-26

2010-01-27

2010-01-28

2010-01-29

2010-01-30

2010-01-31

2010-02-01

2010-02-12

2010-02-13

2010-02-14

2010-02-14

2010-02-15

2010-02-15

2010-02-16

2010-02-16

2010-02-17

2010-02-17

2010-02-18

2010-02-18

2010-02-19

2010-02-19

2010-02-20

2010-02-20

2010-02-21

2010-02-22

2010-02-23

2010-02-24

2010-02-25

2010-02-26

2010-02-27

2010-02-28

2010-03-01

2010-03-02

2010-03-03

2010-03-04

2010-03-05

2010-03-06

2010-03-07

2010-03-08

2010-03-09

2010-03-10

2010-03-11

2010-03-12

 


The third step is to generate a group identifier for each consecutive range of dates (aka, Island). The point about the group identifier is to generate a value that is the same for all entries that belong to the same island, and different than the value produced for other islands. More generically, this step involves identifying islands in a temporal sequence with possible duplicates. You can find details about different variations of islands problems and their solutions in SQL Server MVP Deep Dives (Manning, 2009), Chapter 5 - Gaps and islands. This chapter happens to be the sample chapter of the book, so you can download it for free. (Of course, we do hope that you will like what you read so much that you will decide to purchase the book, and this way help with the efforts to contribute to War Child!)

Back to our problem, the solution to our variation of the islands problem can be described in two substeps. The first involves using the DENSE_RANK function to generate a sequence of dense rank values alongside the dt sequence, like so:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT D.dt, DENSE_RANK() OVER(ORDER BY D.dt) AS drnk

FROM dbo.Projects AS P

  JOIN Dates AS D

    ON D.dt BETWEEN P.start_date AND P.end_date;

 

Output:

dt         drnk

---------- --------------------

2010-01-24 1

2010-01-25 2

2010-01-26 3

2010-01-27 4

2010-01-28 5

2010-01-29 6

2010-01-30 7

2010-01-31 8

2010-02-01 9

2010-02-12 10

2010-02-13 11

2010-02-14 12

2010-02-14 12

2010-02-15 13

2010-02-15 13

2010-02-16 14

2010-02-16 14

2010-02-17 15

2010-02-17 15

2010-02-18 16

2010-02-18 16

2010-02-19 17

2010-02-19 17

2010-02-20 18

2010-02-20 18

2010-02-21 19

2010-02-22 20

2010-02-23 21

2010-02-24 22

2010-02-25 23

2010-02-26 24

2010-02-27 25

2010-02-28 26

2010-03-01 27

2010-03-02 28

2010-03-03 29

2010-03-04 30

2010-03-05 31

2010-03-06 32

2010-03-07 33

2010-03-08 34

2010-03-09 35

2010-03-10 36

2010-03-11 37

2010-03-12 38

 

Now look at the two sequences… the dt sequence has gaps, whereas the drnk sequence doesn’t. With this in mind, you can generate a group identifier by subtracting drnk times the interval of the temporal sequence (one day in our case) from the dt value. This way you will get the same date as a result for all members of the same island:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT D.dt,

  DATEADD(day, -1*DENSE_RANK() OVER(ORDER BY D.dt), D.dt) AS grp

FROM dbo.Projects AS P

  JOIN Dates AS D

    ON D.dt BETWEEN P.start_date AND P.end_date;

 

Output:

dt         grp

---------- ----------

2010-01-24 2010-01-23

2010-01-25 2010-01-23

2010-01-26 2010-01-23

2010-01-27 2010-01-23

2010-01-28 2010-01-23

2010-01-29 2010-01-23

2010-01-30 2010-01-23

2010-01-31 2010-01-23

2010-02-01 2010-01-23

2010-02-12 2010-02-02

2010-02-13 2010-02-02

2010-02-14 2010-02-02

2010-02-14 2010-02-02

2010-02-15 2010-02-02

2010-02-15 2010-02-02

2010-02-16 2010-02-02

2010-02-16 2010-02-02

2010-02-17 2010-02-02

2010-02-17 2010-02-02

2010-02-18 2010-02-02

2010-02-18 2010-02-02

2010-02-19 2010-02-02

2010-02-19 2010-02-02

2010-02-20 2010-02-02

2010-02-20 2010-02-02

2010-02-21 2010-02-02

2010-02-22 2010-02-02

2010-02-23 2010-02-02

2010-02-24 2010-02-02

2010-02-25 2010-02-02

2010-02-26 2010-02-02

2010-02-27 2010-02-02

2010-02-28 2010-02-02

2010-03-01 2010-02-02

2010-03-02 2010-02-02

2010-03-03 2010-02-02

2010-03-04 2010-02-02

2010-03-05 2010-02-02

2010-03-06 2010-02-02

2010-03-07 2010-02-02

2010-03-08 2010-02-02

2010-03-09 2010-02-02

2010-03-10 2010-02-02

2010-03-11 2010-02-02

2010-03-12 2010-02-02

 

The actual date that you get as a result is not really significant; what’s important is that it meets the two requirements that we have for our group identifier: 1. The value must be the same for all members of the same island. 2. The value must be different than the values produced for other islands.

The final step in the solution is simple—group the result by the grp value, and return the minimum and maximum dt values per group:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

),

Groups AS

(

  SELECT D.dt,

    DATEADD(day, -1*DENSE_RANK() OVER(ORDER BY D.dt), D.dt) AS grp

  FROM dbo.Projects AS P

    JOIN Dates AS D

      ON D.dt BETWEEN P.start_date AND P.end_date

)

SELECT MIN(dt) AS start_period, MAX(dt) AS end_period

FROM Groups

GROUP BY grp;

 

Output:

start_period end_period

------------ ----------

2010-01-24   2010-02-01

2010-02-12   2010-03-12

 

Cheers,

BG

 

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like