T-SQL Interval Graphs Challenge, Part 2
Complete solution to the drug prescription interval packing problem
April 18, 2014
In "T-SQL Interval Graphs Challenge, Part 1," I introduced a puzzle that was originally given to me by John Paul Cook, who is a registered nurse and a fellow SQL Server MVP. The puzzle involves counting drug exposure based on data modeled as an interval graph. You can find a full description of the puzzle in "Counting Drug Exposure in SAS® with Interval Graph Modeling" by Xiaoqiang Wang. In this paper, Wang explains the puzzle and how to use SAS to solve it. John's challenge to me was to use a set-based T-SQL solution to solve the puzzle.
The puzzle consists of two parts. I covered the first part of the puzzle in "T-SQL Interval Graphs Challenge, Part 1"; I'll cover the second part of the puzzle in this article. In addition to John, I'd also like to thank Alejandro Mesa, Paul White, Adam Machanic, Peter Larsson, and Geri Reshef for their input.
Last Month's Challenge
The puzzle involves data representing drug prescriptions. Run the code in Listing 1 to create the Prescriptions and Drugs tables and fill them with small sets of sample data to validate your solutions.
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Prescriptions', N'U') IS NOT NULL DROP TABLE dbo.Prescriptions;IF OBJECT_ID(N'dbo.Patients', N'U') IS NOT NULL DROP TABLE dbo.Patients;CREATE TABLE dbo.Patients( patientid INT NOT NULL, CONSTRAINT PK_Patients PRIMARY KEY(patientid));CREATE TABLE dbo.Prescriptions( prescriptionid INT NOT NULL IDENTITY, patientid INT NOT NULL, drugid INT NOT NULL, startdate DATE NOT NULL, numdays INT NOT NULL, enddate AS DATEADD(day, numdays, startdate), CONSTRAINT CHK_Prescriptions_ed_sd CHECK(numdays > 0));CREATE UNIQUE CLUSTERED INDEX idx_start ON dbo.Prescriptions(patientid, drugid, startdate, prescriptionid);ALTER TABLE dbo.Prescriptions ADD CONSTRAINT PK_Prescriptions PRIMARY KEY NONCLUSTERED(prescriptionid);-- code to fill tables with small set of sample dataTRUNCATE TABLE dbo.Prescriptions;TRUNCATE TABLE dbo.Patients;INSERT INTO dbo.Patients(patientid) VALUES(1);INSERT INTO dbo.Prescriptions(patientid, drugid, startdate, numdays) VALUES (1, 1, '20140101', 3), (1, 2, '20140101', 5), (1, 3, '20140102', 4), (1, 4, '20140102', 5), (1, 4, '20140103', 2), (1, 2, '20140105', 5), (1, 3, '20140106', 4), (1, 1, '20140107', 3), (1, 4, '20140108', 1), (1, 4, '20140120', 4), (1, 4, '20140122', 1), (1, 5, '20140212', 3), (1, 5, '20140215', 3), (1, 5, '20140220', 1), (1, 5, '20140223', 1), (1, 5, '20140226', 1);
Run the code in Listing 2 to create a helper function called GetNums and populate the tables with large sets of sample data to check the performance of your solutions.
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITH L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)), L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n FROM Nums ORDER BY rownum;GODECLARE @numpatients AS INT = 10000, @numdrugs AS INT = 10, @numprescriptions AS INT = 10, @firststartdate AS DATE = '20140101', @laststartdate AS DATE = '20141231', @maxnumdays AS INT = 30;TRUNCATE TABLE dbo.Prescriptions;TRUNCATE TABLE dbo.Patients;INSERT INTO dbo.Patients(patientid) SELECT PT.n AS patientid FROM dbo.GetNums(1, @numpatients) AS PT;INSERT INTO dbo.Prescriptions(patientid, drugid, startdate, numdays) SELECT PT.n AS patientid, D.n AS drugid, DATEADD(day, ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @firststartdate, @laststartdate), @firststartdate ) AS startdate, ABS(CHECKSUM(NEWID())) % @maxnumdays + 1 AS numdays FROM dbo.GetNums(1, @numpatients) AS PT CROSS JOIN dbo.GetNums(1, @numdrugs) AS D CROSS JOIN dbo.GetNums(1, @numprescriptions) AS PR;
The first part of the puzzle (which I covered in "T-SQL Interval Graphs Challenge, Part 1") was to pack the prescriptions of the same patient and drug in a rollover manner. Figure 1 graphically depicts the input and desired results.
Figure 1: Input and Packed Intervals
The bars represent the input prescription periods and the arrows represent the resulting packed prescriptions.
Figure 2 shows the packed subscriptions as query output.
patientid drugid startdate enddate----------- ----------- ---------- ----------1 1 2014-01-01 2014-01-041 1 2014-01-07 2014-01-101 2 2014-01-01 2014-01-111 3 2014-01-02 2014-01-101 4 2014-01-02 2014-01-101 4 2014-01-20 2014-01-251 5 2014-02-12 2014-02-181 5 2014-02-20 2014-02-211 5 2014-02-23 2014-02-241 5 2014-02-26 2014-02-27
The set representing the packed prescriptions is used as the input to the second part of the puzzle, which is the focus of this month's article. Run the code in Listing 3 to create a view called PackedPrescriptions, which is based on the solution query in "T-SQL Interval Graphs Challenge, Part 1."
IF OBJECT_ID(N'dbo.PackedPrescriptions', N'V') IS NOT NULL DROP VIEW dbo.PackedPrescriptions;GOCREATE VIEW dbo.PackedPrescriptionsASWITH C1 AS( SELECT prescriptionid, patientid, drugid, startdate, numdays, DATEADD(day, - SUM(numdays) OVER(PARTITION BY patientid, drugid ORDER BY startdate, prescriptionid ROWS UNBOUNDED PRECEDING) + numdays, startdate) AS grphelper FROM dbo.Prescriptions),C2 AS( SELECT patientid, drugid, startdate, numdays, MAX(grphelper) OVER(PARTITION BY patientid, drugid ORDER BY startdate, prescriptionid ROWS UNBOUNDED PRECEDING) AS grp FROM C1)SELECT patientid, drugid, MIN(startdate) AS startdate, DATEADD(day, SUM(numdays), MIN(startdate)) AS enddateFROM C2GROUP BY patientid, drugid, grp;GO
Recall the way the information about the packed prescriptions is provided. The start date is the first date in each packed period when the patient consumes a drug, and the end date is the date following the last consumption date. For example, patient 1 consumed 10 daily doses of drug 2 starting on January 1, 2014, ending on January 11, 2014. This is what's called a left-closed, right-open interval in mathematics, because it's inclusive of the start but exclusive of the end. The standard notation used for such an interval is [startdate, enddate).
This Month's Challenge
This month's puzzle works with the PackedPrescriptions view as the input data. The task is to identify periods in which the patient is at risk due to exposure to a certain minimum number of drugs (provided as input @cnt) of a certain class. For example, suppose that @cnt = 3. Figure 3 graphically depicts the input prescription periods as arrows and the desired output risk periods as red rectangles.
Figure 3: Risk Periods 3+
Figure 4 shows the desired output that your solution should produce.
patientid startdate enddate----------- ---------- ----------1 2014-01-02 2014-01-10
In a similar manner, Figure 5 and Figure 6 show the information assuming @cnt = 4.
Figure 5: Risk Periods 4+
patientid startdate enddate----------- ---------- ----------1 2014-01-02 2014-01-041 2014-01-07 2014-01-10
Next, I'll present my solution in steps. As usual, I recommend that you take a stab at the problem before looking at my solution.
The first step in the solution is to unpivot prescription period start and end dates to separate rows. This is achieved with the following query (the ORDER BY clause is used here just to ensure presentation ordering for clarity):
SELECT patientid, drugid, type, dtFROM dbo.PackedPrescriptions CROSS JOIN(VALUES(1),(-1)) AS ET(type) CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)ORDER BY patientid, drugid, dt;
The query uses a cross join to generate two copies out of each source row from the view. One copy represents start events (marked with type 1), and the other copy represents end events (marked with type -1). Then, a CROSS APPLY operator is used to return the relevant date for the current row (start or end) as the new dt column. Figure 7 shows the output of this query in abbreviated form.
patientid drugid type dt----------- ----------- ----------- ----------1 1 1 2014-01-011 1 -1 2014-01-041 1 1 2014-01-071 1 -1 2014-01-101 2 1 2014-01-011 2 -1 2014-01-11...
The second step is to compute the count of drugs consumed concurrently per patient and date. This is achieved with the following query:
SELECT patientid, dt, SUM(type) AS datecnt, SUM(SUM(type)) OVER(PARTITION BY patientid ORDER BY dt ROWS UNBOUNDED PRECEDING) AS cntFROM dbo.PackedPrescriptions CROSS JOIN(VALUES(1),(-1)) AS ET(type) CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)GROUP BY patientid, dtORDER BY patientid, dt;
The starting point is the query from the previous step. The unpivoted events are grouped by patient and dt. The grouped SUM aggregate is applied to the type column (remember, 1 is a start event and -1 is an end event) to compute the daily delta. For example, if two periods started and three ended, in total there's a delta of -1 that day. Then the windowed SUM aggregate computes a running total of the grouped SUM to get the balance to date (concurrent drugs consumed that day). Figure 8 shows the output of this step.
patientid dt datecnt cnt----------- ---------- ----------- -----------1 2014-01-01 2 21 2014-01-02 2 41 2014-01-04 -1 31 2014-01-07 1 41 2014-01-10 -3 11 2014-01-11 -1 01 2014-01-20 1 11 2014-01-25 -1 01 2014-02-12 1 11 2014-02-18 -1 01 2014-02-20 1 11 2014-02-21 -1 01 2014-02-23 1 11 2014-02-24 -1 01 2014-02-26 1 11 2014-02-27 -1 0
The third step's purpose is to compute two columns called isstart and isend, marking whether a date is a start or end of a period where the patient is at risk. Here's the query implementing the third step:
DECLARE @cnt AS INT = 3;WITH C1 AS( SELECT patientid, dt, SUM(type) AS datecnt, SUM(SUM(type)) OVER(PARTITION BY patientid ORDER BY dt ROWS UNBOUNDED PRECEDING) AS cnt FROM dbo.PackedPrescriptions CROSS JOIN(VALUES(1),(-1)) AS ET(type) CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt) GROUP BY patientid, dt)SELECT *-- ,(ROW_NUMBER() OVER(PARTITION BY patientid ORDER BY dt) - 1) / 2 + 1 AS grpFROM C1 CROSS APPLY(VALUES(CASE WHEN cnt >= @cnt AND cnt - datecnt < @cnt THEN 1 ELSE 0 END, CASE WHEN cnt < @cnt AND cnt - datecnt >= @cnt THEN 1 ELSE 0 END )) AS A(isstart, isend)--WHERE isstart = 1 OR isend = 1ORDER BY patientid, dt;
This query defines a CTE called C1 based on the query implementing the second step. The outer query computes isstart as 1 if both of the following conditions are true:
Cnt (count of concurrent drugs consumed that day) is greater than or equal to @cnt (the input count).
Cnt in the previous day (cnt - datecnt) was less than @cnt.
Otherwise, the code computes isstart as 0.
Similarly, the code computes isend as 1 if both of the following conditions are true:
Cnt is less than or equal to @cnt.
Cnt in the next day is greater than or equal to @cnt.
Otherwise, the code computes isend as 0. Figure 9 shows the output of the third step.
patientid dt datecnt cnt isstart isend--------- ---------- ------- --- ------- -----1 2014-01-01 2 2 0 01 2014-01-02 2 4 1 01 2014-01-04 -1 3 0 01 2014-01-07 1 4 0 01 2014-01-10 -3 1 0 11 2014-01-11 -1 0 0 01 2014-01-20 1 1 0 01 2014-01-25 -1 0 0 01 2014-02-12 1 1 0 01 2014-02-18 -1 0 0 01 2014-02-20 1 1 0 01 2014-02-21 -1 0 0 01 2014-02-23 1 1 0 01 2014-02-24 -1 0 0 01 2014-02-26 1 1 0 01 2014-02-27 -1 0 0 0
Next, step 4 is to filter only start and end events, and compute a target period identifier (grp) for each pair of consecutive start and end events. You might have noticed that the code implementing step 3 already contains the elements implementing step 4 commented out. Uncomment both the filter and the calculation of grp and run the code again, once with @cnt = 3 and again with @cnt = 4. Figure 10 shows the result of this code with @cnt = 3.
patientid dt datecnt cnt isstart isend grp--------- ---------- ------- --- ------- ----- ---1 2014-01-02 2 4 1 0 11 2014-01-10 -3 1 0 1 1
Figure 11 shows the result of this code with @cnt = 4.
patientid dt datecnt cnt isstart isend grp--------- ---------- ------- ----------- ------- ----- ---1 2014-01-02 2 4 1 0 11 2014-01-04 -1 3 0 1 11 2014-01-07 1 4 1 0 21 2014-01-10 -3 1 0 1 2
Step 5 completes the solution by pivoting the filtered events to return each period where the patient is at risk in one row. To achieve this you define a CTE called C2 based on the outer query from step 4. Then in the outer query of this step you group the rows from C2 by patientid and grp, and return MIN(dt) as the start date and MAX(dt) as the end date. Listing 4 contains the complete solution code.
DECLARE @cnt AS INT = 3;WITH C1 AS( SELECT patientid, dt, SUM(type) AS datecnt, SUM(SUM(type)) OVER(PARTITION BY patientid ORDER BY dt ROWS UNBOUNDED PRECEDING) AS cnt FROM dbo.PackedPrescriptions CROSS JOIN(VALUES(1),(-1)) AS ET(type) CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt) GROUP BY patientid, dt),C2 AS( SELECT *, (ROW_NUMBER() OVER(PARTITION BY patientid ORDER BY dt) - 1) / 2 + 1 AS grp FROM C1 CROSS APPLY(VALUES(CASE WHEN cnt >= @cnt AND cnt - datecnt < @cnt THEN 1 ELSE 0 END, CASE WHEN cnt < @cnt AND cnt - datecnt >= @cnt THEN 1 ELSE 0 END )) AS A(isstart, isend) WHERE isstart = 1 OR isend = 1)SELECT patientid, MIN(dt) AS startdate, MAX(dt) AS enddateFROM C2GROUP BY patientid, grp;GO
Complete Solution
The complete solution ran for 9 seconds on my system against the large set of sample data. This is pretty good performance, considering the complexity of the task. About 3 seconds can be attributed to the first part of solution (which I covered in "T-SQL Interval Graphs Challenge, Part 1") handling the packing logic. The remaining 6 seconds can be attributed to the second part of the solution (which I covered in this article) identifying the periods at risk. Of course, there's always room for improvement. Perhaps you managed to come up with an even faster solution!
About the Author
You May Also Like