Solution to the T-SQL Puzzle - Grouping consecutive rows with a common element
Itzik talks about the solutions to the T-SQL puzzle and announces winners.
September 21, 2006
A couple of weeks ago I posted the T-SQL challenge
“Grouping consecutive rows with a common element”
(http://www.sqlmag.com/article/articleid/93462/sql_server_blog_93462.html).
I got many interesting solutions via e-mail. We were supposed to give away
two prizes: one winner was supposed to be chosen based on the fastest
correct solution, and another winner was supposed to be chosen randomly
out of those who provided correct solutions.
The fastest solutions ran for about 4-5 seconds against the big table, but since
I got several solutions that ran that fast (Maciej Pilecki, Dieter Noeth, Michael
Yang and Sania Chan), I had to randomly choose a winner out of those. So
the winner under this category is Dieter Noeth; you will get one of my
T-SQL books, courtesy of MSPress.
As for the category of correct solutions regardless of performance, I got
correct solutions from: Steve Kass, Maciej Pilecki, Chris Gin, Dieter Noeth,
Tung H. Nguyen, Steve Dassin, Michael Yang, Craig Bennett,
Alejandro Mesa, Rajeev Lahoty and Sania Chan (apologies if I missed
anyone). The winner chosen randomly in this category is Craig Bennett; you
will get a SQL Server Magazine hat. I will send your e-mail addresses to
SQL Server Magazine’s editors, and they will get in touch with you to send
you the prizes.
Here I’ll present two of the approaches used in the fast solutions.
One approach was to calculate for each row from the Attendance table a
grouping element (call it grp) that uniquely identifies a consecutive group of
rows. The expression that calculates grp is:
ROW_NUMBER() OVER(PARTITION BY student ORDER BY dt, slot) - ROW_NUMBER() OVER(PARTITION BY student, attend ORDER BY dt, slot) AS grp
The trick here is that each set of consecutive rows with the same attendance
status gets a unique value that you can later use for grouping. As for the fact
that precedence among rows is based on the combination of columns dt, slot,
the trick is to concatenate the two into one value. This can be achieved with
different techniques. One technique is to use binary concatenation:
CAST(dt AS BINARY(8)) + CAST(slot AS BINARY(4)) AS dt_slot
Another technique is to add the slot value to dt as a time unit (e.g., second):
DATEADD(second, slot, DT) AS dt_slot
Once dt and slot are concatenated, of course you can calculate the min and
max in each group, and then break the min and max apart to the two elements.
Here’s the complete solution using binary concatenation:
WITH CTE_GrpAS( SELECT student, attend, CAST(dt AS BINARY(8)) + CAST(slot AS BINARY(4)) AS dt_slot, ROW_NUMBER() OVER(PARTITION BY student ORDER BY dt, slot) - ROW_NUMBER() OVER(PARTITION BY student, attend ORDER BY dt, slot) AS grp FROM dbo.Attendance),CTE_RAWAS( SELECT student, MIN(dt_slot) AS min_dt_slot, MAX(dt_slot) AS max_dt_slot, attend, COUNT(*) AS cnt FROM CTE_Grp GROUP BY student, attend, grp)SELECT student, CAST(SUBSTRING(min_dt_slot, 1, 8) AS DATETIME) AS from_dt, CAST(SUBSTRING(min_dt_slot, 9, 4) AS INT) AS from_slot, CAST(SUBSTRING(max_dt_slot, 1, 8) AS DATETIME) AS to_dt, CAST(SUBSTRING(max_dt_slot, 9, 4) AS INT) AS to_slot, attend, cntFROM CTE_RAW;
And here’s the solution that uses datetime based concatenation:
WITH CTE_GrpAS( SELECT student, attend, DATEADD(second, slot, DT) AS dt_slot, ROW_NUMBER() OVER(PARTITION BY student ORDER BY dt, slot) - ROW_NUMBER() OVER(PARTITION BY student, attend ORDER BY dt, slot) AS grp FROM dbo.Attendance),CTE_RAWAS( SELECT student, MIN(dt_slot) AS min_dt_slot, MAX(dt_slot) AS max_dt_slot, attend, COUNT(*) AS cnt FROM CTE_Grp GROUP BY student, attend, grp)SELECT student, DATEADD(day, DATEDIFF(day, 0, min_dt_slot), 0) AS from_dt, DATEPART(second, min_dt_slot) AS from_slot, DATEADD(day, DATEDIFF(day, 0, max_dt_slot), 0) AS to_dt, DATEPART(second, max_dt_slot) AS to_slot, attend, cntFROM CTE_RAW;
Another approach was to:
- Assign row numbers to the rows in the Attendance table based on dt, slot
ordering (partitioned by student)
- Join two instances of the previous result-set to match for each student’s
row the next row (row with the same student and a row number greater by
one)
- Filter rows where the attendance status changes
Here you need to be careful not to lose the first/last row for each student. To
get around this problem some solutions used an outer join. An interesting
technique used by Michael Yang was to add sentinel rows; one before the
first student’s row and another after the last student’s row.
- Assign row numbers to the remaining rows based on dt, slot ordering
(partitioned by student)
- Join two instances of the previous result-set based on a match in student
and an offset of one between the row numbers
- Collect the from and to elements from each row, and calculate the count as
the offset between the originally calculated row numbers
Here’s the complete solution using the sentinel rows approach:
WITH AddIn as -- sentinel rows( SELECT student, dt, 1 AS slot, -1 AS attend, CASE dt WHEN '19000101' THEN 0 ELSE cnt + 1 END AS pseudodate FROM (SELECT student, COUNT(*) AS cnt FROM dbo.Attendance GROUP BY student) AS D1, (SELECT CAST('19000101' AS DATETIME) AS dt UNION ALL SELECT CAST('99991231' AS DATETIME)) AS D2),Pseudodates AS -- assign row numbers( SELECT student, dt, slot, attend, ROW_NUMBER() OVER(PARTITION BY student ORDER BY dt, slot) AS pseudodate FROM dbo.Attendance UNION ALL SELECT student, dt, slot, attend, pseudodate FROM AddIn),-- match current/next rows-- keep only pairs where status changes-- assign new row numbersCurNxt AS( SELECT Cur.student AS student1, Cur.dt AS dt1, Cur.slot AS slot1, Cur.attend AS attend, Cur.pseudodate AS pseudo1, Nxt.dt AS dt2, Nxt.slot AS slot2, Nxt.pseudodate AS pseudo2, ROW_NUMBER() OVER(PARTITION BY Cur.student ORDER BY Cur.pseudodate) AS seq FROM Pseudodates AS Cur JOIN Pseudodates AS Nxt ON Cur.student = Nxt.student AND Cur.pseudodate = Nxt.pseudodate - 1 AND Cur.attend != Nxt.attend)-- match rows with offset of one between row numbers-- collect from/to elements-- calculate countSELECT F.student1 AS student, F.dt2 AS from_dt, F.slot2 AS from_slot, T.dt1 AS to_dt, T.slot1 AS to_slot, T.attend AS attend, T.pseudo1 - F.pseudo1 AS cntFROM CurNxt AS F JOIN CurNxt T ON F.student1 = T.student1 AND T.seq - 1 = F.seq;
Cheers,
--
BG
About the Author
You May Also Like