Solutions to the Travel Puzzle
Itzik covers solutions to the T-SQL challenge provided in his previous post.
August 20, 2010
In my previous post I provided a T-SQL challenge called Travel Puzzle that involved calculating the total miles travelled domestically in each country by each traveler. You can find the details of the puzzle here: http://www.sqlmag.com/blogs/puzzled-by-t-sql/tabid/1023/entryid/13023/TSQL-Travel-Puzzle.aspx. Here I’m going to cover the solutions. Thanks to all those who posted solutions including Jermy, Peter Larsson (Peso), Marcello Poletti (Marc), Regan Wick, and Colleen Morrow.
The solutions use four different strategies:
1. Identifying islands (Itzik 1, Peso)
2. Using a subquery that retrieves information from the “next” row (Regan, Colleen)
3. Using a function that returns the “next” row and a recursive query (Marc)
4. Matching adjacent rows using row numbers (Jermy, Itzik 2)
Now for the details…
1. Identifying islands (Itzik 1 and Peso)
The key part of the solution is to calculate two row numbers that are partitioned by travelerid—one ordered by travelstart, and the other ordered by destination, travelstart. The difference between the two is constant and unique per traveler and destination. You can then group the rows by traveler, destination and that diff, calculate the difference between the maximum miles traveled and minimum miles traveled, and this way get the total miles traveled in each leg of the itinerary. Finally, group the result by traveler and calculate the total miles traveled in all legs.
Here’s my solution implementing this approach along with performance measures:
-- Itzik 1
-- CPU time = 9501 ms, elapsed time = 8554 ms, logical reads 4978
WITH C1 AS
(
SELECT travelerid, destination, milestraveled,
ROW_NUMBER() OVER(PARTITION BY travelerid
ORDER BY travelstart) -
ROW_NUMBER() OVER(PARTITION BY travelerid
ORDER BY destination, travelstart) AS grp
FROM dbo.Travel
),
C2 AS
(
SELECT travelerid, destination,
MAX(milestraveled) - MIN(milestraveled) AS miles
FROM C1
GROUP BY travelerid, destination, grp
)
SELECT travelerid, destination, SUM(miles) AS totalmiles
FROM C2
GROUP BY travelerid, destination;
Peso implemented a very similar idea, only in the second row number he specified the destination attribute as a partitioning (as opposed to ordering) element. The effect is the same, though. Here’s Peso’s solution:
-- Peso
-- CPU time = 10950 ms, elapsed time = 8347 ms, logical reads 4978
WITH cteTravel(TravelerID, Destination, TotalMiles)
AS
(
SELECT TravelerID,
Destination,
MAX(MilesTraveled) - MIN(MilesTraveled) AS TotalMiles
FROM (
SELECT TravelerID,
Destination,
MilesTraveled,
ROW_NUMBER() OVER (PARTITION BY TravelerID ORDER BY TravelStart) AS Yak,
ROW_NUMBER() OVER (PARTITION BY TravelerID, Destination ORDER BY TravelStart) AS Yik
FROM dbo.Travel
) AS x
GROUP BY TravelerID,
Destination,
Yak - Yik
)
SELECT TravelerID,
Destination,
SUM(TotalMiles) AS TotalMiles
FROM cteTravel
GROUP BY TravelerID,
Destination;
The benefit in these solutions is that the data is scanned only once. One of the row number calculations can benefit from an ordered scan of an index. However, there’s no avoiding of one sort operation in the plan.
2. Using a subquery that retrieves information from the “next” row (Regan, Colleen)
The solutions in this category use a subquery retrieving the interesting information (milestraveled) from the “next” row; that is, in respect to the current row, the first row with the same travelerid value, a greater travelstart value, based on travelstart ordering, provided that the destination is the same. Regan implemented this approach using APPLY and a TOP query, like so:
-- Regan
-- CPU time = 11372 ms, elapsed time = 4552 ms, logical reads 3197603
SELECT
dbo.Travel.travelerid
,dbo.Travel.destination
,SUM(nextstop.milestraveled - dbo.Travel.milestraveled) as totalmiles
FROM
dbo.Travel
CROSS APPLY
(
SELECT TOP 1
milestraveled
,destination
FROM
dbo.Travel Travel2
WHERE
Travel2.travelerid = dbo.Travel.travelerid
AND Travel2.travelstart > dbo.Travel.travelstart
ORDER BY
Travel2.travelstart
) nextstop
WHERE
dbo.Travel.destination = nextstop.destination
GROUP BY
dbo.Travel.travelerid
,dbo.Travel.destination;
As you can see, this solution is surprisingly fast despite the very high number of logical reads. The reason for this is that most of the reads are only logical in a test system where this is the only query running, and the plan for this query utilizes parallelism very efficiently. In an environment with multiple concurrent queries, a query with a high number of reads will often tend to incur a bigger performance hit than one with a low number. Nevertheless, it is surprising to see how fast it runs in an isolated environment albeit the high number of reads.
Colleen’s solution is based on similar principals only using a join, subquery and aggregate, instead of APPLY, subquery and TOP. Here’s Colleen’s solution:
-- Colleen
-- CPU time = 18798 ms, elapsed time = 7188 ms, logical reads 3202667
select travelerid, departed, sum(miles) from
(select t1.travelerid, t1.destination as departed, t2.destination as arrived, t2.milestraveled - t1.milestraveled as miles
from travel t1 join travel t2 on t1.travelerid = t2.travelerid
and t2.travelstart = (select min(t3.travelstart) from travel t3 where t3.travelstart > t1.travelstart
and t3.travelerid = t1.travelerid)
) a
where departed = arrived
group by travelerid, departed;
In this category Regan’s solution performs better.
3. Using a function that returns the “next” row and a recursive query (Marc)
The third category is based on Marc’s phantasy of having a PREVIUS operator in the language; right Marc? :)
The implementation of a currently supported alternative to this construct is to create an inline table function that returns the “next” or “previous” row using TOP logic based on travelstart ordering for the same traveler, and then use a recursive query that keeps iterating through adjacent rows. Here’s the complete solution:
-- Marc
-- CPU time = 45427 ms, elapsed time = 49588 ms, logical reads 5999999 (Worktable) + logical reads 3009928 (Travel)
create function fi_NextTravel(@travelerid int, @travelstart datetime2(0))
returns table as return
select top 1 *
from dbo.Travel
where travelerid=@travelerid and travelstart>@travelstart
order by travelstart
go
with r as(
select *, Ml=0 from dbo.Travel where milestraveled=0
union all
select n.*, case n.destination when r.destination then n.milestraveled-r.milestraveled else 0 end
from r
cross apply dbo.fi_NextTravel(r.travelerid, r.travelstart) n
)
select travelerid, destination, milestraveld=sum(Ml) from r
group by travelerid, destination
option(maxrecursion 0)
go
As you can see, this solution is very slow mainly due to the high overhead of the recursive query.
4. Matching adjacent rows using row numbers (Jermy, Itzik 2)
Solutions based on this category start by assigning row numbers partitioned by travelerid, ordered by travelstart. You then join two instances of the set with the row numbers, matching those by travelerid and a difference of 1 between the row numbers. You filter only those rows where the destination is the same in both sides. Here’s Jermy’s implementation of the solution:
-- Jermy
-- no perf info--stopped query after a couple of hours
WITH CTE_Ord AS (
SELECT ROW_NUMBER() OVER(PARTITION BY t.travelerid ORDER BY t.travelstart) AS Ord
,t.travelerid
,t.destination
,t.travelstart
,t.milestraveled
FROM dbo.Travel AS T
)
SELECT COFrom.travelerid,COFrom.destination,SUM(COTo.milestraveled - COFrom.milestraveled) milestraveled
FROM CTE_Ord AS COFrom
INNER JOIN CTE_Ord AS COTo
ON COFrom.travelerid = COTo.travelerid
AND COFrom.destination = COTo.destination
AND COFrom.Ord = COTo.Ord - 1
GROUP BY COFrom.travelerid,COFrom.destination;
You would expect this solution to perform reasonably, but when running it on my machine it just kept going and going, and eventually I stopped it after a few hours. I discussed the optimization of this query with fellow MVPs and would like to thank Peter Larsson and Rob Farley for their inputs. The plan shows a merge join algorithm used to process the join. The culprit seems to be the merge operator’s split of processing of predicates to Where (join columns) and Residual. The data is served to the merge operator after sorting ordered by destination and travelid. The Where (join) property of the operator therefore indicates filtering that is based on this order: ([tempdb].[dbo].[Travel].destination, [tempdb].[dbo].[Travel].travelerid) = ([tempdb].[dbo].[Travel].destination, [tempdb].[dbo].[Travel].travelerid). This merge join is processed as a many-to-many one, where for each row in one side, the operator matches all rows with the same travelerid and destination doing very massive rewinds. It then processes the predicate appearing in the Residual property of the operator on top doing the remaining filtering: [tempdb].[dbo].[Travel].[travelerid] as [T].[travelerid]=[tempdb].[dbo].[Travel].[travelerid] as [T].[travelerid] AND [tempdb].[dbo].[Travel].[destination] as [T].[destination]=[tempdb].[dbo].[Travel].[destination] as [T].[destination] AND [Expr1002]=([Expr1005]-(1)). The last part is the row number matching that does the majority of the filtering. The difference between the Where (join columns) and Residual predicates seems in a sense similar to the difference between a Seek predicate and Predicate in index operations. BTW, when forcing a hash join algorithm with a join hint, the query runs for about 5 seconds on my machine.
I used a similar solution to Jermy’s, but when observing the performance issue, I moved the predicate that matches the destination in both sides of the join to a CASE expression within the aggregate function, and this trick paid off big time. The optimizer still used a merge join algorithm, but did both destination and row number matching in the Where (join columns) predicate. The plan did not incur any sort operations either, but rather relied on the ordered scan of the index used to calculate the row numbers also for the merge join, and then handled the aggregate using hashing. Here’s the solution with the performance numbers:
-- Itzik 2
-- CPU time = 11515 ms, elapsed time = 3958 ms, logical reads 9956
WITH C AS
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY travelerid ORDER BY travelstart) AS rownum
FROM dbo.Travel
)
SELECT Cur.travelerid, Cur.destination,
SUM(CASE
WHEN Cur.destination = Prv.destination
THEN Cur.milestraveled - Prv.milestraveled
END) AS totalmiles
FROM C AS Cur
JOIN C AS Prv
ON Cur.travelerid = Prv.travelerid
AND Cur.rownum = Prv.rownum + 1
GROUP BY Cur.travelerid, Cur.destination;
Wishful Thinking
I’d like to point out one more solution that is not supported at the moment in SQL Server (as of 2008 R2). But if support for window functions is enhanced in a future version; specifically for the LAG and LEAD functions, you will be able to achieve this task like so:
SELECT travelerid, destination,
SUM(CASE
WHEN destination = LAG(destination) OVER(PARTITION BY travelerid ORDER BY travelstart)
THEN milestravled - LAG(milestraveled) OVER(PARTITION BY travelerid ORDER BY travelstart)
END) AS totalmiles
FROM dbo.Travel
GROUP BY travelerid, destination;
Such a solution has all the potential for good optimization, but naturally until we see it in SQL Server we can’t tell.
Cheers, and happy querying!
BG
About the Author
You May Also Like