Solution to Catch That Bug! Random Row per Group
Itzik presnts solutions to last week's challenge Catch That Bug! Random Row per Group.
October 21, 2006
Last week I presented the following puzzle:
Catch That Bug! Random Row per Group
(http://www.sqlmag.com/Article/ArticleID/93895/93895.html).
As some of you figured, the reason that the query returns varying numbers of
rows in different invocations is because the subquery is reevaluated once per
order (can be observed in the execution plan), while you want the subquery to
reevaluate once per employee. Every time the subquery is invoked a new
random OrderID is produced, and it may or may not be equal to the OrderID
in the outer row. In some cases no order is found for a given employee, and
in some cases more than one order is found. In short, you need to come up
with a solution where either the execution plan evaluates a random order only
once per employee, or one where random sort values are assigned only once
(in one shot) for the set of Orders.
As an aside, the expression CHECKSUM(NEWID()) generates values with
better random distribution than NEWID() does, but that’s a different story.
SQL Server 2005 has language elements that allow elegant and efficient
solutions that overcome the problem including the APPLY table operator and
the ROW_NUMBER function. But I’ll start with solutions that work in SQL
Server 2000 first.
The following query produces a random OrderID per employee:
SELECT (SELECT TOP 1 OrderID FROM dbo.Orders AS O WHERE O.EmployeeID = E.EmployeeID ORDER BY CHECKSUM(NEWID())) AS RandomOrderFROM dbo.Employees AS E;
Now filter the orders from the Orders table where the OrderID is IN the set
of OrderIDs returned by the previous query:
SELECT EmployeeID, OrderID, OrderDate, CustomerIDFROM dbo.OrdersWHERE OrderID IN (SELECT (SELECT TOP 1 OrderID FROM dbo.Orders AS O WHERE O.EmployeeID = E.EmployeeID ORDER BY CHECKSUM(NEWID())) AS RandomOrder FROM dbo.Employees AS E)ORDER BY EmployeeID;
You get a plan where the innermost subquery (returning a random OrderID
for the outer employee) is evaluated only once per employee. You might not
feel comfortable with such a solution, since its correctness relies on the
execution plan that the optimizer chooses. The execution plan might be
different with different data distribution, indexes, etc. To be on the safe side,
you can first calculate a random sort value for each order (in one shot) and
materialize the result set in a temporary table:
SELECT EmployeeID, OrderID, CHECKSUM(NEWID()) AS RndINTO #OrdersRndFROM dbo.Orders;
Then you can safely use the logic that I presented with the original query:
SELECT EmployeeID, OrderID, OrderDate, CustomerIDFROM dbo.Orders AS O1WHERE OrderID IN (SELECT TOP 1 OrderID FROM #OrdersRnd AS O2 WHERE O2.EmployeeID = O1.EmployeeID ORDER BY Rnd)ORDER BY EmployeeID;
Not an efficient solution, but a safe one.
As for SQL Server 2005, things are simpler. You can use the APPLY
operator to return a random order per employee like so:
SELECT A.*FROM dbo.Employees AS E CROSS APPLY (SELECT TOP (1) EmployeeID, OrderID, OrderDate, CustomerID FROM dbo.Orders AS O WHERE O.EmployeeID = E.EmployeeID ORDER BY CHECKSUM(NEWID())) AS AORDER BY EmployeeID;
The APPLY operator pretty much lends itself to being evaluated only once
per outer row.
Another solution is to calculate row numbers partitioned by employee, sorted
by randomly generated values (CHECKSUM(NEWID()))) . Encapsulate the
query that generates random values in a CTE, and have the outer query filter
only rows where the row number is equal to 1:
WITH Orders_RN AS( SELECT EmployeeID, OrderID, OrderDate, CustomerID, ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY CHECKSUM(NEWID())) AS RowNum FROM dbo.Orders)SELECT *FROM Orders_RNWHERE RowNum = 1ORDER BY EmployeeID;
Here the random values are generated only once. The last two solutions that
use APPLY and ROW_NUMBER are also more flexible than the original
solution in the sense that you can request more than one random order per
employee:
SELECT A.*FROM dbo.Employees AS E CROSS APPLY (SELECT TOP (3) EmployeeID, OrderID, OrderDate, CustomerID FROM dbo.Orders AS O WHERE O.EmployeeID = E.EmployeeID ORDER BY CHECKSUM(NEWID())) AS AORDER BY EmployeeID;
WITH Orders_RN AS( SELECT EmployeeID, OrderID, OrderDate, CustomerID, ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY CHECKSUM(NEWID())) AS RowNum FROM dbo.Orders)SELECT *FROM Orders_RNWHERE RowNum
Cheers,
--
BG
About the Author
You May Also Like