Solution to Catch That Bug! Random Row per Group

Itzik presnts solutions to last week's challenge Catch That Bug! Random Row per Group.

Itzik Ben-Gan

October 21, 2006

3 Min Read
ITPro Today logo in a gray background | ITPro Today

Last week I presented the following puzzle:
Catch That Bug! Random Row per Group
(http://www.sqlmag.com/Article/ArticleID/93895/93895.html).
As some of you figured, the reason that the query returns varying numbers of
rows in different invocations is because the subquery is reevaluated once per
order (can be observed in the execution plan), while you want the subquery to
reevaluate once per employee. Every time the subquery is invoked a new
random OrderID is produced, and it may or may not be equal to the OrderID
in the outer row. In some cases no order is found for a given employee, and
in some cases more than one order is found. In short, you need to come up
with a solution where either the execution plan evaluates a random order only
once per employee, or one where random sort values are assigned only once
(in one shot) for the set of Orders.
As an aside, the expression CHECKSUM(NEWID()) generates values with
better random distribution than NEWID() does, but that’s a different story.

SQL Server 2005 has language elements that allow elegant and efficient
solutions that overcome the problem including the APPLY table operator and
the ROW_NUMBER function. But I’ll start with solutions that work in SQL
Server 2000 first.

The following query produces a random OrderID per employee:

SELECT (SELECT TOP 1 OrderID        FROM dbo.Orders AS O        WHERE O.EmployeeID = E.EmployeeID        ORDER BY CHECKSUM(NEWID())) AS RandomOrderFROM dbo.Employees AS E;


Now filter the orders from the Orders table where the OrderID is IN the set
of OrderIDs returned by the previous query:

SELECT EmployeeID, OrderID, OrderDate, CustomerIDFROM dbo.OrdersWHERE OrderID IN    (SELECT (SELECT TOP 1 OrderID             FROM dbo.Orders AS O             WHERE O.EmployeeID = E.EmployeeID             ORDER BY CHECKSUM(NEWID())) AS RandomOrder     FROM dbo.Employees AS E)ORDER BY EmployeeID;


You get a plan where the innermost subquery (returning a random OrderID
for the outer employee) is evaluated only once per employee. You might not
feel comfortable with such a solution, since its correctness relies on the
execution plan that the optimizer chooses. The execution plan might be
different with different data distribution, indexes, etc. To be on the safe side,
you can first calculate a random sort value for each order (in one shot) and
materialize the result set in a temporary table:

SELECT EmployeeID, OrderID, CHECKSUM(NEWID()) AS RndINTO #OrdersRndFROM dbo.Orders;


Then you can safely use the logic that I presented with the original query:

SELECT EmployeeID, OrderID, OrderDate, CustomerIDFROM dbo.Orders AS O1WHERE OrderID IN    (SELECT TOP 1 OrderID     FROM #OrdersRnd AS O2     WHERE O2.EmployeeID = O1.EmployeeID     ORDER BY Rnd)ORDER BY EmployeeID;


Not an efficient solution, but a safe one.

As for SQL Server 2005, things are simpler. You can use the APPLY
operator to return a random order per employee like so:

SELECT A.*FROM dbo.Employees AS E  CROSS APPLY    (SELECT TOP (1) EmployeeID, OrderID, OrderDate, CustomerID     FROM dbo.Orders AS O     WHERE O.EmployeeID = E.EmployeeID     ORDER BY CHECKSUM(NEWID())) AS AORDER BY EmployeeID;


The APPLY operator pretty much lends itself to being evaluated only once
per outer row.

Another solution is to calculate row numbers partitioned by employee, sorted
by randomly generated values (CHECKSUM(NEWID()))) . Encapsulate the
query that generates random values in a CTE, and have the outer query filter
only rows where the row number is equal to 1:

WITH Orders_RN AS(  SELECT EmployeeID, OrderID, OrderDate, CustomerID,    ROW_NUMBER() OVER(PARTITION BY EmployeeID                      ORDER BY CHECKSUM(NEWID())) AS RowNum  FROM dbo.Orders)SELECT *FROM Orders_RNWHERE RowNum = 1ORDER BY EmployeeID;


Here the random values are generated only once. The last two solutions that
use APPLY and ROW_NUMBER are also more flexible than the original
solution in the sense that you can request more than one random order per
employee:

SELECT A.*FROM dbo.Employees AS E  CROSS APPLY    (SELECT TOP (3) EmployeeID, OrderID, OrderDate, CustomerID     FROM dbo.Orders AS O     WHERE O.EmployeeID = E.EmployeeID     ORDER BY CHECKSUM(NEWID())) AS AORDER BY EmployeeID;


WITH Orders_RN AS(  SELECT EmployeeID, OrderID, OrderDate, CustomerID,    ROW_NUMBER() OVER(PARTITION BY EmployeeID                      ORDER BY CHECKSUM(NEWID())) AS RowNum  FROM dbo.Orders)SELECT *FROM Orders_RNWHERE RowNum 
Cheers,
--
BG
 
Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like