T-SQL Challenge – Efficient Partitioned TOP
Itzik presents a T-SQL challenge: write a query that returns the top row per partition, efficiently.
September 28, 2009
This challenge involves writing a query against a table called T1 which you create and populate by running the following code:
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
GO
CREATE TABLE dbo.T1
(
  id   INT         NOT NULL IDENTITY PRIMARY KEY,
  grp  VARCHAR(10) NOT NULL,
  col1 INT         NOT NULL,
  col2 INT         NOT NULL
);
GO
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, -2147483648);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, -1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 2147483647);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 2147483647);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -2147483648);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 100);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 200);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 30);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 50);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -100);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -200);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -30);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -50);
GO
Your task is to write a query that returns one row for each unique grp value, containing the highest col1 value in the group, and the lowest col2 value among the rows that share that highest col1 value. The desired result for the given sample data is:
grp col1 col2
---------- ----------- -----------
A 10 -2147483648
B 42 0
The TOP option doesn’t support an OVER clause, but if it did, the solution might have looked like this:
SELECT TOP (1) OVER(PARTITION BY grp ORDER BY col1 DESC, col2)
grp, col1, col2
FROM dbo.T1;
Of course, the above code doesn't work in SQL Server (unfortunately); it is given here only as conceptual pseudocode. For now, you can apply similar logic by using the ROW_NUMBER function, like so:
WITH C AS
(
  SELECT grp, col1, col2,
    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum
  FROM dbo.T1
)
SELECT grp, col1, col2
FROM C
WHERE rownum = 1;
This solution produces the desired result; however, it doesn't perform well when T1 has a large number of rows and there is no index on (grp, col1 DESC, col2). In that case, the execution plan for this solution involves a sort, which is very expensive. You can see this for yourself by running the solution against a large set of rows. Use the following code to populate T1 with 10,000,000 rows:
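For reference, this is what the supporting index mentioned above would look like, with its keys matching the window function's partitioning and ordering columns (the index name here is my own choice; remember that the challenge assumes you cannot actually create it):

```sql
-- Hypothetical supporting index; the challenge forbids creating new indexes
CREATE INDEX idx_grp_col1D_col2
  ON dbo.T1(grp, col1 DESC, col2);
```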
-- Large set of sample data
-- DDL for GetNums Function
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL
DROP FUNCTION dbo.GetNums;
GO
-- Returns a table of integers 1 through @n.
-- L0 has 2 rows; each subsequent cross join squares the row count,
-- so L5 yields 2^32 rows - more than enough for any practical @n.
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
SELECT n FROM Nums WHERE n <= @n;
GO
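As a quick sanity check of GetNums, the following call returns the integers 1 through 5, one per row:

```sql
-- Returns the rows n = 1, 2, 3, 4, 5
SELECT n FROM dbo.GetNums(5);
```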
-- Populate T1 with 10,000,000 rows in 10,000 groups
TRUNCATE TABLE dbo.T1;
INSERT INTO dbo.T1 WITH(TABLOCK) (grp, col1, col2)
SELECT
'Group' + CAST(ABS(CHECKSUM(NEWID()))%10000 + 1 AS VARCHAR(10)) AS grp,
CHECKSUM(NEWID()) % 100 AS col1,
CHECKSUM(NEWID()) AS col2
FROM dbo.GetNums(10000000) AS Nums;
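When comparing candidate solutions against the large table, one option is to enable time and I/O statistics around the query, as sketched here with the ROW_NUMBER solution from above (SET STATISTICS TIME/IO report CPU time, elapsed time, and logical reads in the Messages pane):

```sql
SET STATISTICS TIME, IO ON;

-- Candidate solution to measure (here, the ROW_NUMBER solution shown earlier)
WITH C AS
(
  SELECT grp, col1, col2,
    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum
  FROM dbo.T1
)
SELECT grp, col1, col2
FROM C
WHERE rownum = 1;

SET STATISTICS TIME, IO OFF;
```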
Remember that in reality you won't always be able to create the ideal indexes to support every possible request in your system, and sometimes you will choose not to create new indexes because of the negative impact they would have on modifications.
So, the challenge is: can you come up with a solution that performs better than the ROW_NUMBER-based one I presented earlier, against the given large set of sample data, and assuming you cannot create any new indexes?
Note that the values in col1 and col2 are not guaranteed to be positive; these columns can contain negative, zero, and positive values.
Please post your solution as a comment to this entry plus e-mail it to me ([email protected]).
I’ll post an entry with the solutions next week.
Cheers and good luck!
BG