Aggregate Over Columns

Itzik discusses solutions to a task involving calculating aggregates over a set of column values (as opposed to over a set of rows).

ITPro Today

March 17, 2010

12 Min Read

Normally when you write queries that aggregate data you apply aggregate functions to sets of rows. This article deals with a less common scenario where you need to aggregate column values from the same row. As an example, consider the following table and sample data:

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
GO

CREATE TABLE dbo.T1
(
  keycol INT NOT NULL,
  col1   INT NULL,
  col2   INT NULL,
  col3   INT NULL,
  col4   INT NULL,
  col5   INT NULL
);

INSERT INTO dbo.T1(keycol, col1, col2, col3, col4, col5) VALUES
  (1, 17, NULL, 2, 5, NULL),
  (2, 31, 59, NULL, 11, 13);

Suppose that you need to write a query that returns, for each row, the keycol value as well as the maximum value out of col1, col2, col3, col4 and col5. You want the aggregate calculation to ignore NULLs just like aggregate functions over a set behave.
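To make the requirement concrete, here's the result the query should produce for the two sample rows above (the comments spell out the NULL-ignoring logic):

```sql
-- Desired result for the sample data:
--
-- keycol  mx
-- ------  ---
-- 1       17    -- maximum of (17, NULL, 2, 5, NULL), NULLs ignored
-- 2       59    -- maximum of (31, 59, NULL, 11, 13), NULLs ignored
```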

To test the performance of your solutions, naturally you want to use bigger volumes of data. You can use the following code to fill the table with 2,000,000 rows:

TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1 WITH (TABLOCK) (keycol, col1, col2, col3, col4, col5)
  SELECT n,
    CHECKSUM(NEWID()),
    CHECKSUM(NEWID()),
    CHECKSUM(NEWID()),
    CHECKSUM(NEWID()),
    CHECKSUM(NEWID())
  FROM dbo.GetNums(2000000) AS Nums;
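The fill code relies on a table function called dbo.GetNums that returns a sequence of integers, whose definition the article doesn't include. Here's one common sketch, modeled on Itzik's well-known cross-joined-CTEs numbers function (an assumption, not from the article):

```sql
-- A numbers helper: returns integers 1 through @n.
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),              -- 2 rows
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), -- 4 rows
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), -- 16 rows
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), -- 256 rows
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), -- 65,536 rows
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), -- 4,294,967,296 rows
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT TOP(@n) n FROM Nums ORDER BY n;
GO
```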

I ran my performance tests on a machine with 8 logical CPUs, with the data completely cached (I measured the run time of a second execution of each solution) and with results discarded. To discard results in SSMS, choose Query Options from the Query menu, then check the Discard results after execution option under Results | Grid, like so:



Solution Based on CTEs and CASE Expressions

One possible solution is to use CASE expressions to compare two values at a time and return the greater. You can use CTEs to reuse the column alias assigned to the result of each comparison in the next one. The solution might look like this:

WITH
C1 AS
(
  SELECT keycol, col3, col4, col5,
    CASE WHEN col1 > col2 OR col2 IS NULL THEN col1 ELSE col2 END AS mx
  FROM dbo.T1
),
C2 AS
(
  SELECT keycol, col4, col5,
    CASE WHEN mx > col3 OR col3 IS NULL THEN mx ELSE col3 END AS mx
  FROM C1
),
C3 AS
(
  SELECT keycol, col5,
    CASE WHEN mx > col4 OR col4 IS NULL THEN mx ELSE col4 END AS mx
  FROM C2
),
C4 AS
(
  SELECT keycol,
    CASE WHEN mx > col5 OR col5 IS NULL THEN mx ELSE col5 END AS mx
  FROM C3
)
SELECT keycol, mx
FROM C4;

As you can see, the solution isn’t too elegant, and the more columns are involved in the calculation, the more convoluted it becomes. In terms of performance, this solution ran for about 10 seconds on my machine (warm cache, results discarded). Here’s the plan I got for this query:


The data is scanned once, and then the Compute Scalar operator computes the max value for each row by evaluating the nested CASE expressions that result from expanding the CTEs. Even though the Compute Scalar operator appears to account for only 2% of the estimated query cost, it's actually quite CPU-intensive, and in practice accounts for far more than that. If you look at the properties of the operator, you will find the following expression:

[Expr1006] = Scalar Operator(CASE WHEN CASE WHEN CASE WHEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END>[tempdb].[dbo].[T1].[col3] OR [tempdb].[dbo].[T1].[col3] IS NULL THEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END ELSE [tempdb].[dbo].[T1].[col3] END>[tempdb].[dbo].[T1].[col4] OR [tempdb].[dbo].[T1].[col4] IS NULL THEN CASE WHEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END>[tempdb].[dbo].[T1].[col3] OR [tempdb].[dbo].[T1].[col3] IS NULL THEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END ELSE [tempdb].[dbo].[T1].[col3] END ELSE [tempdb].[dbo].[T1].[col4] END>[tempdb].[dbo].[T1].[col5] OR [tempdb].[dbo].[T1].[col5] IS NULL THEN CASE WHEN CASE WHEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END>[tempdb].[dbo].[T1].[col3] OR [tempdb].[dbo].[T1].[col3] IS NULL THEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END ELSE [tempdb].[dbo].[T1].[col3] END>[tempdb].[dbo].[T1].[col4] OR [tempdb].[dbo].[T1].[col4] IS NULL THEN CASE WHEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR [tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END>[tempdb].[dbo].[T1].[col3] OR [tempdb].[dbo].[T1].[col3] IS NULL THEN CASE WHEN [tempdb].[dbo].[T1].[col1]>[tempdb].[dbo].[T1].[col2] OR 
[tempdb].[dbo].[T1].[col2] IS NULL THEN [tempdb].[dbo].[T1].[col1] ELSE [tempdb].[dbo].[T1].[col2] END ELSE [tempdb].[dbo].[T1].[col3] END ELSE [tempdb].[dbo].[T1].[col4] END ELSE [tempdb].[dbo].[T1].[col5] END)

This query could benefit from parallelism, but since there's no filtering or other work that the optimizer considers worthwhile to distribute across multiple threads, it generated a serial plan. One way to try to get the optimizer to perform a parallel scan, with multiple threads processing the rows, is to add a dummy filter, like so:

WITH
C1 AS
(
  SELECT keycol, col3, col4, col5,
    CASE WHEN col1 > col2 OR col2 IS NULL THEN col1 ELSE col2 END AS mx
  FROM dbo.T1
),
C2 AS
(
  SELECT keycol, col4, col5,
    CASE WHEN mx > col3 OR col3 IS NULL THEN mx ELSE col3 END AS mx
  FROM C1
),
C3 AS
(
  SELECT keycol, col5,
    CASE WHEN mx > col4 OR col4 IS NULL THEN mx ELSE col4 END AS mx
  FROM C2
),
C4 AS
(
  SELECT keycol,
    CASE WHEN mx > col5 OR col5 IS NULL THEN mx ELSE col5 END AS mx
  FROM C3
)
SELECT keycol, mx
FROM C4
WHERE keycol % 1 = 0;

This time the plan is parallel and the query runs for about four seconds on my machine:

 


Either way, the solution is convoluted, and in this respect not a good one.

Solution Based on Unpivoting

The next solution I’ll cover is based on unpivoting the column values into rows, and then applying the calculation as a regular GROUP BY query with an aggregate function. This approach is probably the most intuitive way for most people to handle the task. Here’s the solution code:

SELECT keycol, MAX(val) AS mx
FROM dbo.T1
  UNPIVOT(val FOR col IN(col1, col2, col3, col4, col5)) AS U
GROUP BY keycol;

As you can see, the solution is far more elegant than the previous one. However, if you look at the execution plan for this query you will notice that SQL Server generates 10,000,000 rows for the unpivoted form of the data (2,000,000 x 5), and then aggregates those:


As you can imagine, this work has a cost. Even though parallelism is utilized very efficiently in this plan, the query ran for about 8 seconds on my machine. Note also that UNPIVOT eliminates NULLs; that's what makes MAX ignore them here, but it also means that a row whose five columns are all NULL would be absent from the result, whereas the CASE-based solution returns it with a NULL mx.
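For what it's worth, the same unpivoting can also be expressed with CROSS APPLY and a table value constructor. This variant isn't shown in the article; unlike UNPIVOT, it keeps rows whose columns are all NULL:

```sql
-- A sketch, not from the article: unpivoting via CROSS APPLY.
-- Each row of T1 is joined to a five-row derived table of its own values.
SELECT keycol, MAX(val) AS mx
FROM dbo.T1
  CROSS APPLY (VALUES(col1),(col2),(col3),(col4),(col5)) AS D(val)
GROUP BY keycol;
```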

Solution Based on Table-Value Constructor (Enhanced VALUES Clause)

The last solution I’ll cover is one that is both elegant and efficient. It relies on the enhanced functionality of the VALUES clause. You use the VALUES clause to define a derived table with a row for each column value, then apply a MAX aggregate function against the derived table in a subquery. Here’s the complete solution:

SELECT keycol,
  (SELECT MAX(val)
   FROM (VALUES(col1),(col2),(col3),(col4),(col5)) AS D(val)) AS mx
FROM dbo.T1;

The solution is doubtless very elegant. As for performance, here’s the plan I got for this query:


SQL Server uses a Constant Scan operator to generate a table from each row's set of five values, and calculates the aggregate against that set. Parallelism is used very efficiently as well. This solution ran for about 3 seconds on my machine, making it both elegant and efficient compared with the alternatives.
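The derived-table pattern also generalizes to any aggregate. As a sketch extending what the article shows (not from the original), the same construct can return the minimum and the count of non-NULL columns per row:

```sql
-- Any aggregate works against the derived table;
-- COUNT(val) counts only the non-NULL column values.
SELECT keycol,
  (SELECT MAX(val)   FROM (VALUES(col1),(col2),(col3),(col4),(col5)) AS D(val)) AS mx,
  (SELECT MIN(val)   FROM (VALUES(col1),(col2),(col3),(col4),(col5)) AS D(val)) AS mn,
  (SELECT COUNT(val) FROM (VALUES(col1),(col2),(col3),(col4),(col5)) AS D(val)) AS cnt
FROM dbo.T1;
```

As an aside, SQL Server 2022 later introduced built-in GREATEST and LEAST functions, which ignore NULL inputs and cover the MAX and MIN cases of this task directly.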

Cheers,

BG

 
