Aggregates With an Outer Reference
Itzik talks about a restriction in T-SQL related to aggregate functions with an outer reference and provides a workaround.
April 27, 2011
Before getting started I’d like to thank Simon Sabin, Paul White and Umachandar Jayachandran (UC) for their input in a discussion we held on the topic.
Related: Partial Aggregate
T-SQL doesn’t support a certain form of aggregate expressions that include outer references. As an example, consider the following query:
USE AdventureWorks2008R2;SELECT CustomerID, SalesOrderID, SubTotal, (SELECT AVG(O2.SubTotal - O1.SubTotal) FROM Sales.SalesOrderHeader AS O2 WHERE O2.CustomerID = O1.CustomerID AND O2.SalesOrderID O1.SalesOrderID) AS AvgDiffFROM Sales.SalesOrderHeader AS O1;
This query attempts to return, for each order, the average of all differences between the current order value and the values of all other orders by the same customer. However, when you try running this query, you get the following error:
Msg 8124, Level 16, State 1, Line 4Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
The source for the restriction is in the SQL-92 standard, and SQL Server inherited it from the Sybase codebase. The problem is that SQL Server needs to figure out which query will compute the aggregate.
As you can see, the error message doesn’t say that outer references in aggregate expressions are banned altogether, but those that involve both outer and inner references are (even though the message text isn’t very clear about that). Here’s a simple example showing an aggregate expression with an outer reference that is allowed since you’re not mixing things:
SELECT (SELECT AVG(O1.SubTotal) - AVG(O2.SubTotal) FROM Sales.SalesOrderHeader AS O2) AS AvgDiffFROM Sales.SalesOrderHeader AS O1WHERE CustomerID = 29825;
Since there are only outer references in the first aggregate expression, it’s clear to SQL Server that the entire expression (AVG(O1.SubTotal)) should be resolved against the outer query, which is logically equivalent to:
SELECT AVG(O1.SubTotal) - (SELECT AVG(O2.SubTotal) FROM Sales.SalesOrderHeader AS O2) AS AvgDiffFROM Sales.SalesOrderHeader AS O1WHERE CustomerID = 29825;
Since there are only inner references in the second aggregate expression, It’s also clear to SQL Server that the entire expression (AVG(O2.SubTotal)) should be resolved against the inner query. But when you try mixing both outer and inner references, that’s when SQL Server gets confused and complains:
SELECT (SELECT AVG(O1.SubTotal - O2.SubTotal) FROM Sales.SalesOrderHeader AS O2) AS AvgDiffFROM Sales.SalesOrderHeader AS O1WHERE CustomerID = 29825;Msg 8124, Level 16, State 1, Line 2Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
As mentioned, the message text is a bit misleading. It sounds like if you have an outer reference, only one outer column reference is allowed in the aggregate expression. But as the following supported example shows, that’s not really the case:
SELECT (SELECT AVG(O1.SubTotal + O1.TaxAmt) - AVG(O2.SubTotal + O2.TaxAmt) FROM Sales.SalesOrderHeader AS O2) AS AvgDiffFROM Sales.SalesOrderHeader AS O1WHERE CustomerID = 29825;
In short, the mixing of inner and outer references in aggregate expressions is what SQL Server seems to have a problem with.
Now let’s get back to the unsupported query I presented initially:
SELECT CustomerID, SalesOrderID, SubTotal, (SELECT AVG(O2.SubTotal - O1.SubTotal) FROM Sales.SalesOrderHeader AS O2 WHERE O2.CustomerID = O1.CustomerID AND O2.SalesOrderID O1.SalesOrderID) AS AvgDiffFROM Sales.SalesOrderHeader AS O1;
To remind you, this query attempts to return, for each order, the average of all differences between the current order value and the values of all other orders by the same customer. You want SQL Server to resolve the O2.SubTotal reference against the inner query and the O1.SubTotal reference against the outer query and apply the AVG function to the differences, but SQL Server won’t allow you to mix those. However, SQL Server will allow mixing those in expressions that are not aggregate expressions, e.g., predicates with correlations.
A simple workaround is to add another instance of the table as an inner instance and correlate that instance to the outer table by the table’s key. This way you can refer to elements from the outer row implicitly by referring to the respective elements from the new inner instance. Then the original inner instance can be joined to the new inner instance instead of being correlated to the outer one. But implicitly, it’s as if the original inner instance was correlated to the outer one. And then you can apply your aggregate expression mixing elements from the two inner instances—original and new. I bet this sounds confusing. Hopefully the code that implements this workaround will make things clearer:
SELECT CustomerID, SalesOrderID, SubTotal, (SELECT AVG(O2.SubTotal - T.SubTotal) FROM Sales.SalesOrderHeader AS T JOIN Sales.SalesOrderHeader AS O2 ON T.SalesOrderID = O1.SalesOrderID AND O2.CustomerID = T.CustomerID AND O2.SalesOrderID T.SalesOrderID) AS AvgDiffFROM Sales.SalesOrderHeader AS O1;
This query is supported and addresses the original request.
You will face the same problem when using the APPLY operator, where the right table expression includes aggregate expressions that mix inner and outer references. Here’s an example for an unsupported query:
SELECT O1.CustomerID, O1.SalesOrderID, O1.SubTotal, A.AvgDiff, A.MinDiff, A.MaxDiffFROM Sales.SalesOrderHeader AS O1 CROSS APPLY (SELECT AVG(O2.SubTotal - O1.SubTotal) AS AvgDiff, MIN(O2.SubTotal - O1.SubTotal) AS MinDiff, MAX(O2.SubTotal - O1.SubTotal) AS MaxDiff FROM Sales.SalesOrderHeader AS O2 WHERE O2.CustomerID = O1.CustomerID AND O2.SalesOrderID O1.SalesOrderID) AS A;
The workaround is the same:
SELECT O1.CustomerID, O1.SalesOrderID, O1.SubTotal, A.AvgDiff, A.MinDiff, A.MaxDiffFROM Sales.SalesOrderHeader AS O1 CROSS APPLY (SELECT AVG(O2.SubTotal - T.SubTotal) AS AvgDiff, MIN(O2.SubTotal - T.SubTotal) AS MinDiff, MAX(O2.SubTotal - T.SubTotal) AS MaxDiff FROM Sales.SalesOrderHeader AS T JOIN Sales.SalesOrderHeader AS O2 ON T.SalesOrderID = O1.SalesOrderID AND O2.CustomerID = T.CustomerID AND O2.SalesOrderID T.SalesOrderID) AS A;
If you have other workarounds of your own you are welcome to share those.
Cheers,
BG
About the Author
You May Also Like