Logical Query Processing Part 7: GROUP BY and HAVING update from October 2016
This article continues the series about logical query processing, which describes the logical, or conceptual, interpretation of queries.
October 13, 2016
This article continues the series about logical query processing, which describes the logical, or conceptual, interpretation of queries. Part 1 provided an overview of the topic and also a sample database called TSQLV4. It also provided sample queries which I’ll referred to as simple sample query and complex sample query. I’ll use the same sample database and queries in this article. Parts 2, 3, 4 and 5 covered the logical query processing aspects of the first major query clause—the FROM clause. Part 6 [BG: add link here] covered the second major query clause—the WHERE clause. In this article I continue the series by describing the logical query processing aspects of the third and fourth major query clauses—GROUP BY and HAVING, respectively.
Logical query processing flow chart including the GROUP BY and HAVING clauses
The optional GROUP BY clause is processed in the third logical query processing step, and the optional HAVING clause in the fourth. The third step, which processes the GROUP BY clause, operates on the virtual table returned by the second step, which processes the WHERE clause. The third step arranges the rows from the input table in groups based on the grouping set that you define in the GROUP BY clause. Then the fourth step filters groups based on the predicate that you specify in the HAVING clause. Only groups for which the predicate evaluates to true are returned, whereas groups for which the predicate evaluates to false or unknown are discarded. Figure 1 shows a flow chart with the logical query processing of the FROM, WHERE, GROUP BY and HAVING clauses.
Figure 1: Logical query processing flow chart – GROUP BY and HAVING
I’ll use simple sample query and complex sample query from part 1 in the series to demonstrate the inputs and outputs of the steps that process the GROUP BY and HAVING clauses. The following code includes the GROUP BY and HAVING clauses of our simple sample query:
SELECT ...FROM Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O ON C.custid = O.custidWHERE C.country = N'Spain'GROUP BY C.custidHAVING COUNT( O.orderid ) <= 3;
Following is the state of the data after handling the WHERE clause (step 2), and before handling the GROUP BY clause (step 3):
C.custid O.orderid----------- -----------8 103268 108018 1097022 NULL29 1036629 1042629 1056829 1088729 1092830 1030330 1055030 1062930 1087230 1087430 1088830 1091130 1094830 1100930 1103769 1028169 1028269 1030669 1091769 11013
This input set consists of 24 rows (showing here only the columns C.custid and O.orderid). The grouping set defined in our query is (C.custid), and since there are five distinct C.custid values in the input set, the result of the GROUP BY phase organizes the input rows in five groups, like so:
|---------|------------------|| Groups | Rows ||---------|---------|--------||C.custid |C.custid |orderid ||---------|---------|--------|| |8 |10326 ||8 |8 |10801 || |8 |10970 ||---------|---------|--------||22 |22 |NULL ||---------|---------|--------|| |29 |10366 || |29 |10426 ||29 |29 |10568 || |29 |10887 || |29 |10928 ||---------|---------|--------|| |30 |10303 || |30 |10550 || |30 |10629 || |30 |10872 ||30 |30 |10874 || |30 |10888 || |30 |10911 || |30 |10948 || |30 |11009 || |30 |11037 ||---------|---------|--------|| |69 |10281 || |69 |10282 ||69 |69 |10306 || |69 |10917 || |69 |11013 ||---------|---------|--------|
The HAVING phase applies the predicate COUNT( O.orderid ) <= 3 to filter groups. Only two out of the five input groups qualify, so the output of this step consists of only those two groups:
|---------|------------------|| Groups | Rows ||---------|---------|--------||C.custid |C.custid |orderid ||---------|---------|--------|| |8 |10326 ||8 |8 |10801 || |8 |10970 ||---------|---------|--------||22 |22 |NULL ||---------|---------|--------|
The following code has the GROUP BY and HAVING parts of our complex sample query:
SELECT ...FROM Sales.Customers AS C LEFT OUTER JOIN ( Sales.Orders AS O INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid AND O.orderdate >= '20160101' ) ON C.custid = O.custid CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city), OD.qty * OD.unitprice * (1 - OD.discount) ) ) AS A(custlocation, val)WHERE A.custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle')GROUP BY C.custid, A.custlocationHAVING COUNT( DISTINCT O.orderid ) <= 3;
The WHERE phase returned the following result, which consists of 27 rows:
C.custid A.custlocation O.orderid OD.productid A.val--------- --------------- ---------- ------------- -------------8 Spain.Madrid 10970 52 224.000000022 Spain.Madrid NULL NULL NULL57 France.Paris NULL NULL NULL69 Spain.Madrid 10917 30 25.890000069 Spain.Madrid 10917 60 340.000000069 Spain.Madrid 11013 23 90.000000069 Spain.Madrid 11013 42 56.000000069 Spain.Madrid 11013 45 190.000000069 Spain.Madrid 11013 68 25.000000074 France.Paris 10907 75 108.500000074 France.Paris 10964 18 375.000000074 France.Paris 10964 38 1317.500000074 France.Paris 10964 69 360.000000074 France.Paris 11043 11 210.000000089 USA.WA.Seattle 10861 17 1638.000000089 USA.WA.Seattle 10861 18 1250.000000089 USA.WA.Seattle 10861 21 400.000000089 USA.WA.Seattle 10861 33 87.500000089 USA.WA.Seattle 10861 62 147.900000089 USA.WA.Seattle 10904 58 198.750000089 USA.WA.Seattle 10904 62 1725.500000089 USA.WA.Seattle 11032 36 665.000000089 USA.WA.Seattle 11032 38 6587.500000089 USA.WA.Seattle 11032 59 1650.000000089 USA.WA.Seattle 11066 16 52.350000089 USA.WA.Seattle 11066 19 386.400000089 USA.WA.Seattle 11066 34 490.0000000
This set is used as the input to the GROUP BY phase, which in this query groups the rows by the grouping set (C.custid, A.custlocation). The step organizes the rows from the input set in the following six groups:
|---------|----------------------------------------------------------------|| Groups | Rows ||---------|---------|---------------|----------|-------------|-------------||C.custid |C.custid |A.custlocation |O.orderid |OD.productid |A.val ||---------|---------|---------------|----------|-------------|-------------||8 |8 |Spain.Madrid |10970 |52 |224.0000000 ||---------|---------|---------------|----------|-------------|-------------||22 |22 |Spain.Madrid |NULL |NULL |NULL ||---------|---------|---------------|----------|-------------|-------------||57 |57 |France.Paris |NULL |NULL |NULL ||---------|---------|---------------|----------|-------------|-------------|| |69 |Spain.Madrid |10917 |30 |25.8900000 || |69 |Spain.Madrid |10917 |60 |340.0000000 ||69 |69 |Spain.Madrid |11013 |23 |90.0000000 || |69 |Spain.Madrid |11013 |42 |56.0000000 || |69 |Spain.Madrid |11013 |45 |190.0000000 || |69 |Spain.Madrid |11013 |68 |25.0000000 ||---------|---------|---------------|----------|-------------|-------------|| |74 |France.Paris |10907 |75 |108.5000000 || |74 |France.Paris |10964 |18 |375.0000000 ||74 |74 |France.Paris |10964 |38 |1317.5000000 || |74 |France.Paris |10964 |69 |360.0000000 || |74 |France.Paris |11043 |11 |210.0000000 ||---------|---------|---------------|----------|-------------|-------------|| |89 |USA.WA.Seattle |10861 |17 |1638.0000000 || |89 |USA.WA.Seattle |10861 |18 |1250.0000000 || |89 |USA.WA.Seattle |10861 |21 |400.0000000 || |89 |USA.WA.Seattle |10861 |33 |87.5000000 || |89 |USA.WA.Seattle |10861 |62 |147.9000000 || |89 |USA.WA.Seattle |10904 |58 |198.7500000 ||89 |89 |USA.WA.Seattle |10904 |62 |1725.5000000 || |89 |USA.WA.Seattle |11032 |36 |665.0000000 || |89 |USA.WA.Seattle |11032 |38 |6587.5000000 || |89 |USA.WA.Seattle |11032 |59 |1650.0000000 || |89 |USA.WA.Seattle |11066 |16 |52.3500000 || |89 |USA.WA.Seattle |11066 |19 |386.4000000 || |89 |USA.WA.Seattle |11066 |34 |490.0000000 ||---------|---------|---------------|----------|-------------|-------------|
The query then filters groups based on the HAVING predicate COUNT( DISTINCT O.orderid ) <= 3, returning five out of the original six groups:
|---------|----------------------------------------------------------------|| Groups | Rows ||---------|---------|---------------|----------|-------------|-------------||C.custid |C.custid |A.custlocation |O.orderid |OD.productid |A.val ||---------|---------|---------------|----------|-------------|-------------||8 |8 |Spain.Madrid |10970 |52 |224.0000000 ||---------|---------|---------------|----------|-------------|-------------||22 |22 |Spain.Madrid |NULL |NULL |NULL ||---------|---------|---------------|----------|-------------|-------------||57 |57 |France.Paris |NULL |NULL |NULL ||---------|---------|---------------|----------|-------------|-------------|| |69 |Spain.Madrid |10917 |30 |25.8900000 || |69 |Spain.Madrid |10917 |60 |340.0000000 ||69 |69 |Spain.Madrid |11013 |23 |90.0000000 || |69 |Spain.Madrid |11013 |42 |56.0000000 || |69 |Spain.Madrid |11013 |45 |190.0000000 || |69 |Spain.Madrid |11013 |68 |25.0000000 ||---------|---------|---------------|----------|-------------|-------------|| |74 |France.Paris |10907 |75 |108.5000000 || |74 |France.Paris |10964 |18 |375.0000000 ||74 |74 |France.Paris |10964 |38 |1317.5000000 || |74 |France.Paris |10964 |69 |360.0000000 || |74 |France.Paris |11043 |11 |210.0000000 ||---------|---------|---------------|----------|-------------|-------------|
If you’re wondering why the HAVING predicate computes a distinct count of order IDs, that’s because the query joins the Sales.Orders table with the Sales.OrderDetails table, resulting in a row per order line—not a row per order. The HAVING filter in our query is supposed to return only groups that have an order count—not an order line count—that is less than or equal to three.
Detail is discarded
The GROUP by phase organizes the rows in groups, and from that point in the logical processing of the query, your expressions don’t have direct access to the detailed state of the data anymore. This applies to all subsequent phases, which are the steps that process the HAVING (step 4), SELECT (step 5) and ORDER BY (step 6) clauses of the query. If in those clauses you want to refer to a column from the input tables, you can do so directly only if that column is part of the query’s grouping set. In order to refer to a column that is not part of the query’s grouping set, that column must be contained in a group aggregate function. For example, the following query is invalid since in the SELECT list you refer to the column orderid, and this column is neither part of your grouping set, nor is it contained in a group aggregate function:
SELECT custid, orderidFROM Sales.OrdersGROUP BY custid;
You get the following error:
Msg 8120, Level 16, State 1, Line 48Column 'Sales.Orders.orderid' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
The reasoning behind this restriction is that within a single group you can have rows with different order ID values, but the result will consist of only one row per group. However, applying an aggregate function to a column (or an expression based on a column) guarantees that there will be only one result value per group. For example, the following query computes the maximum order ID per customer, and therefore is a perfectly valid query:
SELECT custid, MAX(orderid) AS lastorderFROM Sales.OrdersGROUP BY custid;
Theoretically SQL could allow exceptions to the above rule in cases where based on constraints it could infer that certain columns can have only one distinct value per group, even if they’re not part of the grouping set. However, SQL doesn’t support such implied inference. As an example, consider the following query:
SELECT C.custid, C.companyname, MAX(O.orderid) AS lastorderFROM Sales.Customers AS C INNER JOIN Sales.Orders AS O ON C.custid = O.custidGROUP BY C.custid;
There’s a primary key constraint defined on the custid column in the Sales.Customers table, enforcing the uniqueness of the custid values. This means that all other columns in the table are functionally dependent on the custid column. So, in the result of the join between Sales.Customers and Sales.Orders, all rows with the same custid value will always share the same values in all of the rest of the columns. So in the above query, SQL could theoretically infer that since C.custid is part of the grouping set, there will always be precisely one C.companyname value associated with each group. However, as mentioned, SQL (and the same applies to T-SQL) doesn’t support this sort of inference, and you get the following error when trying to execute this query:
Msg 8120, Level 16, State 1, Line 56Column 'Sales.Customers.companyname' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
There are two main workarounds that SQL supports. One is to add the C.companyname column to the grouping set, like so (call this Query 1):
SELECT C.custid, C.companyname, MAX(O.orderid) AS lastorderFROM Sales.Customers AS C INNER JOIN Sales.Orders AS O ON C.custid = O.custidGROUP BY C.custid, C.companyname;
Another is to apply an artificial aggregate function like MIN or MAX to the C.companyname column, like so (call this Query 2):
SELECT C.custid, MAX(C.companyname) AS companyname, MAX(O.orderid) AS lastorderFROM Sales.Customers AS C INNER JOIN Sales.Orders AS O ON C.custid = O.custidGROUP BY C.custid;
If you’re wondering which solution is more efficient, examine the query plans shown in Figure 2:
Figure 2: Plans for Query 1 and Query 2
Observe that in the plan for Query 1, the optimizer applied a pushdown of the grouping and aggregation work right after scanning the data from the Orders table, prior to applying the join with the Customers table. It figures that since the join is an equijoin, grouping by O.custid is the same as grouping by C.custid, and then the pushdown can be applied while preserving the original query meaning. This leaves the join with less work to do compared to processing the grouping and aggregation after the join. In the plan for Query 2, the optimizer doesn’t realize that the aggregate against the C.companyname column is artificial, and handles the grouping and aggregation after the join. Personally, I find Query 1 also to be the clearer and more natural of the two. Since Query 1 is both more efficient and more natural, it’s generally my preferred choice between the two.
Aggregating data without discarding the detail
Compared to getting the raw, or detailed, state of the data, grouping gives you new insights in the form of aggregated calculations. However, at the same time grouping also discards the detail, as demonstrated earlier. For example, consider the following query:
SELECT custid, SUM(val) AS custtotalFROM Sales.OrderValuesGROUP BY custid;
The source view has a row per order, but since the query groups the data by custid, the output contains a row per customer, with the customer’s total order values:
custid custtotal------- ----------1 4273.002 1402.953 7023.98...(89 row(s) affected)
But what if you don’t want to discard the detail? For example, suppose that you want to return the detailed information about the order and in addition you want to compute the percentage of the current order value out of the customer total. You need the percentage calculation to divide the detailed order value by the aggregate of all order values for the same customer. You could achieve this by writing a query that groups the data by custid and computes the customer total, then define a table expression based on that query, and use an outer query to join the table expression with the Sales.OrderValues view to match the detail with the aggregates. But this adds complexity to the code. Alternatively, you can compute the aggregate using a window function instead of a group function, like so:
SELECT custid, orderid, val, val / SUM(val) OVER(PARTITION BY custid) AS pctcustFROM Sales.OrderValues;
Unlike grouping, which discards the detail, windowing doesn’t discard the detail. Th aggregate is computed against a window of rows that is derived from the underlying query result set, and is defined by the function’s OVER clause. Since the underlying query result is established only when logical query processing reaches the SELECT phase (after processing the FROM, WHERE, GROUP BY and HAVING phases), window functions are allowed only in the SELECT and ORDER BY clauses of the query. If you specify an empty OVER clause, the function operates on the complete underlying query result, and you would get the grand total. If you add a window partition clause, like in the above query, the function operates on a restricted partition, which in our example restricts the rows to only the ones where the custid value is the same as in the current row. In other words, our window function computes the customer total. The window function returns its result without discarding the detail, and hence you can combine detail elements from the row with the result of the window aggregate function. Our query divides the current order value by the customer total to compute the percentage: val / SUM(val) OVER(PARTITION BY custid). This query generates the following output:
custid orderid val pctcust------- -------- -------- -----------------------------1 10643 814.50 0.190615492628130119354083781 10692 878.00 0.205476246197051252047741631 10702 330.00 0.077229113035338169904048671 10835 845.80 0.197940556985724315469225361 10952 471.20 0.110273812309852562602387081 11011 933.50 0.218464778843903580622513452 10926 514.40 0.366655974910011048148544132 10759 320.00 0.228090808653195053280587332 10625 479.75 0.341958017035532271285505542 10308 88.80 0.063295199401261627285362983 10365 403.20 0.057403352515240647040566743 10507 749.06 0.106643242150461704048132253 10535 1940.85 0.276317700221242087819156663 10677 813.37 0.115799019928872234829825823 10573 2082.00 0.296413144684352745879116963 10682 375.50 0.053459719418335473620369073 10856 660.00 0.09396382108149510676283246...(830 row(s) affected)
Curiously, grouping and windowing aren't mutually exclusive. You can apply window functions against grouped data. Just remember that grouping happens in step 3 in logical query processing, and windowing happens post grouping, in the SELECT (step 5) or ORDER BY (step 6) phases. The basic rule that you want to remember is that unlike group aggregate functions which you can apply to detail elements as their inputs, window functions can only be applied to elements that would have otherwise normally been allowed in the SELECT clause. For instance, suppose that you want to write a grouped query where you compute customer total order values, and in addition, the percent of the customer total out of the grand total. You might attempt to use the following query:
SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(val) OVER() AS pctFROM Sales.OrderValuesGROUP BY custid;
The reference to the val column in the group aggregate SUM functions is perfectly valid, but the reference to the val column in the window aggregate SUM function is invalid, just like such a reference would have been invalid directly in the SELECT clause. You get the following error:
Msg 8120, Level 16, State 1, Line 137Column 'Sales.OrderValues.val' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
You might try an alternative where you apply the window function to the alias custtotal:
SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(custtotal) OVER() AS pctFROM Sales.OrderValuesGROUP BY custid;
But this attempt also fails since such a reference would have been invalid directly in the SELECT clause due to the set-based treatment of expressions in the same logical phase. You get the following error:
Msg 207, Level 16, State 1, Line 145Invalid column name 'custtotal'.
What is valid, surprisingly (or not), is to apply the window function to the group function, like so:
SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(SUM(val)) OVER() AS pctFROM Sales.OrderValuesGROUP BY custid;
This time, the query runs successfully and you get the following output:
custid custtotal pct------- ---------- ---------1 4273.00 0.0033752 1402.95 0.0011083 7023.98 0.005549...(89 row(s) affected)
Similarly, suppose that you wanted to group the data from the Sales.OrderValues view by custid and orderdate, and in addition to computing the daily total, you want to compute the running total values from the beginning of the customer’s activity until the current date. As shown in the previous example, you can combine grouping and windowing to achieve this task, like so:
SELECT custid, orderdate, SUM(val) AS daytotal, SUM(SUM(val)) OVER(PARTITION BY custid ORDER BY orderdate ROWS UNBOUNDED PRECEDING) AS runtotalFROM Sales.OrderValuesGROUP BY custid, orderdate;
This query generates the following output:
custid orderdate daytotal runtotal------- ---------- --------- ---------1 2015-08-25 814.50 814.501 2015-10-03 878.00 1692.501 2015-10-13 330.00 2022.501 2016-01-15 845.80 2868.301 2016-03-16 471.20 3339.501 2016-04-09 933.50 4273.002 2014-09-18 88.80 88.802 2015-08-08 479.75 568.552 2015-11-28 320.00 888.552 2016-03-04 514.40 1402.953 2014-11-27 403.20 403.203 2015-04-15 749.06 1152.263 2015-05-13 1940.85 3093.113 2015-06-19 2082.00 5175.113 2015-09-22 813.37 5988.483 2015-09-25 375.50 6363.983 2016-01-28 660.00 7023.98...(823 row(s) affected)
HAVING versus WHERE
The HAVING clause serves a similar filtering purpose to the WHERE clause, where it filters data based on a predicate. The key difference between the two is that the WHERE clause filters rows before grouping, whereas the HAVING clause filters whole groups after grouping. So at the group level, the HAVING clause determines whether to keep the group or discard it based on the outcome of the predicate (if true, keep, if false or unknown, discard). As an example, suppose that you need to query the Sales.OrderValues view, and filter only orders that were placed on or after May 1st, 2016. You want to group the remaining orders by employee, and filter only groups having three orders or fewer. For qualifying groups, you want to return the employee ID and the order count. Here, the filter against the orderdate column has to be applied in the WHERE clause because it’s supposed to be treated as a row filter: WHERE orderdate >= '20160501'. Since orderdate is not part of the query’s grouping set, you can’t refer to it in the HAVING clause, unless it is contained within an aggregate function. But then it’s not like you want to filter groups where the outcome of an aggregate against the orderdate column satisfies some condition; you want to filter rows representing orders placed on or after a certain date. Conversely, after grouping, you want to apply a group filter that keeps only employee groups having three or fewer orders. Since this filter involves an aggregate calculation of the count of orders and needs to be applied at the group level, it has to be specified in the HAVING clause: HAVING COUNT(*) <= 3. Here’s the complete solution query:
SELECT empid, COUNT(*) AS numordersFROM Sales.OrderValuesWHERE orderdate >= '20160501'GROUP BY empidHAVING COUNT(*) <= 3;
This query generates the following output:
empid numorders----------- -----------2 24 27 28 3
If you need to apply a filter based on a column that is part of the query’s grouping set, e.g., empid > 0 in our query, you can choose whether to apply it as a row filter in the WHERE clause or as a group filter in the HAVING clause. The outcome is the same. For most people, in such a case it is more natural to apply the filter in the WHERE clause. SQL Server’s optimizer knows this and very likely will create the same plan in both cases.
GROUP BY ALL
T-SQL supports a nonstandard feature called GROUP BY ALL. It’s a pretty interesting feature, but you should be aware that the official documentation for SQL Server has a note advising to refrain from using it since Microsoft is planning to drop support for it at some point. Here I wanted to describe the logical query processing aspects of this feature and provide a recommended alternative.
In essence, the GROUP BY ALL option preserves empty groups that were filtered out by the WHERE clause. Any aggregate that you apply to those groups operates on an empty set. A COUNT(*) aggregate against such groups returns 0. As an example, consider the following query (call it Query 3):
SELECT empid, COUNT(*) AS numordersFROM Sales.OrderValuesWHERE orderdate >= '20160501'GROUP BY ALL empidHAVING COUNT(*) <= 3;
The filter in the WHERE clause eliminates completely all rows for customers 3, 5, 6 nd 9. Earlier you saw that the result of a similar query without the GROUP BY ALL option didn’t return the groups for these customers. But since the GROUP BY ALL option preserves empty groups, the output of this query includes the groups for these customers with a count of 0:
empid numorders----------- -----------2 23 04 25 06 07 28 39 0
As mentioned, Microsoft announced their intention to drop support for this feature in the future. Here’s the alternative to the previous query that should keep working in the future (call this Query 4):
SELECT empid, COUNT(tokeep) AS numordersFROM Sales.OrderValues CROSS APPLY ( VALUES( CASE WHEN orderdate >= '20160501' THEN 1 END ) ) AS A(tokeep)GROUP BY empidHAVING COUNT(tokeep) <= 3;
Instead of filtering the rows in the WHERE clause, the query uses a CASE expression based on the same predicate to determine whether to keep the value (tokeep column is 1) or to ignore it (tokeep column is NULL). You then apply your aggregate calculation to the tokeep column. Not only that this solution uses only supported elements that are likely to stay supported, it is also more efficient than the GROUP BY ALL solution. Figure 3 shows the query plans for Query 3 (GROUP BY ALL solution) and Query 4 (solution with CASE expression) using SQL Sentry’s plan explorer.
Figure 3: Plans for Query 3 and Query 4
The plan for the GROUP BY ALL solution (Query 3) repeats the scanning of the input data twice—once (the top branch of the plan) to return all rows without applying the WHERE filter along with a constant flag NULL, and a second time (bottom branch of the plan) with applying the WHERE filter along with a constant flag 0. The plan then concatenates the results, groups the rows by empid, computes the aggregates COUNT(flag), and finally applies the HAVING filter. Conversely, the plan for the solution based on the CASE expression (Query 4) scans the input data only once.
Grouping sets
Traditional group queries compute aggregates for a single grouping set (a single set of expressions that you group by). However, sometimes you need to write queries that produce aggregates for multiple grouping sets, usually for reporting purposes. For example, suppose that you want to query the Sales.Orders table, and compute daily, monthly, yearly, and grand order counts. You could achieve this by writing a separate grouped query for each grouping set, and then unify the results, like so:
WITH C AS( SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders FROM Sales.Orders CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) ) AS A(orderyear, ordermonth, orderday) GROUP BY orderyear, ordermonth, orderday UNION ALL SELECT orderyear, ordermonth, NULL AS orderday, COUNT(*) AS numorders FROM Sales.Orders CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate)) ) AS A(orderyear, ordermonth) GROUP BY orderyear, ordermonth UNION ALL SELECT orderyear, NULL AS ordermonth, NULL AS orderday, COUNT(*) AS numorders FROM Sales.Orders CROSS APPLY ( VALUES(YEAR(orderdate)) ) AS A(orderyear) GROUP BY orderyear UNION ALL SELECT NULL AS orderyear, NULL AS ordermonth, NULL AS orderday, COUNT(*) AS numorders FROM Sales.Orders)SELECT orderyear, ordermonth, orderday, numordersFROM CORDER BY CASE WHEN orderyear IS NOT NULL THEN 0 ELSE 1 END, orderyear, CASE WHEN ordermonth IS NOT NULL THEN 0 ELSE 1 END, ordermonth, CASE WHEN orderday IS NOT NULL THEN 0 ELSE 1 END, orderday;
Observe that the code uses NULLs as place holders for elements that are not part of the current grouping set but are relevant in other grouping sets. The expressions in the ORDER BY clause guarantee hierarchical presentation order (daily counts followed by month total, monthly counts followed by year total, and finally the grand total). This query generates the following output, shown here in abbreviated form:
orderyear ordermonth orderday numorders----------- ----------- ----------- -----------2014 7 4 12014 7 5 12014 7 8 22014 7 9 12014 7 10 12014 7 11 12014 7 12 12014 7 15 12014 7 16 12014 7 17 12014 7 18 12014 7 19 22014 7 22 12014 7 23 12014 7 24 12014 7 25 12014 7 26 12014 7 29 12014 7 30 12014 7 31 12014 7 NULL 22...2016 4 29 32016 4 30 42016 4 NULL 742016 5 1 32016 5 4 32016 5 5 42016 5 6 42016 5 NULL 142016 NULL NULL 270NULL NULL NULL 830(507 row(s) affected)
SQL (and so does T-SQL) Supports a much more elegant solution for our need using a clause called GROUPING SETS, which you specify in the GROUP BY clause, and in which you list multiple grouping sets, like so:
SELECT orderyear, ordermonth, orderday, COUNT(*) AS numordersFROM Sales.Orders CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) ) AS A(orderyear, ordermonth, orderday)GROUP BY GROUPING SETS( (orderyear, ordermonth, orderday), (orderyear, ordermonth), (orderyear), ())ORDER BY GROUPING(orderyear), orderyear, GROUPING(ordermonth), ordermonth, GROUPING(orderday), orderday;
The GROUPING function returns a 0 for an input element if it’s part of the grouping set and 1 if it’s not (when it’s an aggregate). In our query it simplifies the expressions in the ORDER BY clause that guarantee hierarchical presentation ordering. From a logical query processing perspective, the GROUPING SETS clause allows you to define multiple grouping sets, and as a result associate each row returned from the WHERE pahse possibly with multiple groups instead of just one.
Back to the last query, when you need to compute all grouping sets that represent leading combination in a hierarchy, like our time hierarchy, instead of explicitly listing all those grouping sets in the GROUPING SETS clause, you can use an abbreviated syntax with the ROLLUP clause, like so:
SELECT orderyear, ordermonth, orderday, COUNT(*) AS numordersFROM Sales.Orders CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) ) AS A(orderyear, ordermonth, orderday)GROUP BY ROLLUP(orderyear, ordermonth, orderday)ORDER BY GROUPING(orderyear), orderyear, GROUPING(ordermonth), ordermonth, GROUPING(orderday), orderday;
This ROLLUP clause is equivalent to the previous query’s GROUPING SETS clause. It defines all grouping sets that represent leading combination of the input expressions.
For more information about grouping sets, see Grouping Sets Part 1 and Grouping Sets Part 2.
Conclusion
In this article I covered the logical query processing aspects of the GROUP BY and HAVING clauses. I provided the logical query processing flow chart with the FROM, WHERE, GROUP BY and HAVING clauses. I explained that grouping discards the detail and provided an alternative in the form of windowing, which doesn’t discard the detail. I explained the difference between the WHERE and HAVING predicates. I also covered the proprietary GROUP BY ALL option and provided an alternative that is more efficient and that is likely to be supported in the future, unlike GROUP BY ALL. Finally, I showed how to define multiple grouping sets in a single query using the GROUPSING SETS and ROLLUP clauses. Next month I’ll continue the discussion about logical query processing, covering the SELECT and ORDER BY clauses.
About the Author
You May Also Like