Logical Query Processing Part 7: GROUP BY and HAVING update from October 2016

This article continues the series about logical query processing, which describes the logical, or conceptual, interpretation of queries.

ITPro Today

October 13, 2016

23 Min Read
ITPro Today logo in a gray background | ITPro Today

This article continues the series about logical query processing, which describes the logical, or conceptual, interpretation of queries. Part 1 provided an overview of the topic and also a sample database called TSQLV4. It also provided sample queries which I’ll referred to as simple sample query and complex sample query. I’ll use the same sample database and queries in this article. Parts 2, 3, 4 and 5 covered the logical query processing aspects of the first major query clause—the FROM clause. Part 6 [BG: add link here] covered the second major query clause—the WHERE clause. In this article I continue the series by describing the logical query processing aspects of the third and fourth major query clauses—GROUP BY and HAVING, respectively.

Logical query processing flow chart including the GROUP BY and HAVING clauses

The optional GROUP BY clause is processed in the third logical query processing step, and the optional HAVING clause in the fourth. The third step, which processes the GROUP BY clause, operates on the virtual table returned by the second step, which processes the WHERE clause. The third step arranges the rows from the input table in groups based on the grouping set that you define in the GROUP BY clause. Then the fourth step filters groups based on the predicate that you specify in the HAVING clause. Only groups for which the predicate evaluates to true are returned, whereas groups for which the predicate evaluates to false or unknown are discarded. Figure 1 shows a flow chart with the logical query processing of the FROM, WHERE, GROUP BY and HAVING clauses.

Figure 1: Logical query processing flow chart – GROUP BY and HAVING

I’ll use simple sample query and complex sample query from part 1 in the series to demonstrate the inputs and outputs of the steps that process the GROUP BY and HAVING clauses. The following code includes the GROUP BY and HAVING clauses of our simple sample query:

SELECT ...FROM Sales.Customers AS C  LEFT OUTER JOIN Sales.Orders AS O    ON C.custid = O.custidWHERE C.country = N'Spain'GROUP BY C.custidHAVING COUNT( O.orderid ) <= 3;

Following is the state of the data after handling the WHERE clause (step 2), and before handling the GROUP BY clause (step 3):

C.custid    O.orderid----------- -----------8       103268       108018       1097022      NULL29      1036629      1042629      1056829      1088729      1092830      1030330      1055030      1062930      1087230      1087430      1088830      1091130      1094830      1100930      1103769      1028169      1028269      1030669      1091769      11013

This input set consists of 24 rows (showing here only the columns C.custid and O.orderid). The grouping set defined in our query is (C.custid), and since there are five distinct C.custid values in the input set, the result of the GROUP BY phase organizes the input rows in five groups, like so:

|---------|------------------|| Groups  |       Rows       ||---------|---------|--------||C.custid |C.custid |orderid ||---------|---------|--------||     |8    |10326   ||8    |8    |10801   ||     |8    |10970   ||---------|---------|--------||22       |22       |NULL    ||---------|---------|--------||     |29       |10366   ||     |29       |10426   ||29       |29       |10568   ||     |29       |10887   ||     |29       |10928   ||---------|---------|--------||     |30       |10303   ||     |30       |10550   ||     |30       |10629   ||     |30       |10872   ||30       |30       |10874   ||     |30       |10888   ||     |30       |10911   ||     |30       |10948   ||     |30       |11009   ||     |30       |11037   ||---------|---------|--------||     |69       |10281   ||     |69       |10282   ||69       |69       |10306   ||     |69       |10917   ||     |69       |11013   ||---------|---------|--------|

The HAVING phase applies the predicate COUNT( O.orderid ) <= 3 to filter groups. Only two out of the five input groups qualify, so the output of this step consists of only those two groups:

|---------|------------------|| Groups  |       Rows       ||---------|---------|--------||C.custid |C.custid |orderid ||---------|---------|--------||     |8    |10326   ||8    |8    |10801   ||     |8    |10970   ||---------|---------|--------||22       |22       |NULL    ||---------|---------|--------|

The following code has the GROUP BY and HAVING parts of our complex sample query:

SELECT ...FROM Sales.Customers AS C  LEFT OUTER JOIN  ( Sales.Orders AS O      INNER JOIN Sales.OrderDetails AS OD        ON O.orderid = OD.orderid        AND O.orderdate >= '20160101' )    ON C.custid = O.custid  CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city),            OD.qty * OD.unitprice * (1 - OD.discount) )       ) AS A(custlocation, val)WHERE A.custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle')GROUP BY C.custid, A.custlocationHAVING COUNT( DISTINCT O.orderid ) <= 3;

The WHERE phase returned the following result, which consists of 27 rows:

C.custid  A.custlocation  O.orderid  OD.productid  A.val--------- --------------- ---------- ------------- -------------8     Spain.Madrid    10970      52        224.000000022    Spain.Madrid    NULL       NULL      NULL57    France.Paris    NULL       NULL      NULL69    Spain.Madrid    10917      30        25.890000069    Spain.Madrid    10917      60        340.000000069    Spain.Madrid    11013      23        90.000000069    Spain.Madrid    11013      42        56.000000069    Spain.Madrid    11013      45        190.000000069    Spain.Madrid    11013      68        25.000000074    France.Paris    10907      75        108.500000074    France.Paris    10964      18        375.000000074    France.Paris    10964      38        1317.500000074    France.Paris    10964      69        360.000000074    France.Paris    11043      11        210.000000089    USA.WA.Seattle  10861      17        1638.000000089    USA.WA.Seattle  10861      18        1250.000000089    USA.WA.Seattle  10861      21        400.000000089    USA.WA.Seattle  10861      33        87.500000089    USA.WA.Seattle  10861      62        147.900000089    USA.WA.Seattle  10904      58        198.750000089    USA.WA.Seattle  10904      62        1725.500000089    USA.WA.Seattle  11032      36        665.000000089    USA.WA.Seattle  11032      38        6587.500000089    USA.WA.Seattle  11032      59        1650.000000089    USA.WA.Seattle  11066      16        52.350000089    USA.WA.Seattle  11066      19        386.400000089    USA.WA.Seattle  11066      34        490.0000000

This set is used as the input to the GROUP BY phase, which in this query groups the rows by the grouping set (C.custid, A.custlocation). The step organizes the rows from the input set in the following six groups:

|---------|----------------------------------------------------------------|| Groups  |                Rows                ||---------|---------|---------------|----------|-------------|-------------||C.custid |C.custid |A.custlocation |O.orderid |OD.productid |A.val    ||---------|---------|---------------|----------|-------------|-------------||8    |8    |Spain.Madrid   |10970     |52       |224.0000000  ||---------|---------|---------------|----------|-------------|-------------||22       |22       |Spain.Madrid   |NULL      |NULL     |NULL     ||---------|---------|---------------|----------|-------------|-------------||57       |57       |France.Paris   |NULL      |NULL     |NULL     ||---------|---------|---------------|----------|-------------|-------------||     |69       |Spain.Madrid   |10917     |30       |25.8900000   ||     |69       |Spain.Madrid   |10917     |60       |340.0000000  ||69       |69       |Spain.Madrid   |11013     |23       |90.0000000   ||     |69       |Spain.Madrid   |11013     |42       |56.0000000   ||     |69       |Spain.Madrid   |11013     |45       |190.0000000  ||     |69       |Spain.Madrid   |11013     |68       |25.0000000   ||---------|---------|---------------|----------|-------------|-------------||     |74       |France.Paris   |10907     |75       |108.5000000  ||     |74       |France.Paris   |10964     |18       |375.0000000  ||74       |74       |France.Paris   |10964     |38       |1317.5000000 ||     |74       |France.Paris   |10964     |69       |360.0000000  ||     |74       |France.Paris   |11043     |11       |210.0000000  ||---------|---------|---------------|----------|-------------|-------------||     |89       |USA.WA.Seattle |10861     |17       |1638.0000000 ||     |89       |USA.WA.Seattle |10861     |18       |1250.0000000 ||     |89       |USA.WA.Seattle |10861     |21       |400.0000000  ||     |89       |USA.WA.Seattle |10861     |33       |87.5000000   ||     |89       |USA.WA.Seattle |10861     |62       |147.9000000  ||     |89       |USA.WA.Seattle |10904     |58       |198.7500000  ||89       |89       |USA.WA.Seattle |10904     |62       |1725.5000000 ||     |89       |USA.WA.Seattle |11032     |36       |665.0000000  ||     |89       |USA.WA.Seattle |11032     |38       |6587.5000000 ||     |89       |USA.WA.Seattle |11032     |59       |1650.0000000 ||     |89       |USA.WA.Seattle |11066     |16       |52.3500000   ||     |89       |USA.WA.Seattle |11066     |19       |386.4000000  ||     |89       |USA.WA.Seattle |11066     |34       |490.0000000  ||---------|---------|---------------|----------|-------------|-------------|

The query then filters groups based on the HAVING predicate COUNT( DISTINCT O.orderid ) <= 3, returning five out of the original six groups:

|---------|----------------------------------------------------------------|| Groups  |                Rows                ||---------|---------|---------------|----------|-------------|-------------||C.custid |C.custid |A.custlocation |O.orderid |OD.productid |A.val    ||---------|---------|---------------|----------|-------------|-------------||8    |8    |Spain.Madrid   |10970     |52       |224.0000000  ||---------|---------|---------------|----------|-------------|-------------||22       |22       |Spain.Madrid   |NULL      |NULL     |NULL     ||---------|---------|---------------|----------|-------------|-------------||57       |57       |France.Paris   |NULL      |NULL     |NULL     ||---------|---------|---------------|----------|-------------|-------------||     |69       |Spain.Madrid   |10917     |30       |25.8900000   ||     |69       |Spain.Madrid   |10917     |60       |340.0000000  ||69       |69       |Spain.Madrid   |11013     |23       |90.0000000   ||     |69       |Spain.Madrid   |11013     |42       |56.0000000   ||     |69       |Spain.Madrid   |11013     |45       |190.0000000  ||     |69       |Spain.Madrid   |11013     |68       |25.0000000   ||---------|---------|---------------|----------|-------------|-------------||     |74       |France.Paris   |10907     |75       |108.5000000  ||     |74       |France.Paris   |10964     |18       |375.0000000  ||74       |74       |France.Paris   |10964     |38       |1317.5000000 ||     |74       |France.Paris   |10964     |69       |360.0000000  ||     |74       |France.Paris   |11043     |11       |210.0000000  ||---------|---------|---------------|----------|-------------|-------------|

If you’re wondering why the HAVING predicate computes a distinct count of order IDs, that’s because the query joins the Sales.Orders table with the Sales.OrderDetails table, resulting in a row per order line—not a row per order. The HAVING filter in our query is supposed to return only groups that have an order count—not an order line count—that is less than or equal to three.

Detail is discarded

The GROUP by phase organizes the rows in groups, and from that point in the logical processing of the query, your expressions don’t have direct access to the detailed state of the data anymore. This applies to all subsequent phases, which are the steps that process the HAVING (step 4), SELECT (step 5) and ORDER BY (step 6) clauses of the query. If in those clauses you want to refer to a column from the input tables, you can do so directly only if that column is part of the query’s grouping set. In order to refer to a column that is not part of the query’s grouping set, that column must be contained in a group aggregate function. For example, the following query is invalid since in the SELECT list you refer to the column orderid, and this column is neither part of your grouping set, nor is it contained in a group aggregate function:

SELECT custid, orderidFROM Sales.OrdersGROUP BY custid;

You get the following error:

Msg 8120, Level 16, State 1, Line 48Column 'Sales.Orders.orderid' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

The reasoning behind this restriction is that within a single group you can have rows with different order ID values, but the result will consist of only one row per group. However, applying an aggregate function to a column (or an expression based on a column) guarantees that there will be only one result value per group. For example, the following query computes the maximum order ID per customer, and therefore is a perfectly valid query:

SELECT custid, MAX(orderid) AS lastorderFROM Sales.OrdersGROUP BY custid;

Theoretically SQL could allow exceptions to the above rule in cases where based on constraints it could infer that certain columns can have only one distinct value per group, even if they’re not part of the grouping set. However, SQL doesn’t support such implied inference. As an example, consider the following query:

SELECT C.custid, C.companyname, MAX(O.orderid) AS lastorderFROM Sales.Customers AS C  INNER JOIN Sales.Orders AS O    ON C.custid = O.custidGROUP BY C.custid;

There’s a primary key constraint defined on the custid column in the Sales.Customers table, enforcing the uniqueness of the custid values. This means that all other columns in the table are functionally dependent on the custid column. So, in the result of the join between Sales.Customers and Sales.Orders, all rows with the same custid value will always share the same values in all of the rest of the columns. So in the above query, SQL could theoretically infer that since C.custid is part of the grouping set, there will always be precisely one C.companyname value associated with each group. However, as mentioned, SQL (and the same applies to T-SQL) doesn’t support this sort of inference, and you get the following error when trying to execute this query:

Msg 8120, Level 16, State 1, Line 56Column 'Sales.Customers.companyname' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

There are two main workarounds that SQL supports. One is to add the C.companyname column to the grouping set, like so (call this Query 1):

SELECT C.custid, C.companyname, MAX(O.orderid) AS lastorderFROM Sales.Customers AS C  INNER JOIN Sales.Orders AS O    ON C.custid = O.custidGROUP BY C.custid, C.companyname;

Another is to apply an artificial aggregate function like MIN or MAX to the C.companyname column, like so (call this Query 2):

SELECT C.custid, MAX(C.companyname) AS companyname, MAX(O.orderid) AS lastorderFROM Sales.Customers AS C  INNER JOIN Sales.Orders AS O    ON C.custid = O.custidGROUP BY C.custid;

If you’re wondering which solution is more efficient, examine the query plans shown in Figure 2:

Figure 2: Plans for Query 1 and Query 2

Observe that in the plan for Query 1, the optimizer applied a pushdown of the grouping and aggregation work right after scanning the data from the Orders table, prior to applying the join with the Customers table. It figures that since the join is an equijoin, grouping by O.custid is the same as grouping by C.custid, and then the pushdown can be applied while preserving the original query meaning. This leaves the join with less work to do compared to processing the grouping and aggregation after the join. In the plan for Query 2, the optimizer doesn’t realize that the aggregate against the C.companyname column is artificial, and handles the grouping and aggregation after the join. Personally, I find Query 1 also to be the clearer and more natural of the two. Since Query 1 is both more efficient and more natural, it’s generally my preferred choice between the two.

Aggregating data without discarding the detail

Compared to getting the raw, or detailed, state of the data, grouping gives you new insights in the form of aggregated calculations. However, at the same time grouping also discards the detail, as demonstrated earlier. For example, consider the following query:

SELECT custid, SUM(val) AS custtotalFROM Sales.OrderValuesGROUP BY custid;

The source view has a row per order, but since the query groups the data by custid, the output contains a row per customer, with the customer’s total order values:

 

custid  custtotal------- ----------1       4273.002       1402.953       7023.98...(89 row(s) affected)

But what if you don’t want to discard the detail? For example, suppose that you want to return the detailed information about the order and in addition you want to compute the percentage of the current order value out of the customer total. You need the percentage calculation to divide the detailed order value by the aggregate of all order values for the same customer. You could achieve this by writing a query that groups the data by custid and computes the customer total, then define a table expression based on that query, and use an outer query to join the table expression with the Sales.OrderValues view to match the detail with the aggregates. But this adds complexity to the code. Alternatively, you can compute the aggregate using a window function instead of a group function, like so:

SELECT custid, orderid, val,  val / SUM(val) OVER(PARTITION BY custid) AS pctcustFROM Sales.OrderValues;

Unlike grouping, which discards the detail, windowing doesn’t discard the detail. Th aggregate is computed against a window of rows that is derived from the underlying query result set, and is defined by the function’s OVER clause. Since the underlying query result is established only when logical query processing reaches the SELECT phase (after processing the FROM, WHERE, GROUP BY and HAVING phases), window functions are allowed only in the SELECT and ORDER BY clauses of the query. If you specify an empty OVER clause, the function operates on the complete underlying query result, and you would get the grand total. If you add a window partition clause, like in the above query, the function operates on a restricted partition, which in our example restricts the rows to only the ones where the custid value is the same as in the current row. In other words, our window function computes the customer total. The window function returns its result without discarding the detail, and hence you can combine detail elements from the row with the result of the window aggregate function. Our query divides the current order value by the customer total to compute the percentage: val / SUM(val) OVER(PARTITION BY custid). This query generates the following output:

custid  orderid  val      pctcust------- -------- -------- -----------------------------1       10643    814.50   0.190615492628130119354083781       10692    878.00   0.205476246197051252047741631       10702    330.00   0.077229113035338169904048671       10835    845.80   0.197940556985724315469225361       10952    471.20   0.110273812309852562602387081       11011    933.50   0.218464778843903580622513452       10926    514.40   0.366655974910011048148544132       10759    320.00   0.228090808653195053280587332       10625    479.75   0.341958017035532271285505542       10308    88.80    0.063295199401261627285362983       10365    403.20   0.057403352515240647040566743       10507    749.06   0.106643242150461704048132253       10535    1940.85  0.276317700221242087819156663       10677    813.37   0.115799019928872234829825823       10573    2082.00  0.296413144684352745879116963       10682    375.50   0.053459719418335473620369073       10856    660.00   0.09396382108149510676283246...(830 row(s) affected)

Curiously, grouping and windowing aren't mutually exclusive. You can apply window functions against grouped data. Just remember that grouping happens in step 3 in logical query processing, and windowing happens post grouping, in the SELECT (step 5) or ORDER BY (step 6) phases. The basic rule that you want to remember is that unlike group aggregate functions which you can apply to detail elements as their inputs, window functions can only be applied to elements that would have otherwise normally been allowed in the SELECT clause. For instance, suppose that you want to write a grouped query where you compute customer total order values, and in addition, the percent of the customer total out of the grand total. You might attempt to use the following query:

SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(val) OVER() AS pctFROM Sales.OrderValuesGROUP BY custid;

The reference to the val column in the group aggregate SUM functions is perfectly valid, but the reference to the val column in the window aggregate SUM function is invalid, just like such a reference would have been invalid directly in the SELECT clause. You get the following error:

Msg 8120, Level 16, State 1, Line 137Column 'Sales.OrderValues.val' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

You might try an alternative where you apply the window function to the alias custtotal:

SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(custtotal) OVER() AS pctFROM Sales.OrderValuesGROUP BY custid;

 

But this attempt also fails since such a reference would have been invalid directly in the SELECT clause due to the set-based treatment of expressions in the same logical phase. You get the following error:

Msg 207, Level 16, State 1, Line 145Invalid column name 'custtotal'.

What is valid, surprisingly (or not), is to apply the window function to the group function, like so:

SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(SUM(val)) OVER() AS pctFROM Sales.OrderValuesGROUP BY custid;

This time, the query runs successfully and you get the following output:

 

custid  custtotal  pct------- ---------- ---------1       4273.00    0.0033752       1402.95    0.0011083       7023.98    0.005549...(89 row(s) affected)

Similarly, suppose that you wanted to group the data from the Sales.OrderValues view by custid and orderdate, and in addition to computing the daily total, you want to compute the running total values from the beginning of the customer’s activity until the current date. As shown in the previous example, you can combine grouping and windowing to achieve this task, like so:

SELECT custid, orderdate, SUM(val) AS daytotal,  SUM(SUM(val)) OVER(PARTITION BY custid             ORDER BY orderdate             ROWS UNBOUNDED PRECEDING) AS runtotalFROM Sales.OrderValuesGROUP BY custid, orderdate;

This query generates the following output:

custid  orderdate  daytotal  runtotal------- ---------- --------- ---------1       2015-08-25 814.50    814.501       2015-10-03 878.00    1692.501       2015-10-13 330.00    2022.501       2016-01-15 845.80    2868.301       2016-03-16 471.20    3339.501       2016-04-09 933.50    4273.002       2014-09-18 88.80     88.802       2015-08-08 479.75    568.552       2015-11-28 320.00    888.552       2016-03-04 514.40    1402.953       2014-11-27 403.20    403.203       2015-04-15 749.06    1152.263       2015-05-13 1940.85   3093.113       2015-06-19 2082.00   5175.113       2015-09-22 813.37    5988.483       2015-09-25 375.50    6363.983       2016-01-28 660.00    7023.98...(823 row(s) affected)

 

HAVING versus WHERE

The HAVING clause serves a similar filtering purpose to the WHERE clause, where it filters data based on a predicate. The key difference between the two is that the WHERE clause filters rows before grouping, whereas the HAVING clause filters whole groups after grouping. So at the group level, the HAVING clause determines whether to keep the group or discard it based on the outcome of the predicate (if true, keep, if false or unknown, discard). As an example, suppose that you need to query the Sales.OrderValues view, and filter only orders that were placed on or after May 1st, 2016. You want to group the remaining orders by employee, and filter only groups having three orders or fewer. For qualifying groups, you want to return the employee ID and the order count. Here, the filter against the orderdate column has to be applied in the WHERE clause because it’s supposed to be treated as a row filter: WHERE orderdate >= '20160501'. Since orderdate is not part of the query’s grouping set, you can’t refer to it in the HAVING clause, unless it is contained within an aggregate function. But then it’s not like you want to filter groups where the outcome of an aggregate against the orderdate column satisfies some condition; you want to filter rows representing orders placed on or after a certain date. Conversely, after grouping, you want to apply a group filter that keeps only employee groups having three or fewer orders. Since this filter involves an aggregate calculation of the count of orders and needs to be applied at the group level, it has to be specified in the HAVING clause:  HAVING COUNT(*) <= 3. Here’s the complete solution query:

SELECT empid, COUNT(*) AS numordersFROM Sales.OrderValuesWHERE orderdate >= '20160501'GROUP BY empidHAVING COUNT(*) <= 3;

 

This query generates the following output:

 

empid       numorders----------- -----------2       24       27       28       3

If you need to apply a filter based on a column that is part of the query’s grouping set, e.g., empid > 0 in our query, you can choose whether to apply it as a row filter in the WHERE clause or as a group filter in the HAVING clause. The outcome is the same. For most people, in such a case it is more natural to apply the filter in the WHERE clause. SQL Server’s optimizer knows this and very likely will create the same plan in both cases.

GROUP BY ALL

T-SQL supports a nonstandard feature called GROUP BY ALL. It’s a pretty interesting feature, but you should be aware that the official documentation for SQL Server has a note advising to refrain from using it since Microsoft is planning to drop support for it at some point. Here I wanted to describe the logical query processing aspects of this feature and provide a recommended alternative.

In essence, the GROUP BY ALL option preserves empty groups that were filtered out by the WHERE clause. Any aggregate that you apply to those groups operates on an empty set. A COUNT(*) aggregate against such groups returns 0. As an example, consider the following query (call it Query 3):

SELECT empid, COUNT(*) AS numordersFROM Sales.OrderValuesWHERE orderdate >= '20160501'GROUP BY ALL empidHAVING COUNT(*) <= 3;

The filter in the WHERE clause eliminates completely all rows for customers 3, 5, 6 nd 9. Earlier you saw that the result of a similar query without the GROUP BY ALL option didn’t return the groups for these customers. But since the GROUP BY ALL option preserves empty groups, the output of this query includes the groups for these customers with a count of 0:

empid       numorders----------- -----------2       23       04       25       06       07       28       39       0

As mentioned, Microsoft announced their intention to drop support for this feature in the future. Here’s the alternative to the previous query that should keep working in the future (call this Query 4):

SELECT empid, COUNT(tokeep) AS numordersFROM Sales.OrderValues  CROSS APPLY ( VALUES( CASE WHEN orderdate >= '20160501' THEN 1 END ) )    AS A(tokeep)GROUP BY empidHAVING COUNT(tokeep) <= 3;

Instead of filtering the rows in the WHERE clause, the query uses a CASE expression based on the same predicate to determine whether to keep the value (tokeep column is 1) or to ignore it (tokeep column is NULL). You then apply your aggregate calculation to the tokeep column. Not only that this solution uses only supported elements that are likely to stay supported, it is also more efficient than the GROUP BY ALL solution. Figure 3 shows the query plans for Query 3 (GROUP BY ALL solution) and Query 4 (solution with CASE expression) using SQL Sentry’s plan explorer.

Figure 3: Plans for Query 3 and Query 4

 

The plan for the GROUP BY ALL solution (Query 3) repeats the scanning of the input data twice—once (the top branch of the plan) to return all rows without applying the WHERE filter along with a constant flag NULL, and a second time (bottom branch of the plan) with applying the WHERE filter along with a constant flag 0. The plan then concatenates the results, groups the rows by empid, computes the aggregates COUNT(flag), and finally applies the HAVING filter. Conversely, the plan for the solution based on the CASE expression (Query 4) scans the input data only once.

Grouping sets

Traditional group queries compute aggregates for a single grouping set (a single set of expressions that you group by). However, sometimes you need to write queries that produce aggregates for multiple grouping sets, usually for reporting purposes. For example, suppose that you want to query the Sales.Orders table, and compute daily, monthly, yearly, and grand order counts. You could achieve this by writing a separate grouped query for each grouping set, and then unify the results, like so:

WITH C AS(  SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders  FROM Sales.Orders    CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) )  AS A(orderyear, ordermonth, orderday)  GROUP BY orderyear, ordermonth, orderday  UNION ALL  SELECT orderyear, ordermonth, NULL AS orderday, COUNT(*) AS numorders  FROM Sales.Orders    CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate)) )  AS A(orderyear, ordermonth)  GROUP BY orderyear, ordermonth  UNION ALL  SELECT orderyear, NULL AS ordermonth, NULL AS orderday, COUNT(*) AS numorders  FROM Sales.Orders    CROSS APPLY ( VALUES(YEAR(orderdate)) ) AS A(orderyear)  GROUP BY orderyear  UNION ALL  SELECT NULL AS orderyear, NULL AS ordermonth, NULL AS orderday, COUNT(*) AS numorders  FROM Sales.Orders)SELECT orderyear, ordermonth, orderday, numordersFROM CORDER BY  CASE WHEN orderyear IS NOT NULL THEN 0 ELSE 1 END, orderyear,  CASE WHEN ordermonth IS NOT NULL THEN 0 ELSE 1 END, ordermonth,  CASE WHEN orderday IS NOT NULL THEN 0 ELSE 1 END, orderday;

Observe that the code uses NULLs as place holders for elements that are not part of the current grouping set but are relevant in other grouping sets. The expressions in the ORDER BY clause guarantee hierarchical presentation order (daily counts followed by month total, monthly counts followed by year total, and finally the grand total). This query generates the following output, shown here in abbreviated form:

orderyear   ordermonth  orderday    numorders----------- ----------- ----------- -----------2014    7       4       12014    7       5       12014    7       8       22014    7       9       12014    7       10      12014    7       11      12014    7       12      12014    7       15      12014    7       16      12014    7       17      12014    7       18      12014    7       19      22014    7       22      12014    7       23      12014    7       24      12014    7       25      12014    7       26      12014    7       29      12014    7       30      12014    7       31      12014    7       NULL    22...2016    4       29      32016    4       30      42016    4       NULL    742016    5       1       32016    5       4       32016    5       5       42016    5       6       42016    5       NULL    142016    NULL    NULL    270NULL    NULL    NULL    830(507 row(s) affected)

SQL (and so does T-SQL) Supports a much more elegant solution for our need using a clause called GROUPING SETS, which you specify in the GROUP BY clause, and in which you list multiple grouping sets, like so:

SELECT orderyear, ordermonth, orderday, COUNT(*) AS numordersFROM Sales.Orders  CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) )    AS A(orderyear, ordermonth, orderday)GROUP BY GROUPING SETS(  (orderyear, ordermonth, orderday),  (orderyear, ordermonth),  (orderyear),  ())ORDER BY  GROUPING(orderyear), orderyear,  GROUPING(ordermonth), ordermonth,  GROUPING(orderday), orderday;

The GROUPING function returns a 0 for an input element if it’s part of the grouping set and 1 if it’s not (when it’s an aggregate). In our query it simplifies the expressions in the ORDER BY clause that guarantee hierarchical presentation ordering. From a logical query processing perspective, the GROUPING SETS clause allows you to define multiple grouping sets, and as a result associate each row returned from the WHERE pahse possibly with multiple groups instead of just one.

Back to the last query, when you need to compute all grouping sets that represent leading combination in a hierarchy, like our time hierarchy, instead of explicitly listing all those grouping sets in the GROUPING SETS clause, you can use an abbreviated syntax with the ROLLUP clause, like so:

SELECT orderyear, ordermonth, orderday, COUNT(*) AS numordersFROM Sales.Orders  CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) )    AS A(orderyear, ordermonth, orderday)GROUP BY ROLLUP(orderyear, ordermonth, orderday)ORDER BY  GROUPING(orderyear), orderyear,  GROUPING(ordermonth), ordermonth,  GROUPING(orderday), orderday;

This ROLLUP clause is equivalent to the previous query’s GROUPING SETS clause. It defines all grouping sets that represent leading combination of the input expressions.

For more information about grouping sets, see Grouping Sets Part 1 and Grouping Sets Part 2.

Conclusion

In this article I covered the logical query processing aspects of the GROUP BY and HAVING clauses. I provided the logical query processing flow chart with the FROM, WHERE, GROUP BY and HAVING clauses. I explained that grouping discards the detail and provided an alternative in the form of windowing, which doesn’t discard the detail. I explained the difference between the WHERE and HAVING predicates. I also covered the proprietary GROUP BY ALL option and provided an alternative that is more efficient and that is likely to be supported in the future, unlike GROUP BY ALL. Finally, I showed how to define multiple grouping sets in a single query using the GROUPSING SETS and ROLLUP clauses. Next month I’ll continue the discussion about logical query processing, covering the SELECT and ORDER BY clauses.

 

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like