Logical Query Processing: The FROM Clause and PIVOT
This month I continue coverage of the FROM clause--specifically, the logical query processing aspects of the PIVOT table operator.
April 12, 2016
This article continues the discussion about logical query processing. In Part 1 I provided an overview of the concept. I then started to cover the FROM clause, which is the first major clause that is evaluated logically. In Part 2 I covered joins and in Part 3 I covered the APPLY table operator. This month I continue coverage of the FROM clause--specifically, the logical query processing aspects of the PIVOT table operator.
I’ll use the same sample database called TSQLV4 that I used in the previous parts. If you don’t have it installed already, you can find the source code that creates it here. Make sure you’re connected to the TSQLV4 database when you run the code samples in this article:
USE TSQLV4;
Pivoting Data
To demonstrate pivoting I’ll use a view called Sales.MyOrders, which you create by running the following code:
IF OBJECT_ID(N'Sales.MyOrders', N'V') IS NOT NULL DROP VIEW Sales.MyOrders;-- In SQL Server 2016 use: DROP VIEW IF EXISTS Sales.MyOrders;GOCREATE VIEW Sales.MyOrdersASSELECT orderid, empid, YEAR(orderdate) AS orderyear, val, qtyFROM Sales.OrderValuesWHERE custid = 5;GOSELECT *FROM Sales.MyOrders;
The query showing the contents of the view returns the output shown in Table 1.
Table 1: Contents of Sales.MyOrders view
orderid empid orderyear val qty-------- ------ ---------- -------- ----10278 8 2014 1488.80 6410280 2 2014 613.20 6210384 3 2014 2222.40 4310444 3 2015 1031.70 6310445 3 2015 174.90 2110524 1 2015 3192.65 8710572 3 2015 1501.09 8710626 1 2015 1503.60 5210654 5 2015 601.83 3810672 9 2015 3815.25 2710689 1 2015 472.50 3510733 1 2015 1459.00 6110778 3 2015 96.50 1010837 9 2016 1064.50 9210857 8 2016 2048.21 7510866 5 2016 1096.20 6710875 4 2016 709.55 6110924 3 2016 1835.70 56
Pivoting data is a common need where you rotate data from a state of rows to a state of columns. It’s often done for reporting purposes, but there are quite a few other use cases, like handling custom aggregates, solving relational division problems, and others. As an example, suppose that you want to query the MyOrders view, returning a row for each employee ID, a column for each order year, and the sum of all order values in each employee and year intersection. Table 2 shows the desired result.
Table 2: Desired result for pivoting task
empid 2014 2015 2016------ -------- -------- --------1 NULL 6627.75 NULL2 613.20 NULL NULL3 2222.40 2804.19 1835.704 NULL NULL 709.555 NULL 601.83 1096.208 1488.80 NULL 2048.219 NULL 3815.25 1064.50
Historically, people used to write grouped queries with CASE expressions to handle such pivoting tasks. Recognizing the common need for pivoting, at some point Microsoft introduced a dedicated proprietary table operator called PIVOT.
The PIVOT Operator
The perspective that Microsoft used when designing the PIVOT operator is quite interesting. Instead of needing a whole query to handle the pivoting task, you use a table operator. A table operator is just a step in the FROM clause. In the PIVOT operator’s case, it’s a step that rotates data from a state of rows into a state of columns. Like any other table operator, the PIVOT operator can operate on the result virtual table produced by preceding table operators like JOIN and APPLY. The outcome of the PIVOT operator is a virtual table by itself, and as such, can be used as the input to subsequent table operators.
Just like with the JOIN and APPLY operators, the logical query processing of the PIVOT operator can be broken down to a sequence of logical steps:
1-P1. Group (grouping element: empid)
1-P2. Spread (spreading element: orderyear IN ([2014], [2015], [2016]))
1-P3. Aggregate (aggregation element: SUM(val))
The leading 1 represents the major step number of the FROM clause, then P stands for PIVOT, and then the ordinals 1, 2 and 3 represent the minor step number within the operator. The following sections describe what happens in each step and the relevance of the associated element.
1-P1. Group
The first step in the PIVOT operator is to group the rows from the input. In our example, the grouping element is the employee ID (empid column), since we want a result row per employee. The grouping element is often referred to as the on rows element, since it defines what each result row represents. Table 3 shows the result of step 1-P1 with our example.
Table 3: Result of step 1-P1
empid orderyear val--------------------------1 2015 3192.651 2015 1503.601 2015 472.501 2015 1459.00--------------------------2 2014 613.20--------------------------3 2014 2222.403 2015 1031.703 2015 174.903 2015 1501.093 2015 96.503 2016 1835.70--------------------------4 2016 709.55--------------------------5 2015 601.835 2016 1096.20--------------------------8 2014 1488.808 2016 2048.219 2015 3815.259 2016 1064.50--------------------------
1-P2. Spread
The second step in the PIVOT operator is to spread the data. The previous grouping step defines what you want on rows; similarly, the current spreading step defines what you want on columns. Hence the spreading element is often referred to as the on cols element. You spread some data element from a single source column (in our case the val column) based on the spreading element (in our case the orderyear column) to multiple target columns (in our case [2014], [2015], [2016]). The part in the PIVOT operator’s syntax that defines the spreading is as follows:
IN (target_columns)
Note that the spreading element has to be an un-manipulated column name. If you need it to be the result of a calculation, use a table expression that defines a column alias for the expression, and use the column alias in the PIVOT operator. As mentioned, in our case the spreading element is the orderyear column from the source view.
As for the target columns, those are based on the distinct values in the spreading column in the source virtual table. Currently, the view MyOrders has three distinct order years: 2014, 2015 and 2016. These values become target column names, and the syntax of the PIVOT operator requires you to refer to them as such. If the values happen to already be regular identifiers, you don’t need to delimit them. If they are irregular identifiers, e.g., if they start with a digit, have a space, are a reserved keyword, you’ll need to delimit them with either the T-SQL specific square brackets, or the standard double quotes. Since order years start with a digit, you will need the delimiters, as in [2014], [2015], [2016].
So in our example, the part of the PIVOT operator that defines spreading is as follows:
orderyear IN ([2014], [2015], [2016])
Note that the PIVOT operator requires you to hard code the target column names. It doesn’t support a subquery that returns the distinct values. If you want a more dynamic solution that doesn’t require you to hard code the values, you will need to construct the PIVOT query as a character string after querying the distinct values from the data and concatenating them, and then use dynamic SQL to execute the result query string. I’ll show such an example later in the article.
Table 4 shows the result of step 1-P2 with our example.
Table 4: Result of step 1-P2
empid orderyear val 2014 2015 2016-----------------------------------------------------1 2015 3192.65 NULL 3192.65 NULL1 2015 1503.60 NULL 1503.60 NULL1 2015 472.50 NULL 472.50 NULL1 2015 1459.00 NULL 1459.00 NULL-----------------------------------------------------2 2014 613.20 613.20 NULL NULL-----------------------------------------------------3 2014 2222.40 2222.40 NULL NULL3 2015 1031.70 NULL 1031.70 NULL3 2015 174.90 NULL 174.90 NULL3 2015 1501.09 NULL 1501.09 NULL3 2015 96.50 NULL 96.50 NULL3 2016 1835.70 NULL NULL 1835.70-----------------------------------------------------4 2016 709.55 NULL NULL 709.55-----------------------------------------------------5 2015 601.83 NULL 601.83 NULL5 2016 1096.20 NULL NULL 1096.20-----------------------------------------------------8 2014 1488.80 1488.80 NULL NULL8 2016 2048.21 NULL NULL 2048.21-----------------------------------------------------9 2015 3815.25 NULL 3815.25 NULL9 2016 1064.50 NULL NULL 1064.50-----------------------------------------------------1-P3. Aggregate
The third step in the PIVOT operator is to aggregate some measure that you want to return in each row-column intersection. In our case the measure that we want to aggregate is the val column and the aggregate function that we want to apply to it is SUM. So we provide the PIVOT operator with the aggregation element: SUM(val). Just like with the spreading element, the aggregation element must be based on an un-manipulated column name. The result of this step is the final result of the PIVOT operator, which in our example is the result shown earlier in Table 2.
Thinking of the elements involved in the different steps (1: group, 2: spread, 3: aggregate), following is the syntax of the PIVOT operator:
SELECT *FROM table_expression PIVOT( 3 FOR 2 ) AS alias;
Curiously, observe that the design of this operator requires you to specify the aggregation and spreading elements, but not the grouping element. The grouping element is implied—it comprises of all columns from the input table that you did not specify in the aggregation and spreading elements. What this means is that you will want to work with a table expression as the input to the PIVOT operator, and within the table expression select only the columns representing the grouping, spreading and aggregation elements and no others. This way you won’t end up with unwanted columns in the grouping element. Working with a table expression also gives you the opportunity to assign aliases to expressions that compute the spreading and aggregation elements, if they need to be results of calculations.
Here’s the complete solution query for our pivoting task using the PIVOT operator:
WITH C AS( SELECT empid, orderyear, val FROM Sales.MyOrders) SELECT *FROM C PIVOT( SUM(val)FOR orderyear IN ([2014], [2015], [2016]) ) AS P;
The result of this query is the desired result shown earlier in Table 2.
Figure 1 illustrates the logical query processing steps involved in the PIVOT operator and shows where the associated elements in our example task appear in the code.
Figure 1: PIVOT operator
Figure 01 - PIVOT operator
Figure 2 shows a more complete picture of the logical query processing of all three table operators that we discussed so far in the series: JOIN, APPLY and PIVOT.
Figure 2: Logical query processing flow chart - the FROM clause
Figure 02 - Logical query processing flow chart - the FROM clause
Examine the execution plan for our solution query in Figure 3, focusing on the properties of the Stream Aggregate operator.
Figure 3: Plan for PIVOT operator
Figure 03 - Plan for PIVOT operator
Observe that the operator correctly identifies the implied grouping element as the empid column. Also observe how the operator handles the spreading and aggregation parts. For each distinct spreading value (order year in our case), it defines a value based on an aggregate function applied to the result of a CASE expression. For example, for the order year 2016, it defines the following value:
Expr1040: Scalar Operator(SUM(CASE WHEN orderyear=(2016) THEN val ELSE NULL END))
Since both columns orderyear and val are actually results of computations in the source view, the plan actually refers to expression names Expr1005 and Expr1004, respectively. I replaced those expression names with the more meaningful column names for clarity.
Explicit Grouped Query
As you’ve seen, the PIVOT operator is designed as a table operator. Another commonly used solution for pivoting is to write an explicit grouped query. You handle the grouping step with a GROUP BY clause, the spreading step with CASE expressions, and the aggregation step with an aggregate function applied to the results of the CASE expressions. Here’s how you would handle our task using this solution:
SELECT empid, SUM(CASE WHEN orderyear = 2014 THEN val END) AS [2014], SUM(CASE WHEN orderyear = 2015 THEN val END) AS [2015], SUM(CASE WHEN orderyear = 2016 THEN val END) AS [2016]FROM Sales.MyOrdersGROUP BY empid;
You realize that the PIVOT operator does pretty much the same thing behind the scenes as you’ve seen in the query plan shown earlier in Figure 3. The main advantage of the PIVOT operator is that you don’t explicitly specify the lengthy CASE expressions. If you submit the queries from the application and you have a large number of spreading values, you pass shorter query strings through the network. However, the explicit grouped query has its own advantages. One, it is standard. Two, it doesn’t require you to prepare a table expression as the input table since no element is defined implicitly. Three, all pivoting elements can be expressions with manipulated columns. Four, it allows you to compute multiple aggregates if needed without adding more scans of the data. With the PIVOT operator you are limited to only one aggregate. For example, the following query computes both the SUM and the AVG aggregates:
SELECT empid, SUM(CASE WHEN orderyear = 2014 THEN val END) AS sum2014, AVG(CASE WHEN orderyear = 2014 THEN val END) AS avg2014, SUM(CASE WHEN orderyear = 2015 THEN val END) AS sum2015, AVG(CASE WHEN orderyear = 2015 THEN val END) AS avg2015, SUM(CASE WHEN orderyear = 2016 THEN val END) AS sum2016, AVG(CASE WHEN orderyear = 2016 THEN val END) AS avg2016FROM Sales.MyOrdersGROUP BY empid;
This query generates the following output:
Table 6: Result of pivoting with multiple aggregates
empid sum2014 avg2014 sum2015 avg2015 sum2016 avg2016------ -------- ------------ -------- ------------ -------- ------------1 NULL NULL 6627.75 1656.937500 NULL NULL2 613.20 613.200000 NULL NULL NULL NULL3 2222.40 2222.400000 2804.19 701.047500 1835.70 1835.7000004 NULL NULL NULL NULL 709.55 709.5500005 NULL NULL 601.83 601.830000 1096.20 1096.2000008 1488.80 1488.800000 NULL NULL 2048.21 2048.2100009 NULL NULL 3815.25 3815.250000 1064.50 1064.500000
Dynamic PIVOT
With both the PIVOT operator and the explicit grouped query, you need to hard code the spreading values. If you need a solution that doesn’t require you to hard code those, you will need to construct the solution query string by concatenating the spreading values that you query from the data, and then execute the result string dynamically.
In our example, the spreading values are order years. The following query returns the distinct years ordered:.
SELECT N',' + QUOTENAME(orderyear)FROM Sales.MyOrdersGROUP BY orderyearORDER BY orderyear;
The code uses the QUOTENAME function to place the years in square brackets to delimit them, and adds a leading comma before each year. This query generates the following output:
--------,[2014],[2015],[2016]
To concatenate the values into one string, add FOR XML PATH('') at the end of the query, and alias the expression as [text()] to return the nodes as text nodes, like so:
SELECT N',' + QUOTENAME(orderyear) AS [text()]FROM Sales.MyOrdersGROUP BY orderyearORDER BY orderyearFOR XML PATH('');
You get the following output:
,[2014],[2015],[2016]
If you have certain special characters in your values, such as '>', the XML representation uses special tags for those, such as '>'. To return the original characters, add the TYPE directive, to the FOR XML option, place the query in parentheses to make it a scalar expression, and using the .value method return the value as NVARCHAR, like so:
SELECT (SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM Sales.MyOrders GROUP BY orderyear ORDER BY orderyear FOR XML PATH(''), TYPE).value('.[1]', 'NVARCHAR(MAX)');
You will need to remove the leading comma. To achieve this use your favorite string function. I like to use the STUFF function since it allows you to delete a requested number of characters from a specified position. Here’s the code:
SELECT STUFF((SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM Sales.MyOrders GROUP BY orderyear ORDER BY orderyear FOR XML PATH(''), TYPE).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '');
This code generates the following output:
[2014],[2015],[2016]
This is the string that you need to place in the IN clause of the PIVOT operator. The parts that appear before and after it are static. The following code shows how to build the complete solution query and then execute it dynamically with the sp_executesql stored procedure:
DECLARE @cols AS NVARCHAR(1000), @sql AS NVARCHAR(4000);SET @cols = STUFF((SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM Sales.MyOrders GROUP BY orderyear ORDER BY orderyear FOR XML PATH(''), TYPE).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '');SET @sql = N'WITH C AS( SELECT empid, orderyear, val FROM Sales.MyOrders) SELECT *FROM C PIVOT( SUM(val)FOR orderyear IN (' + @cols + N') ) AS P;';EXEC sys.sp_executesql @stmt = @sql;
You get the output shown earlier in Table 2. With this solution, you never need to hard code the spreading values, and it will always include all existing ones in the data.
Pivoting Data
This article focused on the logical query processing of the PIVOT operator. I described the three steps involved in the operator: group, spread and aggregate, and their associated elements. I also described the classic alternative using an explicit grouped query and the advantages and disadvantages of each approach. I concluded with a solution for dynamic pivoting. Next month I’ll describe the fourth table operator: UNPIVOT.
About the Author
You May Also Like