Logical Query Processing: The FROM Clause and PIVOTLogical Query Processing: The FROM Clause and PIVOT

This month I continue coverage of the FROM clause--specifically, the logical query processing aspects of the PIVOT table operator.

Itzik Ben-Gan

April 12, 2016

13 Min Read
The FROM Clause and PIVOT

This article continues the discussion about logical query processing. In Part 1 I provided an overview of the concept. I then started to cover the FROM clause, which is the first major clause that is evaluated logically. In Part 2 I covered joins and in Part 3 I covered the APPLY table operator. This month I continue coverage of the FROM clause--specifically, the logical query processing aspects of the PIVOT table operator.

I’ll use the same sample database called TSQLV4 that I used in the previous parts. If you don’t have it installed already, you can find the source code that creates it here. Make sure you’re connected to the TSQLV4 database when you run the code samples in this article:

USE TSQLV4;

Pivoting Data

To demonstrate pivoting I’ll use a view called Sales.MyOrders, which you create by running the following code:

IF OBJECT_ID(N'Sales.MyOrders', N'V') IS NOT NULL DROP VIEW Sales.MyOrders;-- In SQL Server 2016 use: DROP VIEW IF EXISTS Sales.MyOrders;GOCREATE VIEW Sales.MyOrdersASSELECT orderid, empid, YEAR(orderdate) AS orderyear, val, qtyFROM Sales.OrderValuesWHERE custid = 5;GOSELECT *FROM Sales.MyOrders;

 

The query showing the contents of the view returns the output shown in Table 1.

Table 1: Contents of Sales.MyOrders view

orderid  empid  orderyear  val      qty-------- ------ ---------- -------- ----10278    8      2014       1488.80  6410280    2      2014       613.20   6210384    3      2014       2222.40  4310444    3      2015       1031.70  6310445    3      2015       174.90   2110524    1      2015       3192.65  8710572    3      2015       1501.09  8710626    1      2015       1503.60  5210654    5      2015       601.83   3810672    9      2015       3815.25  2710689    1      2015       472.50   3510733    1      2015       1459.00  6110778    3      2015       96.50    1010837    9      2016       1064.50  9210857    8      2016       2048.21  7510866    5      2016       1096.20  6710875    4      2016       709.55   6110924    3      2016       1835.70  56

Pivoting data is a common need where you rotate data from a state of rows to a state of columns. It’s often done for reporting purposes, but there are quite a few other use cases, like handling custom aggregates, solving relational division problems, and others. As an example, suppose that you want to query the MyOrders view, returning a row for each employee ID, a column for each order year, and the sum of all order values in each employee and year intersection. Table 2 shows the desired result.

Table 2: Desired result for pivoting task

empid  2014     2015     2016------ -------- -------- --------1      NULL     6627.75  NULL2      613.20   NULL     NULL3      2222.40  2804.19  1835.704      NULL     NULL     709.555      NULL     601.83   1096.208      1488.80  NULL     2048.219      NULL     3815.25  1064.50

Historically, people used to write grouped queries with CASE expressions to handle such pivoting tasks. Recognizing the common need for pivoting, at some point Microsoft introduced a dedicated proprietary table operator called PIVOT.

The PIVOT Operator

The perspective that Microsoft used when designing the PIVOT operator is quite interesting. Instead of needing a whole query to handle the pivoting task, you use a table operator. A table operator is just a step in the FROM clause. In the PIVOT operator’s case, it’s a step that rotates data from a state of rows into a state of columns. Like any other table operator, the PIVOT operator can operate on the result virtual table produced by preceding table operators like JOIN and APPLY. The outcome of the PIVOT operator is a virtual table by itself, and as such, can be used as the input to subsequent table operators.

Just like with the JOIN and APPLY operators, the logical query processing of the PIVOT operator can be broken down to a sequence of logical steps:

1-P1. Group (grouping element: empid)

1-P2. Spread (spreading element: orderyear IN ([2014], [2015], [2016]))

1-P3. Aggregate (aggregation element: SUM(val))

The leading 1 represents the major step number of the FROM clause, then P stands for PIVOT, and then the ordinals 1, 2 and 3 represent the minor step number within the operator. The following sections describe what happens in each step and the relevance of the associated element.

1-P1. Group

The first step in the PIVOT operator is to group the rows from the input. In our example, the grouping element is the employee ID (empid column), since we want a result row per employee. The grouping element is often referred to as the on rows element, since it defines what each result row represents. Table 3 shows the result of step 1-P1 with our example.

Table 3: Result of step 1-P1

empid  orderyear  val--------------------------1      2015       3192.651      2015       1503.601      2015       472.501      2015       1459.00--------------------------2      2014       613.20--------------------------3      2014       2222.403      2015       1031.703      2015       174.903      2015       1501.093      2015       96.503      2016       1835.70--------------------------4      2016       709.55--------------------------5      2015       601.835      2016       1096.20--------------------------8      2014       1488.808      2016       2048.219      2015       3815.259      2016       1064.50--------------------------

1-P2. Spread

The second step in the PIVOT operator is to spread the data. The previous grouping step defines what you want on rows; similarly, the current spreading step defines what you want on columns. Hence the spreading element is often referred to as the on cols element. You spread some data element from a single source column (in our case the val column) based on the spreading element (in our case the orderyear column) to multiple target columns (in our case [2014], [2015], [2016]). The part in the PIVOT operator’s syntax that defines the spreading is as follows:

 IN (target_columns)

Note that the spreading element has to be an un-manipulated column name. If you need it to be the result of a calculation, use a table expression that defines a column alias for the expression, and use the column alias in the PIVOT operator. As mentioned, in our case the spreading element is the orderyear column from the source view.

As for the target columns, those are based on the distinct values in the spreading column in the source virtual table. Currently, the view MyOrders has three distinct order years: 2014, 2015 and 2016. These values become target column names, and the syntax of the PIVOT operator requires you to refer to them as such. If the values happen to already be regular identifiers, you don’t need to delimit them. If they are irregular identifiers, e.g., if they start with a digit, have a space, are a reserved keyword, you’ll need to delimit them with either the T-SQL specific square brackets, or the standard double quotes. Since order years start with a digit, you will need the delimiters, as in [2014], [2015], [2016].

So in our example, the part of the PIVOT operator that defines spreading is as follows:

orderyear IN ([2014], [2015], [2016])

Note that the PIVOT operator requires you to hard code the target column names. It doesn’t support a subquery that returns the distinct values. If you want a more dynamic solution that doesn’t require you to hard code the values, you will need to construct the PIVOT query as a character string after querying the distinct values from the data and concatenating them, and then use dynamic SQL to execute the result query string. I’ll show such an example later in the article.

Table 4 shows the result of step 1-P2 with our example.

Table 4: Result of step 1-P2

empid  orderyear  val      2014     2015     2016-----------------------------------------------------1      2015       3192.65  NULL     3192.65  NULL1      2015       1503.60  NULL     1503.60  NULL1      2015       472.50   NULL     472.50   NULL1      2015       1459.00  NULL     1459.00  NULL-----------------------------------------------------2      2014       613.20   613.20   NULL     NULL-----------------------------------------------------3      2014       2222.40  2222.40  NULL     NULL3      2015       1031.70  NULL     1031.70  NULL3      2015       174.90   NULL     174.90   NULL3      2015       1501.09  NULL     1501.09  NULL3      2015       96.50    NULL     96.50    NULL3      2016       1835.70  NULL     NULL     1835.70-----------------------------------------------------4      2016       709.55   NULL     NULL     709.55-----------------------------------------------------5      2015       601.83   NULL     601.83   NULL5      2016       1096.20  NULL     NULL     1096.20-----------------------------------------------------8      2014       1488.80  1488.80  NULL     NULL8      2016       2048.21  NULL     NULL     2048.21-----------------------------------------------------9      2015       3815.25  NULL     3815.25  NULL9      2016       1064.50  NULL     NULL     1064.50-----------------------------------------------------1-P3. Aggregate

The third step in the PIVOT operator is to aggregate some measure that you want to return in each row-column intersection. In our case the measure that we want to aggregate is the val column and the aggregate function that we want to apply to it is SUM. So we provide the PIVOT operator with the aggregation element: SUM(val). Just like with the spreading element, the aggregation element must be based on an un-manipulated column name. The result of this step is the final result of the PIVOT operator, which in our example is the result shown earlier in Table 2.

Thinking of the elements involved in the different steps (1: group, 2: spread, 3: aggregate), following is the syntax of the PIVOT operator:

SELECT *FROM table_expression  PIVOT( 3 FOR 2 ) AS alias;

Curiously, observe that the design of this operator requires you to specify the aggregation and spreading elements, but not the grouping element. The grouping element is implied—it comprises of all columns from the input table that you did not specify in the aggregation and spreading elements. What this means is that you will want to work with a table expression as the input to the PIVOT operator, and within the table expression select only the columns representing the grouping, spreading and aggregation elements and no others. This way you won’t end up with unwanted columns in the grouping element. Working with a table expression also gives you the opportunity to assign aliases to expressions that compute the spreading and aggregation elements, if they need to be results of calculations.

Here’s the complete solution query for our pivoting task using the PIVOT operator:

WITH C AS(  SELECT empid, orderyear, val  FROM Sales.MyOrders) SELECT *FROM C  PIVOT( SUM(val)FOR orderyear IN ([2014], [2015], [2016]) ) AS P;

The result of this query is the desired result shown earlier in Table 2.

Figure 1 illustrates the logical query processing steps involved in the PIVOT operator and shows where the associated elements in our example task appear in the code.

Figure 1: PIVOT operator

Figure 01 - PIVOT operator.jpg

Figure 01 - PIVOT operator

Figure 2 shows a more complete picture of the logical query processing of all three table operators that we discussed so far in the series: JOIN, APPLY and PIVOT.

Figure 2: Logical query processing flow chart - the FROM clause

Figure 02 - Logical query processing flow chart - the FROM clause.jpg

Figure 02 - Logical query processing flow chart - the FROM clause

Examine the execution plan for our solution query in Figure 3, focusing on the properties of the Stream Aggregate operator.

Figure 3: Plan for PIVOT operator

Figure 03 - Plan for PIVOT operator.jpg

Figure 03 - Plan for PIVOT operator

Observe that the operator correctly identifies the implied grouping element as the empid column. Also observe how the operator handles the spreading and aggregation parts. For each distinct spreading value (order year in our case), it defines a value based on an aggregate function applied to the result of a CASE expression. For example, for the order year 2016, it defines the following value:

Expr1040: Scalar Operator(SUM(CASE WHEN orderyear=(2016) THEN val ELSE NULL END))

Since both columns orderyear and val are actually results of computations in the source view, the plan actually refers to expression names Expr1005 and Expr1004, respectively. I replaced those expression names with the more meaningful column names for clarity.

Explicit Grouped Query

As you’ve seen, the PIVOT operator is designed as a table operator. Another commonly used solution for pivoting is to write an explicit grouped query. You handle the grouping step with a GROUP BY clause, the spreading step with CASE expressions, and the aggregation step with an aggregate function applied to the results of the CASE expressions. Here’s how you would handle our task using this solution:

SELECT empid,  SUM(CASE WHEN orderyear = 2014 THEN val END) AS [2014],  SUM(CASE WHEN orderyear = 2015 THEN val END) AS [2015],  SUM(CASE WHEN orderyear = 2016 THEN val END) AS [2016]FROM Sales.MyOrdersGROUP BY empid;

You realize that the PIVOT operator does pretty much the same thing behind the scenes as you’ve seen in the query plan shown earlier in Figure 3. The main advantage of the PIVOT operator is that you don’t explicitly specify the lengthy CASE expressions. If you submit the queries from the application and you have a large number of spreading values, you pass shorter query strings through the network. However, the explicit grouped query has its own advantages. One, it is standard. Two, it doesn’t require you to prepare a table expression as the input table since no element is defined implicitly. Three, all pivoting elements can be expressions with manipulated columns. Four, it allows you to compute multiple aggregates if needed without adding more scans of the data. With the PIVOT operator you are limited to only one aggregate. For example, the following query computes both the SUM and the AVG aggregates:

SELECT empid,  SUM(CASE WHEN orderyear = 2014 THEN val END) AS sum2014,  AVG(CASE WHEN orderyear = 2014 THEN val END) AS avg2014,  SUM(CASE WHEN orderyear = 2015 THEN val END) AS sum2015,  AVG(CASE WHEN orderyear = 2015 THEN val END) AS avg2015,  SUM(CASE WHEN orderyear = 2016 THEN val END) AS sum2016,  AVG(CASE WHEN orderyear = 2016 THEN val END) AS avg2016FROM Sales.MyOrdersGROUP BY empid;

This query generates the following output:

Table 6: Result of pivoting with multiple aggregates

empid  sum2014  avg2014      sum2015  avg2015      sum2016  avg2016------ -------- ------------ -------- ------------ -------- ------------1      NULL     NULL     6627.75  1656.937500  NULL     NULL2      613.20   613.200000   NULL     NULL     NULL     NULL3      2222.40  2222.400000  2804.19  701.047500   1835.70  1835.7000004      NULL     NULL     NULL     NULL     709.55   709.5500005      NULL     NULL     601.83   601.830000   1096.20  1096.2000008      1488.80  1488.800000  NULL     NULL     2048.21  2048.2100009      NULL     NULL     3815.25  3815.250000  1064.50  1064.500000

Dynamic PIVOT

With both the PIVOT operator and the explicit grouped query, you need to hard code the spreading values. If you need a solution that doesn’t require you to hard code those, you will need to construct the solution query string by concatenating the spreading values that you query from the data, and then execute the result string dynamically.

In our example, the spreading values are order years. The following query returns the distinct years ordered:.

SELECT N',' + QUOTENAME(orderyear)FROM Sales.MyOrdersGROUP BY orderyearORDER BY orderyear;

The code uses the QUOTENAME function to place the years in square brackets to delimit them, and adds a leading comma before each year. This query generates the following output:

--------,[2014],[2015],[2016]

To concatenate the values into one string, add FOR XML PATH('') at the end of the query, and alias the expression as [text()] to return the nodes as text nodes, like so:

SELECT N',' + QUOTENAME(orderyear) AS [text()]FROM Sales.MyOrdersGROUP BY orderyearORDER BY orderyearFOR XML PATH('');

You get the following output:

,[2014],[2015],[2016]

If you have certain special characters in your values, such as '>', the XML representation uses special tags for those, such as '>'. To return the original characters, add the TYPE directive, to the FOR XML option, place the query in parentheses to make it a scalar expression, and using the .value method return the value as NVARCHAR, like so:

SELECT  (SELECT N',' + QUOTENAME(orderyear) AS [text()]   FROM Sales.MyOrders   GROUP BY orderyear   ORDER BY orderyear   FOR XML PATH(''), TYPE).value('.[1]', 'NVARCHAR(MAX)');

You will need to remove the leading comma. To achieve this use your favorite string function. I like to use the STUFF function since it allows you to delete a requested number of characters from a specified position. Here’s the code:

SELECT  STUFF((SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM Sales.MyOrders GROUP BY orderyear ORDER BY orderyear FOR XML PATH(''), TYPE).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '');

This code generates the following output:

[2014],[2015],[2016]

This is the string that you need to place in the IN clause of the PIVOT operator. The parts that appear before and after it are static. The following code shows how to build the complete solution query and then execute it dynamically with the sp_executesql stored procedure:

DECLARE  @cols AS NVARCHAR(1000),  @sql  AS NVARCHAR(4000);SET @cols =  STUFF((SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM Sales.MyOrders GROUP BY orderyear ORDER BY orderyear FOR XML PATH(''), TYPE).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '');SET @sql = N'WITH C AS(  SELECT empid, orderyear, val  FROM Sales.MyOrders) SELECT *FROM C  PIVOT( SUM(val)FOR orderyear IN (' + @cols + N') ) AS P;';EXEC sys.sp_executesql @stmt = @sql;

You get the output shown earlier in Table 2. With this solution, you never need to hard code the spreading values, and it will always include all existing ones in the data.

Pivoting Data

This article focused on the logical query processing of the PIVOT operator. I described the three steps involved in the operator: group, spread and aggregate, and their associated elements. I also described the classic alternative using an explicit grouped query and the advantages and disadvantages of each approach. I concluded with a solution for dynamic pivoting. Next month I’ll describe the fourth table operator: UNPIVOT.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like