Pivoting Data in SQL: Limitations, considerations and solutions
Identify pivoting elements and put those elements into the code
October 22, 2010
Pivoting data is a technique that rotates data from a state of rows to a state of columns, possibly aggregating multiple source values into the same target row and column intersection. Some practical reasons to pivot data include formatting data for reporting purposes, calculating custom aggregations, and solving relational division problems. We’ve covered pivoting in the past—but based on the number of pivoting questions I continue to receive, I think a refresher is in order. Even experienced T-SQL developers have a hard time getting used to the syntax of the native PIVOT and UNPIVOT operators that were introduced in SQL Server 2005; many developers resort to using more traditional syntax even though it tends to be much more verbose.
In this article I discuss standard traditional techniques for pivoting, as well as nonstandard techniques that rely on the newer native PIVOT operator. I also explain the advantages and disadvantages of the PIVOT operator. Finally, I discuss common challenges associated with pivot queries and their solutions.
For sample data I use a database called InsideTSQL2008. You can download the source code to create the sample data from www.InsideTSQL.com/books/source_code/InsideTSQL2008.zip.
Pivoting Phases and a Standard Solution
My first example to demonstrate pivoting involves shipping information from the Sales.Orders table for orders shipped in Spain. The following query returns the source data that you’ll rely on:
USE InsideTSQL2008;SELECT shipperid, shipcity, freightFROM Sales.OrdersWHERE shipcountry = N'Spain'ORDER BY shipperid, shipcity;
Table 1 (below) shows this query’s output.
Suppose that you get a task to return a row for each shipper ID, a column for each shipping city in Spain, and the total freight values for each shipper and city intersection. Table 2 (below) shows the desired result. This task represents a classic pivoting need and is therefore a good example to demonstrate the concept and solutions.
If you want to break a pivoting task into multiple conceptual phases, you can identify three main phases: grouping, spreading, and aggregating. Figure 1 illustrates these phases.
From a user perspective, pivoting involves three main elements: the element that you need on rows (shipperid in our case), the element that you need on columns (shipcity in our case), and the element that you need in the data section, or the intersection of rows and columns (sum of freight in our case).
The first phase illustrated in Figure 1 is the grouping phase. You need to arrange one row in the result for each distinct “on rows” element. In SQL terms, this is called grouping; the element that you need to group by (the on rows element) is shipperid. In the standard traditional solution to pivoting, the grouping phase is implemented simply with a GROUP BY clause, like so:
GROUP BY shipperid
The second phase in pivoting is the spreading phase. You need to spread the values that will later be aggregated (freight in our case) to target columns based on the distinct values that appear in the “on cols” element (shipcity in our case). The distinct ship city values that appear in our sample data for the ship country Spain are Barcelona, Madrid, and Sevilla. In the traditional solution to pivoting, you can achieve the spreading logic using CASE expressions, like so:
CASE WHEN shipcity = N'Barcelona' THEN freight END AS Barcelona,CASE WHEN shipcity = N'Madrid' THEN freight END AS Madrid,CASE WHEN shipcity = N'Sevilla' THEN freight END AS Sevilla
Note that when an ELSE clause isn’t specified in a CASE expression, an ELSE NULL is implied. Because that’s exactly the behavior we need, you can simply omit this part, as I did. Also note that in order to use a solution that’s based on a static query, you have to know ahead of time what the distinct values are in the spreading element, or at least what the maximum number of distinct values is. Later in the article I cover a solution that relies on string concatenation techniques and dynamic SQL to cope with situations in which this information isn’t available.
The third and last phase in pivoting is the aggregating phase. In this phase you aggregate the data values in each target pivoted column. In our example, the aggregate function is SUM and the aggregation element is freight. In the traditional solution to pivoting, you simply apply the aggregate function to the results of the CASE expressions, like so:
SUM(CASE WHEN shipcity = N'Barcelona' THEN freight END) AS Barcelona,SUM(CASE WHEN shipcity = N'Madrid' THEN freight END) AS Madrid,SUM(CASE WHEN shipcity = N'Sevilla' THEN freight END) AS Sevilla
Here’s the complete solution implementing all three phases:
SELECT shipperid, SUM(CASE WHEN shipcity = N'Barcelona' THEN freight END) AS Barcelona, SUM(CASE WHEN shipcity = N'Madrid' THEN freight END) AS Madrid, SUM(CASE WHEN shipcity = N'Sevilla' THEN freight END) AS SevillaFROM Sales.OrdersWHERE shipcountry = N'Spain'GROUP BY shipperid;
You can consider this solution as a template. For any pivoting task you get, identify the three elements involved and put them in the right places in this template. To recap, in our example those elements are:
Grouping: shipperid
Spreading: shipcity IN (Barcelona, Madrid, Sevilla)
Aggregation: SUM(freight)
As for optimization, the general indexing guidelines are to create an index with a key-list starting with the grouping element, followed by the spreading element, and including the aggregation element. If the query contains an additional filter—for example, to support a pivoting page field in the application—make it the leading column in the key-list. In our case the shipcountry column serves such a filtering purpose. Following these guidelines, here’s the optimal index to support our pivoting solution:
CREATE INDEX idx_nc_country_sid_city_i_freight ON Sales.Orders(shipcountry, shipperid, shipcity) INCLUDE(freight);
Figure 2 shows the plan for the solution query. The plan performs a seek operation based on the shipcountry filter in our index, followed by a partial ordered scan of the qualifying rows. The plan then calculates the CASE expressions in the Compute Scalar operator. Finally, the plan handles the aggregation logic using a stream aggregate operator that relies on the ordered scan of the index. This plan is very efficient, scanning only the qualifying rows, only once, and in an order that supports an efficient aggregate.
As an aside, standard SQL has a certain feature that was probably designed to support pivoting needs, although SQL Server doesn’t include this feature—it’s a filtering clause called FILTER that’s available to aggregate functions, allowing the function to operate only on a subset of rows in the group instead of all rows. Pivoting data using the FILTER clause would look like this (don’t try to run this code, because SQL Server doesn’t support it):
SELECT shipperid, SUM(freight) FILTER (WHERE shipcity = N'Barcelona') AS Barcelona, SUM(freight) FILTER (WHERE shipcity = N'Madrid' ) AS Madrid, SUM(freight) FILTER (WHERE shipcity = N'Sevilla' ) AS SevillaFROM Sales.OrdersWHERE shipcountry = N'Spain'GROUP BY shipperid;
But as I showed earlier, it’s very easy to implement similar logic with CASE expressions.
Pivoting with the Native PIVOT Operator
In the previous section, I covered the traditional solution to pivoting, using only standard constructs. One of the downsides of this technique is that it’s verbose. When you have a large number of distinct values in the spreading element, the code can get very long. SQL Server 2005 introduced a native nonstandard table operator called PIVOT that enables less verbose solutions. Very much like the JOIN table operator, you specify PIVOT in the FROM clause of a query, and the output of this table operator is a table result. The syntax of the PIVOT operator is
FROM PIVOT( () FOR IN () ) AS
In this syntax you can identify two out of the three elements that are supposed to be involved in pivoting: the aggregation function and element, and the spreading element and the distinct spreading values. The tricky part is that the grouping element is implicitly defined by elimination—the grouping element is all columns from besides those mentioned as either the spreading or aggregation element. Therefore, you should refrain from querying underlying tables directly, and always prepare a table expression in the form of a CTE or derived table that contains only the three elements that are supposed to be involved in pivoting. This way, the right grouping element that you need will be defined by elimination. With this in mind, here’s a query using the native PIVOT operator that returns total freight values for each shipper and city for orders shipped in Spain:
WITH PivotInput AS ( SELECT shipperid, shipcity, freight FROM Sales.Orders WHERE shipcountry = N'Spain')SELECT *FROM PivotInput PIVOT( SUM(freight) FOR shipcity IN (Barcelona, Madrid, Sevilla) ) AS PivotOutput;
Note that because the city names happen to be strings that are considered valid regular identifiers in our case, you don’t need to delimit them. But if the values were such that they weren’t considered regular identifiers (e.g., if they started with a digit), you would need to delimit them using either square brackets or double quotes. For example, if the spreading values represented order years, you’d use the form
FOR orderyear IN (\[2006\],\[2007\],\[2008)
As I mentioned, even experienced T-SQL developers have a hard time getting used to this syntax—but all you need to do is identify the three elements involved in pivoting (grouping, spreading, aggregating) and put those elements in the right places in the template.
Figure 3 shows the plan for the PIVOT query. As you can see, this plan is very similar to that of the standard solution—so much so that if you look at the properties of the Aggregate operator, under Defined Values, you’ll find that SQL Server constructed CASE expressions behind the scenes:
… \[Expr1022\] = Scalar Operator(SUM(CASE WHEN \[InsideTSQL2008\].\[Sales\].\[Orders\].\[shipcity\]=N'Barcelona' THEN \[InsideTSQL2008\].\[Sales\].\[Orders\].\[freight\] ELSE NULL END)),…
With this in mind, you shouldn’t expect the solution that’s based on the PIVOT operator to perform better than the standard solution. The main benefit in the PIVOT operator at the moment is that it’s less verbose. In fact, there are several ways in which the PIVOT operator is more limited than the standard method. I describe these limitations, as well as workarounds, in the following sections.
Pivoting Challenges
Certain common pivoting needs create challenges because of syntax limitations or other reasons. One such challenge occurs when the spreading element doesn't originally exist in the source data. For example, given the sample data used earlier in the article, suppose that you need to return for each customer a column for each of its order IDs (e.g., for the five most recent orders). The grouping element (“on rows” element) is custid. The aggregation element (data element) is orderid. But there’s no spreading element that’s common across customers by which you can spread the order IDs.
The solution is to assign a row number to the orders, partitioned by custid, and ordered by whatever order that you want to return those values. For example, if you want to return the order IDs from most recent to least recent, you can order by orderdate DESC, orderid DESC. As I mentioned earlier, in a static pivot solution, you need to decide on the maximum number of spreading values that you want to support (e.g., in our case, the five most recent orders).
The transition of order IDs from source to target rows is a one-to-one transition, so the use of an aggregate isn’t really intuitive. Still, the only way for you to return the data element is using an aggregate function, because pivoting relies on grouped data. So you simply need to pick an artificial aggregate that will return the only applicable orderid value (if at all) for each customer and row number. Both MIN and MAX are aggregate functions that will work well for this need, because they support many types; and in case there’s only one non-NULL value, they simply return that value.
Here’s the complete solution returning the order IDs of the five most recent orders for each customer, for orders shipped in Spain:
WITH PivotInput AS ( SELECT custid, orderid, ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate DESC, orderid DESC) AS rownum FROM Sales.Orders WHERE shipcountry = N'Spain') SELECT custid, \[1\],\[2\],\[3\],\[4\],\[5\], CASE WHEN \[6\] IS NOT NULL THEN 'Y' ELSE 'N' END AS moreordersFROM PivotInput PIVOT(MAX(orderid) FOR rownum IN(\[1\],\[2\],\[3\],\[4\],\[5\],\[6\])) AS PivotOutput;
This solution also indicates whether the customer has additional orders beyond the five most recent ones. Table 3 (below) shows this query’s output.
Limitations and Solutions
The PIVOT operator has a few limitations that aren’t applicable to the standard pivoting solution. One limitation is that the PIVOT operator doesn't support expressions in the aggregation and spreading elements directly, whereas the standard pivoting solution does. For example, suppose you want to aggregate a discounted freight value defined by the expression freight * 0.9. With the standard pivoting solution you can simply specify the expression where you normally specify the aggregation element after the THEN clause of the CASE expressions, like so:
SUM(CASE WHEN shipcity = N'Barcelona' THEN freight * 0.9 END) AS Barcelona
But if you try to specify an expression in the aggregation element of the PIVOT operator—for example, SUM(freight * 0.9)—you get an error. The workaround is to assign column aliases to calculations in a table expression and use those aliases in the PIVOT operator. For example, the following query shows how to aggregate a discounted freight:
WITH PivotInput AS( SELECT shipperid, shipcity, freight * 0.9 AS discountedfreight FROM Sales.Orders WHERE shipcountry = N'Spain')SELECT *FROM PivotInput PIVOT( SUM(discountedfreight) FOR shipcity IN (Barcelona, Madrid, Sevilla) ) AS PivotOutput;
Another limitation is that the PIVOT operator doesn't support the COUNT(*) aggregate. One workaround is to apply the COUNT aggregate to a non-NULLable column—and if there’s a column, it makes sense to use the same one that you use as the spreading element (provided that it’s a non-NULLable column). This way, you can more easily cover the query with an index. Here’s an example returning the count of orders for each shipper and city, for orders shipped in Spain:
WITH PivotInput AS( SELECT shipperid, shipcity FROM Sales.Orders WHERE shipcountry = N'Spain')SELECT *FROM PivotInput PIVOT( COUNT(shipcity) FOR shipcity IN (Barcelona, Madrid, Sevilla) ) AS PivotOutput;
Another workaround is to apply the aggregate to a column generated by a constant expression, like so:
WITH PivotInput AS( SELECT shipperid, shipcity, 1 as const FROM Sales.Orders WHERE shipcountry = N'Spain')SELECT *FROM PivotInput PIVOT( COUNT(const) FOR shipcity IN (Barcelona, Madrid, Sevilla) ) AS PivotOutput;
A third limitation is that the PIVOT operator doesn't support multiple spreading elements, whereas the standard technique does. For example, suppose you want to spread the data by both shipped year and ship city. With the standard method you simply specify two predicates in the CASE expression, like so:
SUM(CASE WHEN YEAR(shippeddate) = 2006 AND shipcity = N'Barcelona' THEN freight END) AS Barcelona
The workaround using the PIVOT operator is to concatenate the multiple spreading elements into one element in the table expression that you’re preparing as the input to the PIVOT operator. Then, specify the alias of the concatenated values as the spreading element, and the distinct concatenated values in the IN clause. Here’s an example:
WITH PivotInput AS( SELECT shipperid, CAST(YEAR(shippeddate) AS NCHAR(4)) + N'_' + shipcity AS year_city, freight FROM Sales.Orders WHERE shipcountry = N'Spain')SELECT *FROM PivotInput PIVOT( SUM(freight) FOR year_city IN ( \[2006_Barcelona\], \[2006_Madrid\], \[2006_Sevilla\], \[2007_Barcelona\], \[2007_Madrid\], \[2007_Sevilla\], \[2008_Barcelona\], \[2008_Madrid\], \[2008_Sevilla\] ) ) AS PivotOutput;
Table 4 (below) shows this query’s output.
Yet another limitation is related to the spreading element when using the PIVOT operator. Remember that the spreading values that you specify in the IN clause are column values in the source and become attribute names in the target. Hence they must follow the requirements for identifiers, which means they can’t exceed 128 characters. There’s no way to alias the target column names directly in the IN clause; instead, you can alias them in the query’s SELECT list—but that’s too late. With the standard syntax for pivoting, you specify the spreading values in the filters of the CASE expressions, and you assign any alias that you like directly in the expression, so the source values aren’t limited to 128 characters.
The last limitation I want to cover is that the PIVOT operator doesn't support multiple aggregates, whereas the standard technique does. For example, suppose you need to calculate both the sum and average freight for each shipper by city. Using the standard method, you simply add more expressions to the SELECT list, like so:
SUM(CASE WHEN shipcity = N'Barcelona' THEN freight END) AS SUM_Barcelona,...AVG(CASE WHEN shipcity = N'Barcelona' THEN freight END) AS AVG_Barcelona,...
The workaround using the PIVOT operator is to use a separate operator for each aggregate and join the results, like so:
WITH PivotInput AS( SELECT shipperid, shipcity, freight FROM Sales.Orders WHERE shipcountry = N'Spain')SELECT P1.shipperid, P1.Barcelona as SUM_Barcelona, P1.Madrid as SUM_Madrid, P1.Sevilla as SUM_Sevilla, P2.Barcelona as AVG_Barcelona, P2.Madrid as AVG_Madrid, P2.Sevilla as AVG_SevillaFROM PivotInput PIVOT( SUM(freight) FOR shipcity IN (Barcelona, Madrid, Sevilla) ) AS P1 JOIN PivotInput PIVOT( AVG(freight) FOR shipcity IN (Barcelona, Madrid, Sevilla) ) AS P2 ON P1.shipperid = P2.shipperid;
However, this method doesn’t perform as well as the standard method because the data is scanned separately for each aggregate.
Dynamic Pivoting
A final challenge or limitation of the static solutions to pivoting is such a commonly asked request that it deserves its own section. With static solutions to pivoting, you have to know ahead of time what the unique spreading values are—or at least the maximum number. For example, when the spreading element is the ship city, you need to know which cities currently exist in the data. At the moment, for orders shipped in Spain, the Sales.Orders table contains orders shipped in the cities Barcelona, Madrid, and Sevilla. But the set of distinct cities could change—for example, because of new orders. To come up with a pivoting solution that dynamically adjusts the spreading elements based on the existing data, you need to use dynamic SQL.
First, you need a technique that performs string concatenation, so that you concatenate the distinct ship cities you query at run time to one list that you’ll use later in the PIVOT operator’s IN clause. An efficient method to perform string concatenation is to use the FOR XML PATH option with an empty string as input, indicating that you’re not interested in a root node, and return the values as text nodes with no tags. For example, here’s the code you’d use to concatenate the distinct Spain ship cities:
SELECT STUFF( (SELECT N',' + QUOTENAME(shipcity) FROM (SELECT DISTINCT shipcity FROM Sales.Orders WHERE shipcountry = N'Spain') AS Cities ORDER BY shipcity FOR XML PATH(''), TYPE).value('.\[1\]', 'NVARCHAR(MAX)'), 1, 1, N'')
The innermost query returns the distinct ship cities in Spain. The outer query uses the QUOTENAME function to turn the city names into identifiers, adding square brackets and a comma in front of each city. The FOR XML PATH('') option concatenates the elements into one string, and finally, the STUFF function is used to get rid of the unnecessary first comma. The result of this code is the string
\[Barcelona\],\[Madrid\],\[Sevilla\]
Listing 1 contains the definition of a stored procedure called PivotShipping that accepts a shipping country name as input, constructs the complete PIVOT query dynamically, including the technique shown to concatenate the distinct ship city names, then executes it using sp_executesql.
Here’s an example of invoking the procedure with Spain as the input city, producing the output in Table 2:
EXEC dbo.PivotShipping @country = N'Spain';
And here’s an example of invoking the procedure with UK as the shipping country, returning the output in Table 5:
EXEC dbo.PivotShipping @country = N'UK';>
Easier than You Thought
Although SQL pivoting techniques have been around for a while, developers still ask a lot of questions in classes and public forums about how to achieve pivoting. Pivoting really isn’t very difficult—once you get the idea, it just entails identifying pivoting elements in a given request and putting those elements in the correct places in the code. If you’ve found the PIVOT operator intimidating in the past, I hope this article helps make you feel more comfortable with it. Next month I’ll cover techniques to unpivot data.
Table 1: Source Data from Sales.Orders
shipperid | shipcity | freight |
---|---|---|
1 | Barcelona | 18.69 |
1 | Barcelona | 1.36 |
1 | Madrid | 16.16 |
1 | Madrid | 32.99 |
1 | Madrid | 2.94 |
1 | Madrid | 12.69 |
1 | Sevilla | 38.19 |
1 | Sevilla | 3.20 |
1 | Sevilla | 59.11 |
2 | Barcelona | 10.14 |
2 | Madrid | 77.92 |
2 | Madrid | 8.29 |
2 | Madrid | 97.09 |
2 | Sevilla | 175.32 |
2 | Sevilla | 19.58 |
2 | Sevilla | 107.83 |
2 | Sevilla | 51.87 |
3 | Barcelona | 1.25 |
3 | Barcelona | 6.54 |
3 | Madrid | 7.56 |
3 | Sevilla | 4.32 |
3 | Sevilla | 85.46 |
3 | Sevilla | 23.39 |
Table 2: Total Freights for Each Shipper and City
shipperid | Barcelona | Madrid | Sevilla |
---|---|---|---|
1 | 20.05 | 64.78 | 100.50 |
2 | 10.14 | 183.30 | 354.60 |
3 | 7.79 | 7.56 | 113.17 |
Table 3: Most Recent Order IDs for Each Customer for Orders Shipped in Spain
custid | 1 | 2 | 3 | 4 | 5 | moreorders |
---|---|---|---|---|---|---|
8 | 10970 | 10801 | 10326 | NULL | NULL | N |
29 | 10928 | 10887 | 10568 | 10426 | 10366 | N |
30 | 11037 | 11009 | 10948 | 10911 | 10888 | Y |
69 | 11013 | 10917 | 10306 | 10282 | 10281 | N |
Table 4: Total Freights for Each Shipper and Year_City
shipperid | 2006_Barcelona | 2006_Madrid | 2006_Sevilla | 2007_Barcelona | 2007_Madrid | 2007_Sevilla | 2008_Barcelona | 2008_Madrid | 2008_Sevilla |
---|---|---|---|---|---|---|---|---|---|
1 | NULL | 15.63 | NULL | 18.69 | NULL | NULL | 1.36 | 49.15 | 100.50 |
2 | 10.14 | 77.92 | 107.83 | NULL | 97.09 | NULL | NULL | 8.29 | 246.77 |
3 | NULL | 7.56 | NULL | 6.54 | NULL | 89.78 | 1.25 | NULL | 23.39 |
Table 5: Shipping Information for Cities in the UK
shipperid | Colchester | Cowes | London |
---|---|---|---|
1 | 41.95 | 157.05 | 536.88 |
2 | 358.54 | 136.08 | 739.06 |
3 | 71.46 | 70.52 | 842.73 |
About the Author
You May Also Like