SQL Server: Avoiding a Sort with Descending OrderSQL Server: Avoiding a Sort with Descending Order
In this month's column I focus on queries that need to process data in descending order.
January 19, 2018
Last month I started a two-part series on avoiding an explicit sort with queries that need to process data in order. I described the Merge Join (Concatenation) operator, which is an order-reliant and order-preserving operator that unifies two ordered inputs. This month I focus on queries that need to process data in descending order. I’ll describe cases where SQL Server’s optimizer doesn’t realize that it can rely on index order, and provide workarounds to fix the problem. In my examples I’ll use the TSQLV4 sample database, as I did last month.
Presentation Order
As you know, B-tree-based indexes on disk-based tables have a doubly linked list, and can support both ordered-forward scans (or range scans) and ordered-backward ones. You can tell when looking at Index Scan and Index Seek operators whether the operator is required to return the rows in index key order by checking the Ordered property. It will say True for ordered scans and False when the operator isn’t required to return the data in index key order. The Scan Direction property tells you what’s the scan direction—FORWARD or BACKWARD. I’ll demonstrate this with a simple presentation ordering example before I start showing misbehaving queries.
The following query (call it query 1), returns orders, ordered by orderid, ascending:
USE TSQLV4; -- http://tsql.solidq.com/SampleDatabases/TSQLV4.zip
SELECT orderid, orderdate, custid
FROM Sales.Orders
ORDER BY orderid;
There’s a clustered index on the table with orderid (ascending) as the key; hence, the plan uses an ordered-forward scan of the index, as shown in the query plan in Figure 1.
Figure 1 – Plan for query 1
Similarly, the index can be scanned in an ordered-backward fashion to retrieve the data in descending order. The following query (call it query 2) demonstrates this:
SELECT orderid, orderdate, custid
FROM Sales.Orders
ORDER BY orderid DESC;
The plan for this query is shown in Figure 2.
Figure 2 – Plan for query 2
One difference between ordered-forward and ordered-backward scans is that the former can potentially use parallelism, whereas the latter doesn’t currently have an implementation with parallelism in the storage engine. Other than that, in both cases there is no need for an actual sort operation since the data is retrieved preordered from the index. An ordered index scan scales linearly. An actual sort scans in an extra-linear fashion--more specifically, in an n log n fashion. Therefore, especially when dealing with a very large number of rows, it’s good to be able to avoid explicit sorting by pulling the data preordered from an index.
Ordered set function STRING_AGG
There are a number of cases where the optimizer doesn’t realize that it can rely on index order, and that, unfortunately, results in explicit sorting. It’s especially common when the data needs to be processed in descending order within groups, or window partitions. I’ll start with an example that uses ascending order, then show what happens when using descending order. Suppose that you want to use the STRING_AGG function to return for each customer a comma separated list of orderid values, ordered by orderdate ascending and orderid ascending (as a tiebreaker). You don’t have any presentation ordering requirements, so you’re not planning to incorporate a presentation ORDER BY clause in the query. In hope to enable an order-based Stream Aggregate optimization without the need for explicit sorting, you create the following supporting index:
CREATE INDEX idx_cid_od_oid ON Sales.Orders(custid, orderdate, orderid);
You then use the following query (call it query 3) to achieve the task:
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate, orderid ) AS custorders
FROM Sales.Orders
GROUP BY custid;
This query generates the following output, shown here in abbreviated form:
custid custorders
----------- ------------------------------------------
1 10643,10692,10702,10835,10952,11011
2 10308,10625,10759,10926
3 10365,10507,10535,10573,10677,10682,10856
...
The plan for this query is shown in Figure 3.
Figure 3 – Plan for query 3
Observe the plan shows an ordered-forward scan of the index, followed by the order-reliant Stream Aggregate operator, and no explicit sort operation. So far, so good. However, suppose you needed to concatenate the orderid values based on orderdate descending, orderid descending ordering. You issue the following query (call it query 4) to achieve this:
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid;
This query generates the following output:
custid custorders
----------- ------------------------------------------
1 11011,10952,10835,10702,10692,10643
2 10926,10759,10625,10308
3 10856,10682,10677,10573,10535,10507,10365
...
The plan for this query is shown in Figure 4.
Figure 4 – Plan for query 4
Remember, there’s no presentation ordering requirement in the query. So, theoretically, there shouldn’t be a problem for the optimizer to use an ordered-backward scan to emit the rows in the order that the Stream Aggregate operator needs. But apparently the optimizer doesn’t realize this, and ends up applying explicit sorting.
One possible solution is to use a descending index. Create an index on the key list (custid, orderdate DESC, orderid DESC), like so:
CREATE INDEX idx_cid_odD_oidD ON Sales.Orders(custid, orderdate DESC, orderid DESC);
Rerun the query (call it query 5):
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid;
You get the following output:
custid custorders
----------- ------------------------------------------
1 11011,10952,10835,10702,10692,10643
2 10926,10759,10625,10308
3 10856,10682,10677,10573,10535,10507,10365
...
The plan for this run of the query is shown in Figure 5.
Figure 5 – Plan for query 5
Notice that the index is scanned in an ordered-forward fashion, and no explicit sorting is needed. This could be a satisfactory solution if you always need to return the values in descending order. But what if you sometimes need to return them ascending and sometimes descending? It seems like an unnecessary evil to have to create two separate indexes to support both options. As it turns out, there is an odd, yet simple, workaround. You simply need to add a presentation ORDER BY clause requesting to present the result in custid descending order. This request causes the optimize to realize it can collapse both ordering needs—the within group order and presentation order—and, most importantly, rely on index order for both. To demonstrate this, first run the following code to drop the descending index:
DROP INDEX idx_cid_odD_oidD ON Sales.Orders;
Then rerun the query (call it query 6):
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid
ORDER BY custid DESC;
This execution generates the following output (observe customer IDs returned in descending order):
custid custorders
----------- ------------------------------------------------------------------------------------
91 11044,10998,10906,10870,10792,10611,10374
90 11005,10910,10879,10873,10695,10673,10615
89 11066,11032,10904,10861,10740,10723,10696,10693,10596,10504,10483,10469,10344,10269
...
The plan for this execution is shown in Figure 6:
Figure 6 – Plan for query 6
The ascending index is scanned in an ordered forward fashion and there’s no need for explicit sorting!
Window Functions
A situation very similar to the one I described with grouped queries with ordered set functions exists with window functions with window partitioning, and window ordering in descending order. First, I’ll demonstrate that the problem doesn’t exist when the window function doesn’t involve a window partition clause.
The following query (call it query 7) computes row numbers based on orderid ordering, ascending:
SELECT orderid, orderdate, custid,
ROW_NUMBER() OVER(ORDER BY orderid) AS rownum
FROM Sales.Orders;
This query generates the following output:
orderid orderdate custid rownum
----------- ---------- ----------- --------------------
10248 2014-07-04 85 1
10249 2014-07-05 79 2
10250 2014-07-08 34 3
10251 2014-07-08 84 4
10252 2014-07-09 76 5
...
The plan for this query is shown in Figure 7.
Figure 7 – Plan for query 7
There’s a clustered index on the table defined with orderid, ascending, as the key. This plan uses an ordered-forward scan in the index to provide the window function with the rows in the order that it needs for the computation, and therefore there’s no need for explicit sorting.
Next, run the following query (call it query 8), asking for the row numbers to be computed based on orderid descending ordering:
SELECT orderid, orderdate, custid,
ROW_NUMBER() OVER(ORDER BY orderid DESC) AS rownum
FROM Sales.Orders;
This query generates the following output:
orderid orderdate custid rownum
----------- ---------- ----------- --------------------
11077 2016-05-06 65 1
11076 2016-05-06 9 2
11075 2016-05-06 68 3
11074 2016-05-06 73 4
11073 2016-05-05 58 5
...
The plan for this query is shown in Figure 8.
Figure 8 – Plan for query 8
Here, as well, the plan scans the clustered index in order to avoid explicit sorting, only this time it uses an ordered-backward scan.
Unfortunately, the optimizer doesn’t consider an ordered-backward scan when the window function has a window partition clause, and you need to use descending order. To demonstrate this, suppose that you need to compute row numbers, partitioned by custid and ordered by orderdate, orderid (we’ll start with an example with ascending order). The following index is ideal for this computation (same index as the one used in the section about grouped ordered set functions):
CREATE INDEX idx_cid_od_oid ON Sales.Orders(custid, orderdate, orderid);
You use the following query (call it query 9) to achieve the desired result:
SELECT custid, orderdate, orderid,
ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate, orderid) AS rownum
FROM Sales.Orders;
This query generates the following output:
custid orderdate orderid rownum
----------- ---------- ----------- --------------------
1 2015-08-25 10643 1
1 2015-10-03 10692 2
1 2015-10-13 10702 3
1 2016-01-15 10835 4
1 2016-03-16 10952 5
1 2016-04-09 11011 6
2 2014-09-18 10308 1
2 2015-08-08 10625 2
2 2015-11-28 10759 3
2 2016-03-04 10926 4
...
The plan for this query is shown in Figure 9.
Figure 9 – Plan for query 9
As you can see, the plan performs an ordered-forward scan of the index, and therefore no explicit sorting is needed. Next, run the following query (call it query 10), this time requesting row numbers to be computed in descending order:
SELECT custid, orderdate, orderid,
ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate DESC, orderid DESC) AS rownum
FROM Sales.Orders;
This query generates the following output:
custid orderdate orderid rownum
----------- ---------- ----------- --------------------
1 2016-04-09 11011 1
1 2016-03-16 10952 2
1 2016-01-15 10835 3
1 2015-10-13 10702 4
1 2015-10-03 10692 5
1 2015-08-25 10643 6
2 2016-03-04 10926 1
2 2015-11-28 10759 2
2 2015-08-08 10625 3
2 2014-09-18 10308 4
...
The plan for this query is shown in Figure 10.
Figure 10 – Plan for query 10
You realize that theoretically the optimizer could have used an ordered-backward scan of the index since the query doesn’t include any presentation ordering requirement. But it doesn’t. It uses an unordered scan followed by explicit sorting.
Just like with the grouped ordered set function example from the previous section, one option is to create a descending index. But this means that if you need to compute row numbers sometimes in ascending and sometimes in descending order, you will need two indexes. Instead, you can use the same trick like with the grouped query with the STRING_AGG function where you added a presentation ORDER BY clause to the query, requesting the rows to be presented based on custid descending order. Then, magically, the optimizer will be able to rely on index order for both purposes. Here’s the trick applied to our query (call it query 11):
SELECT custid, orderdate, orderid,
ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate DESC, orderid DESC) AS rownum
FROM Sales.Orders
ORDER BY custid DESC;
This query generates the following output:
custid orderdate orderid rownum
----------- ---------- ----------- --------------------
91 2016-04-23 11044 1
91 2016-04-03 10998 2
91 2016-02-25 10906 3
91 2016-02-04 10870 4
91 2015-12-23 10792 5
91 2015-07-25 10611 6
91 2014-12-05 10374 7
90 2016-04-07 11005 1
90 2016-02-26 10910 2
90 2016-02-10 10879 3
90 2016-02-06 10873 4
90 2015-10-07 10695 5
90 2015-09-18 10673 6
90 2015-07-30 10615 7
...
The plan for this query is shown in Figure 11:
Figure 11 – Plan for query 11
The plan performs an ordered-backward index scan of the index and this way avoids explicit sorting.
Used in Table Expressions
Using the presentation ORDER BY trick to avoid sorting works when you query the table directly. However, what if you need to define a table expression like a view, based on a grouped ordered set function or partitioned window function, with descending within group or window order, respectively? Consider the following view as an example:
CREATE OR ALTER VIEW Sales.CustOrders
AS
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid;
GO
A user querying the view may be completely unaware of any performance issues related to explicit sorting. A user who needs to query the data but doesn’t care about presentation order will likely query the view without a presentation ORDER BY clause, like so (call it query 12):
SELECT * FROM Sales.CustOrders;
The plan for this query is shown in Figure 12.
Figure 12 – Plan for query 12
Observe the explicit sorting that takes place in the plan. Naturally, you could create a descending index to avoid the sort, but what if you’d rather have just one index with ascending order, and want a fix that will enable relying on index order for both ascending and descending calculation order? If you control the queries against the view, you could just add a presentation ORDER BY clause with descending order in the query against the view, like so (call it query 13):
SELECT * FROM Sales.CustOrders ORDER BY custid DESC;
After SQL Server inlines the inner query, you basically get the equivalent of the previously shown query 6. The plan for this query is shown in Figure 13.
Figure 13 – Plan for query 13
As you can see, the plan uses an ordered-backward scan of the index and this way avoids explicit sorting.
But what if you’re not the one writing the queries against the view, and unaware users regularly query it without a presentation ORDER BY clause? You want a solution as part of the view definition itself. Your first attempt might be to incorporate an ORDER BY clause in the view’s inner query, like so:
CREATE OR ALTER VIEW Sales.CustOrders
AS
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid
ORDER BY custid DESC;
GO
But this attempt fails with an error saying that an ORDER BY clause is not supported in the view’s inner query:
Msg 1033, Level 15, State 1, Procedure CustOrders, Line 9 [Batch Start Line 240]
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
The reason that an ORDER BY isn’t supported is that the view is supposed to represent a relation, and a relation is unordered. As the error message says, there are exceptions to this restriction when the ORDER BY serves a purpose other than presentation order, such as in supporting a TOP or OFFSET-FETCH filter. With this in mind, you might attempt to use a TOP (100) PERCENT filter, just to allow the ORDER BY clause, like so:
CREATE OR ALTER VIEW Sales.CustOrders
AS
SELECT TOP (100) PERCENT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid
ORDER BY custid DESC;
GO
This time SQL Server allows the creation of the view. You then query the view without a presentation ORDER BY clause:
SELECT * FROM Sales.CustOrders;
But you get the same plan as the one shown earlier in Figure 13 with an explicit Sort operator. The reason is that because the outer query doesn’t have an ORDER BY clause, and the inner query requests 100 percent of the rows, SQL Server optimizes the TOP and ORDER BY combination out. That’s because as long as the outer query doesn’t have a presentation ORDER BY clause, it doesn’t need to guarantee presentation order. One possible workaround is to use TOP with a very large number of rows, e.g. a trillion, like so:
CREATE OR ALTER VIEW Sales.CustOrders
AS
SELECT TOP (1000000000000) custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid
ORDER BY custid DESC;
GO
Query the view again without an ORDER BY clause (call it query 14):
SELECT * FROM Sales.CustOrders;
The plan for this query is shown in Figure 14:
Figure 14 – Plan for query 14
This time, as you can see, the index is scanned in an ordered-backward fashion, and hence explicit sorting is avoided. Another workaround is to use OFFSET 0 ROWS, like so:
CREATE OR ALTER VIEW Sales.CustOrders
AS
SELECT custid,
STRING_AGG( CAST(orderid AS VARCHAR(20)), ',' )
WITHIN GROUP( ORDER BY orderdate DESC, orderid DESC ) AS custorders
FROM Sales.Orders
GROUP BY custid
ORDER BY custid DESC
OFFSET 0 ROWS;
GO
Using OFFSET 0 ROWS and no FETCH clause means return everything.
Query the view again without an ORDER BY clause:
SELECT * FROM Sales.CustOrders;
You get the same plan as the one in Figure 14 with an ordered-backward index scan and no sorting.
Of course, there’s always the chance that in the future Microsoft will optimize out the last two techniques that currently prevent the sorting, but of course you’ll still get the correct result. Hopefully, by the time this happens, if it ever happens, they will have also fixed the bug that involves the unnecessary sorting!
Summary
Query optimization is a dynamic and complex area. There are cases where you intuitively expect the optimizer to avoid certain work, but the optimizer isn’t always aware of such opportunity. However, often there are workarounds that you can apply to get the optimal treatment. In this two part-series I started last month I covered the Merge Join (Concatenation) operator and queries that need to process data in groups or window partitions in descending order. I covered cases where normally the optimizer doesn’t realize the potential to avoid explicit sorting, and offered some workarounds you can use to make it see the light.
About the Author
You May Also Like