Optimization Tips for Multiple Range Predicates, Part 3

Vector-based range predicates

Itzik Ben-Gan

August 11, 2014

14 Min Read
optimize

Query filters that involve multiple range predicates introduce several optimization challenges. In "Optimization Tips for Multiple Range Predicates, Part 1," and "Optimization Tips for Multiple Range Predicates, Part 2," I discussed challenges with cardinality estimates and indexing, and I provided optimization tips. In this article I focus on vector-based range predicates. In vector-based range predicates, I refer to conceptually combining multiple elements into a vector and applying an inequality-based comparison between two such vectors—more specifically, between a vector of columns and a vector of inputs.

For example, consider the vector of columns (col1, col2, . . ., coln) and the vector of inputs (@c1, @c2, . . ., @cn). Suppose you want to filter only the rows in which the vector of columns is greater than the vector of inputs. Conceptually, you want your query to apply the following filter:

WHERE (col1, col2, ..., coln) > (@c1, @c2, ..., @cn)

Standard SQL actually supports such a syntax as part of its support for row constructors (aka vector expressions). To logically understand which rows from the table are considered qualifying rows, think of the rows sorted by the vector of columns, and place the vector of inputs in the right sort position. The qualifying rows are the ones that appear immediately after that sort position.

T-SQL doesn't support the standard syntax for applying vector comparisons, so if you need vector comparisons, you need to figure out alternatives. As it turns out, a couple of alternatives exist, and they get optimized differently.

In this article I provide a practical scenario in which vector-based comparisons are needed. I describe two ways to express the predicates, I examine their optimization, and I provide a recommendation for which one to use.

In my examples I'll use a 1,000,000-row table called Orders from a database called Performance. You can download the source code to create and populate the Performance database.

As a practical example in which you might need to apply vector-based comparisons, suppose that you need to implement a solution for paging through order information. You need to develop a stored procedure that returns the next page of rows based on the paging sort order. As inputs, the stored procedure accepts the values of the elements in the sort vector from the last row in the previous page. The input sort vector is like an anchor vector after which you need the next page of rows.

I'll start with a simple example based on a single ordering column and then cover examples with multiple ordering columns.

Paging Based on a Single Ordering Column

As an example for paging based on a single sort column, you need to develop a stored procedure called GetPage that returns one page of rows at a time, ordered by the orderid column. As input, you pass the sort key of the last row from the previous page that you already obtained. Assuming the order IDs used in the system are always positive, a default value 0 is used for the input anchor order ID. So when an input value isn't provided, the procedure will return the first page of rows in which the order IDs are greater than 0. You also pass a parameter with the page size, using 25 as the default. The query should return the columns orderid and filler. The latter is just a 200-byte value used as a filler representing multiple columns that you would typically need to return.

The optimal index to support your paging solution is one defined on the sort column orderid as the key and includes the filler column. Run the following code to create the supporting index:

SET NOCOUNT ON;USE Performance;CREATE UNIQUE INDEX idx_oid_i_filler  ON dbo.Orders(orderid) INCLUDE(filler);

Here's the code implementing the stored procedure GetPage based on the previously mentioned requirements:

IF OBJECT_ID(N'dbo.GetPage', N'P') IS NOT NULL  DROP PROC dbo.GetPage;GOCREATE PROC dbo.GetPage  @anc_oid AS INT = 0,  @pagesize AS BIGINT = 25ASSELECT TOP (@pagesize) orderid, fillerFROM dbo.OrdersWHERE orderid > @anc_oidORDER BY orderid;GO

Run the following code to enable I/O statistics:

SET STATISTICS IO ON;

For the first page request, you call the procedure without passing an anchor order ID, like so:

EXEC dbo.GetPage;

You get back the first 25 rows (the default page size). The sort key in the last row is order ID 25. For the next page request, you pass 25 as the input anchor order ID, like so:

EXEC dbo.GetPage @anc_oid = 25;

And for the third page request, you pass as input the order ID 50 because it's the order ID from the last row in the second page:

EXEC dbo.GetPage @anc_oid = 50;

Examine the plan for the procedure's query, which Figure 1 shows.

Plan for Query Using a Single Ordering Column

This plan is optimal. It performs a seek in the covering index, and the seek predicate is the same as the query predicate. This means that the seek reaches the first row in the index leaf that satisfies the predicate, and then the range scan in the leaf scans exactly the 25 qualifying rows. There's no scanning of nonqualifying rows. The total number of logical reads used for the execution of this plan is three.

Therefore, when there's only one ordering column involved, the solution is very simple, and the plan is optimal. However, the situation with multiple sort columns is more complex.

Paging Based on Multiple Ordering Columns

Suppose that the ordering in your paging scenario was based on multiple columns. For example, suppose it was based on the vector (shipperid, orderid). The query is supposed to return in the output the columns shipperid, ordered, and filler. The optimal index to support your solution should have the ordering columns as the index key-list and the rest as included columns. Run the following code to create the optimal index:

CREATE UNIQUE INDEX idx_sid_oid_i_filler  ON dbo.Orders(shipperid, orderid)  INCLUDE(filler);

If row constructors were supported in T-SQL, you could use the following filter:

WHERE (shipperid, orderid) > (@anc_sid, @anc_oid)

But unfortunately they aren't. There are a couple of logically equivalent alternatives. Here's the GetPage procedure definition using one of the alternatives:

IF OBJECT_ID(N'dbo.GetPage', N'P') IS NOT NULL  DROP PROC dbo.GetPage;GOCREATE PROC dbo.GetPage  @anc_sid AS VARCHAR(5) = '',  @anc_oid AS INT = 0,  @pagesize AS BIGINT = 25ASSELECT TOP (@pagesize) shipperid, orderid, fillerFROM dbo.OrdersWHERE shipperid >= @anc_sid  AND (shipperid > @anc_sid OR orderid > @anc_oid)ORDER BY shipperid, orderid;GO

Run the following code to get the first page:

EXEC dbo.GetPage;

When I ran this code on my system I got the output that Figure 2 shows.

shipperid orderid     filler--------- ----------- -------A     9       aA     14      aA     23      a...A     127     aA     128     aA     131     a

Your output will likely be different because the code that populated the Orders table uses randomization to compute some of the values. In my output the values of the shipper ID and order ID in the last row were A and 131, respectively, so in order to get the second page, I passed those as inputs, like so:

EXEC dbo.GetPage @anc_sid = 'A', @anc_oid = 131;

I got the output that Figure 3 shows.

shipperid orderid     filler--------- ----------- -------A     132     aA     140     aA     142     a...A     249     aA     250     aA     253     a

Similarly, to get the third page I passed the values from the last row in the second page, like so:

EXEC dbo.GetPage @anc_sid = 'A', @anc_oid = 253;

I got the output that Figure 4 shows.

shipperid orderid     filler--------- ----------- -------A     255     aA     258     aA     263     a...A     395     aA     399     aA     401     a

As for optimization, Figure 5 shows the plan for the query.

Plan for Query Using Two Ordering Columns, Version 1

Observe that the seek predicate is shipperid >= @anc_sid. The rest of the predicates (shipperid > @anc_sid OR orderid > @anc_oid) are residual predicates. The seek operation reaches the first row in the index leaf that satisfies the seek predicate, and then a range scan ensues until 25 rows that satisfy the residual predicates are found. The farther the page you request is within the same shipperid value, the more nonqualifying rows the range scan needs to access before 25 qualifying rows are found.

The shipperid column in our table is quite dense (20 percent since there are five distinct values), so the farther the users get with their paging, the more unnecessary rows need to be scanned. With the first three page requests the consequences aren't so bad; I got four logical reads for the third page request. But if you get much farther than that, you pay much more.

As an example, let's first find the maximum order ID for shipper A by running the following query:

SELECT MAX(orderid) AS maxoidFROM dbo.OrdersWHERE shipperid = 'A';

On my system I got the output 999999. Now run the following code requesting the next page of rows:

EXEC dbo.GetPage @anc_sid = 'A', @anc_oid = 999999;

This time all rows for shipper A were scanned before the first qualifying row was found. This resulted in 4,573 logical reads on my system. Arguably, in reality paging sessions don't tend to get too far, but remember that paging is just a scenario I use here for illustration purposes. The point is to discuss the efficiency of vector-based range predicates for any purpose—not just paging. You might have a scenario in which you do need to get far with the anchor sort vector.

As it turns out, there's a logically equivalent alternative that gets more efficient treatment by the optimizer. Instead of using the form:

shipperid >= @anc_sid AND (shipperid > @anc_sid OR orderid > @anc_oid)

Use the form:

(shipperid = @anc_sid AND orderid > @anc_oid) OR (shipperid > @anc_sid)

Run the following code to alter the procedure definition using the new form of predicates:

ALTER PROC dbo.GetPage  @anc_sid AS VARCHAR(5) = '',  @anc_oid AS INT = 0,  @pagesize AS BIGINT = 25ASSELECT TOP (@pagesize) shipperid, orderid, fillerFROM dbo.OrdersWHERE (shipperid = @anc_sid AND orderid > @anc_oid)   OR (shipperid > @anc_sid);GO

Run the following code to get the first page:

EXEC dbo.GetPage;

Figure 6 shows the plan for this code.

Plan for Query Using Two Ordering Columns, Version 2

Observe that the optimizer broke the predicates in the filter into two disjunctive (OR'd) seek predicates, each representing only qualifying rows:

  1. Prefix: shipperid = @anc_sid, Start: orderid > @anc_oid

  2. Start: orderid > @anc_oid

Per seek, the start of the range scan is with the first match for Prefix and Start and the end is with the last match for Prefix (or edge of index leaf if Prefix isn't present). When there's more than one seek predicate, such as in our case, they're processed in order.

When a Top operator is the node requesting the rows, a short-circuit takes place as soon as the requested number of rows is returned (25 in our case). You can check the scan count measure in the output of STATSITICS IO to see whether a short circuit occurred before the second seek was applied. Don't let the term scan in this measure confuse you. What it really represents is the access count, regardless of the access method used. When only one seek predicate is used, this measure will report 1; when both are used, it will report 2. Either way, the cost is negligible. For example, for the first page request I got a scan count of 2 and a number of logical reads of 6. This makes sense because the index has three levels and therefore the cost of each seek should be 3 reads.

Here's code to request the second page (again, the order ID in your case will likely be different):

EXEC dbo.GetPage @anc_sid = 'A', @anc_oid = 131;

I got a scan count of 1 and a number of logical reads of 4. The extra read beyond the expected three reads for the seek was likely because the qualifying range spanned an extra page.

Here's code to request the third page:

EXEC dbo.GetPage @anc_sid = 'A', @anc_oid = 253;

I got a scan count of 1 and a number of logical reads of 3.

Now try requesting the same far page you requested with the previous solution (on my system it was with the anchor order ID 999999):

EXEC dbo.GetPage @anc_sid = 'A', @anc_oid = 999999;

I got a scan count of 2 and a logical reads count of 6. With this solution the number of reads will always remain very small, unlike in the previous solution.

Paging Using OFFSET-FETCH

It's interesting to compare the previous examples of anchor-based paging solutions with one based on the OFFSET-FETCH filter. Currently this filter doesn't support an anchor concept; instead, you provide it with the number of rows you want it to skip (OFFSET value) and the number of rows you want it to filter (FETCH value). So you can implement a paging solution in which you pass the page number and page size as inputs and have the OFFSET-FETCH filter use computations based on the inputs to tell it how many rows to skip and filter. Here's an example for such a solution:

IF OBJECT_ID(N'dbo.GetPage', N'P') IS NOT NULL  DROP PROC dbo.GetPage;GOCREATE PROC dbo.GetPage  @pagenum AS BIGINT = 1,  @pagesize AS BIGINT = 25ASSELECT shipperid, orderid, fillerFROM dbo.OrdersORDER BY shipperid, orderidOFFSET (@pagenum - 1) * @pagesize ROWS FETCH NEXT @pagesize ROWS ONLY;GO

The following code requests the first page:

EXEC dbo.GetPage @pagenum = 1;

Similarly, you can request the second, third, and other pages. Figure 7 shows the plan for the query.

Plan for Query Using OFFSET-FETCH

There's no magical way for SQL Server to know to jump to the first qualifying row. The Top operator is the one requesting the rows from the Index Scan operator to its right. The Top operator first requests OffsetExpression rows and discards them (0 for the first page, 25 for the second, 50 for the third, and so on). It then requests Top Expression rows and passes those to the operator to its left (the root node SELECT in our case). The farther you get with the page number, the more data you scan. Worse, if the index wasn't a covering one, SQL Server would have applied OffsetExpression + Top Expression lookups—not just Top Expression times.

I got three logical reads reported for the first page request. But try executing the procedure with a far page, such as 20,000:

EXEC dbo.GetPage @pagenum = 20000; -- logical reads 11400

I got 11,400 logical reads, and that's when the index is a covering one with no lookups involved.

What could be interesting is if the OFFSET-FETCH feature were extended in the future to support an option of an anchor-based sort vector as the starting point. It could be even more powerful if it supported returning the elements of the sort vector from the last row as output parameters. If both ideas were implemented, your stored procedure might have looked like this (don't run this code, because the syntax isn't supported):

CREATE PROC dbo.GetPage  @anc_sid AS VARCHAR(5) = '',  @anc_oid AS INT = 0,  @pagesize AS BIGINT = 25,  @last_sid AS VARCHAR(5) OUTPUT,  @last_oid AS INT OUTPUTASSELECT shipperid, orderid, fillerFROM dbo.OrdersORDER BY shipperid, orderidOFFSET AFTER (@anc_sid, @anc_oid)FETCH NEXT @pagesize ROWS ONLYLAST ROW INTO (@last_sid, @last_oid);GO

Such a syntax would have lent itself to optimization with a single seek in a supporting index straight to the first qualifying row. Just a thought.

Optimal Treatment

It appears that when using multiple predicates in your query filters, getting optimal treatment by the SQL Server optimizer isn't a trivial thing. The situation becomes even more complex when the predicates are range predicates. This three-part series of articles explored the challenges involving multiple range predicates and provided tips and guidelines for getting as optimal treatment as possible.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like