Logical Query Processing Part 6: The WHERE Clause

The first major step in logical query processing handles the FROM clause. The second major step handles the optional WHERE clause.

Itzik Ben-Gan

September 15, 2016

21 Min Read
Logical Query Processing Part 6: The WHERE Clause

Logical query processing describes the conceptual interpretation of SQL queries. This article is the sixth part in a series on the topic. Part 1 provided an overview of the concept and a sample database called TSQLV4. It also provided two sample queries which I referred to as simple sample query and complex sample query. Parts 2, 3, 4 and 5 covered the logical query processing aspects of the table operators JOIN, APPLY, PIVOT and UNPIVOT, respectively—all of which are processed as part of the first major clause FROM. In this article I describe the logical query processing aspects of the second major clause—the WHERE clause.

Logical query processing flow chart including WHERE clause

Recall that the first major step in logical query processing handles the FROM clause. The second major step handles the optional WHERE clause. Figure 1 provides a graphical depiction of the first two major steps in logical query processing.

Figure 1: Logical query processing flow chart - FROM and WHERE

Figure 01 - Logical query processing flow chart - the WHERE clause.jpg

Figure 01 - Logical query processing flow chart - the WHERE clause

As its input, the second step operates on the virtual table that is returned by the first step. It applies a filter based on the predicate that appears in the WHERE clause. For each input row, the outcome of the predicate can be true, false or unknown (when a NULL is involved), and the second step returns only the rows for which the predicate evaluates to true. It discards the rows for which the predicate evaluates to false or unknown.

The following code has the FROM and WHERE parts of our simple sample query:

SELECT C.custid, O.orderidFROM Sales.Customers AS C  LEFT OUTER JOIN Sales.Orders AS OON C.custid = O.custidWHERE C.country = N'Spain';

The result of the first step before applying the WHERE clause is a virtual table with 832 rows. After applying the filter predicate in the WHERE clause, the result is a virtual table with the following 24 rows:

custid      orderid----------- -----------8       103268       108018       1097022      NULL29      1036629      1042629      1056829      1088729      1092830      1030330      1055030      1062930      1087230      1087430      1088830      1091130      1094830      1100930      1103769      1028169      1028269      1030669      1091769      11013

The following code has the FROM and WHERE parts of our complex sample query

The result of the first step before applying the WHERE clause is a virtual table with 701 rows. After applying the filter predicate in the WHERE clause, the result is a virtual table with the following 27 rows:

custid  custlocation    orderid  productid   val------- --------------- -------- ----------- -------------8       Spain.Madrid    10970    52      224.000000022      Spain.Madrid    NULL     NULL    NULL57      France.Paris    NULL     NULL    NULL69      Spain.Madrid    10917    30      25.890000069      Spain.Madrid    10917    60      340.000000069      Spain.Madrid    11013    23      90.000000069      Spain.Madrid    11013    42      56.000000069      Spain.Madrid    11013    45      190.000000069      Spain.Madrid    11013    68      25.000000074      France.Paris    10907    75      108.500000074      France.Paris    10964    18      375.000000074      France.Paris    10964    38      1317.500000074      France.Paris    10964    69      360.000000074      France.Paris    11043    11      210.000000089      USA.WA.Seattle  10861    17      1638.000000089      USA.WA.Seattle  10861    18      1250.000000089      USA.WA.Seattle  10861    21      400.000000089      USA.WA.Seattle  10861    33      87.500000089      USA.WA.Seattle  10861    62      147.900000089      USA.WA.Seattle  10904    58      198.750000089      USA.WA.Seattle  10904    62      1725.500000089      USA.WA.Seattle  11032    36      665.000000089      USA.WA.Seattle  11032    38      6587.500000089      USA.WA.Seattle  11032    59      1650.000000089      USA.WA.Seattle  11066    16      52.350000089      USA.WA.Seattle  11066    19      386.400000089      USA.WA.Seattle  11066    34      490.0000000(27 row(s) affected)

WHERE for filtering, ON for matching

A common source for confusion in SQL is figuring out whether a predicate that you need to use in your query belongs in a join’s ON clause or in the WHERE clause. This can be confusing even for people with years of experience writing SQL queries. Let’s start with the fact that with inner joins, at least as far as standard SQL is concerned, both ON and WHERE serve the same filtering purpose; i.e., in both cases if for a given row the predicate evaluates to true, the row is returned, otherwise (if the predicate evaluates to false or unknown) the row is discarded. Consider the following query:

SELECT C.custid, O.orderid, O.orderdateFROM Sales.Customers AS C  INNER JOIN Sales.Orders AS OON C.custid = O.custidWHERE C.country = N'Spain'  AND O.orderdate >= '20160101';

Here, since the join is an inner join, the logical meaning of the query is the same irrespective of whether you specify any of the three predicates in the ON clause or the WHERE clause. All predicates are considered filtering predicates. This query generates the following output:

custid      orderid     orderdate----------- ----------- ----------30      10872       2016-02-0530      10874       2016-02-0629      10887       2016-02-1330      10888       2016-02-1630      10911       2016-02-2669      10917       2016-03-0229      10928       2016-03-0530      10948       2016-03-138       10970       2016-03-2430      11009       2016-04-0869      11013       2016-04-0930      11037       2016-04-21(12 row(s) affected)

From a logical query processing perspective, the WHERE clause is evaluated after the FROM clause with all of its table operators like the join in our example. But from a physical query processing perspective, SQL Server can apply certain rearrangements as long as the query meaning (final result set) is preserved. For example, SQL Server will often apply what’s called predicate pushdown where instead of processing the predicates from the WHERE clause that are applied to a single table after the join, it processes them before the join. For one, if there are supporting indexes, they can be utilized to handle the filters, and even if not, reducing the input sets before the join leaves the join with less work to do. Figure 2 shows the execution plan for our query.

Figure 2: Predicate pushdown

Figure 02 - Predicate pushdown.jpg

Figure 02 - Predicate pushdown

Notice predicate pushdown took place. Currently there are no indexes to support the filters so the plan performs scans of the two input tables, but it still applies predicate pushdown where it evaluates the filters as part of the scans to reduce the sets that the join needs to operate on.

Unlike with inner joins, with outer joins the ON and WHERE clauses serve very different roles. The WHERE clause still serves the usual filtering role against the rows from the virtual table returned by the FROM clause. The ON clause serves a more sophisticated matching role. In an outer join you mark a table as preserved. From example, in Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O ON C.custid = O.custid, you mark the Customers table as preserved, and consider the predicate C.custid = O.custid as a matching predicate. This means that you want all rows from the Customers table preserved irrespective of the outcome of the matching predicate. What the ON clause determines is which rows from the nonpreserved side (Orders) to match to rows from the preserved side (Customers). A customer without a single matching order is still returned with NULLs used as place holders for the attributes from Orders.

A common bug that results from confusion around matching versus filtering is applying what’s supposed to be a matching predicate in the ON clause as a filtering predicate in the WHERE clause. For example, suppose that our query needs to filter only customers from Spain, but match orders to customers only if the customer ID in both sides is the same and the order was placed on or after 2016. Intuitively, people are used to specifying predicates that compare elements from both sides in the ON clause and predicates that are applied to elements from one side in the WHERE clause. Applying this pattern in our query results in a bug:

SELECT C.custid, O.orderid, O.orderdateFROM Sales.Customers AS C  LEFT OUTER JOIN Sales.Orders AS OON C.custid = O.custidWHERE C.country = N'Spain'  AND O.orderdate >= '20160101';

The predicate C.country = N'Spain' is supposed to be a filtering predicate and is used as such in our query. You want to keep only customers from Spain and discard the rest. However, the predicate O.orderdate >= '20160101' is supposed to be a matching predicate since you want to preserve customers who didn’t place orders during this period, but you currently apply it as a filtering predicate. So customers who did not place orders during this period are discarded. In fact, the join in this query effectively becomes an inner join. Any outer rows produced by the outer join for customers without orders have NULLs in the orderdate column, and the filter O.orderdate >= '20160101' discards those. The SQL Server optimizer applies contradiction detection and converts the outer join to an inner, producing the same plan shown earlier for the inner join query in Figure 2. To fix the bug you need to apply the predicate O.orderdate >= '20160101' as a matching predicate in the ON clause, like so:

SELECT C.custid, O.orderid, O.orderdateFROM Sales.Customers AS C  LEFT OUTER JOIN Sales.Orders AS OON C.custid = O.custidAND O.orderdate >= '20160101'WHERE C.country = N'Spain';

This query generates the following output:

custid      orderid     orderdate----------- ----------- ----------8       10970       2016-03-2422      NULL    NULL29      10887       2016-02-1329      10928       2016-03-0530      10872       2016-02-0530      10874       2016-02-0630      10888       2016-02-1630      10911       2016-02-2630      10948       2016-03-1330      11009       2016-04-0830      11037       2016-04-2169      10917       2016-03-0269      11013       2016-04-09(13 row(s) affected)

Notice that customer 22 is a customer from Spain who did not place orders since the beginning of 2016. The plan for this query is shown in Figure 3.

Figure 3: Plan with outer join

Figure 03 - Plan with outer join.jpg

Figure 03 - Plan with outer join

This time the optimizer processes the join as an outer join.

Search arguments and equality versus distinctness

Even though this article’s focus is logical query processing, I do want to spend some time describing some aspects of physical query processing and the areas where it differs from logical processing. When it comes to query tuning, one of the most critical concepts to understand is what a search argument is, or SARG in short. A SARG is a filter predicate that enables the use of an index in a manner that relies on index ordering, such as applying a seek. Here’s the general form of a SARG:

WHERE   

The operator needs to be one that represents a consecutive range of qualifying rows in an index on the filtered column. It can be =, >, >=, <, <=, BETWEEN, > AND <=, >= AND <, and so on. It cannot be <>, for example. The filtered column must be unmanipulated. The expression on the other side can be manipulated. For example, the following is a SARG:

WHERE col1 > @p

If you had an index on col1, SQL Server can apply a seek in the index to handle the filter. The following is not a SARG since you apply manipulation to the filtered column:

WHERE col1 + 1 > @p

So here, SQL Server will have to scan the data instead of using a seek in the index. In this case, you can easily convert the filter to a SARG by subtracting 1 from @p instead of adding it to col1, like so:

This is a SARG:

WHERE col1 > @p - 1

If you wonder how come SQL Server’s optimizer doesn’t apply such internal rearrangements, it’s a good question, but the fact is that in most cases it doesn’t. It could be that Microsoft chose not to add such logic to the optimizer to not make the optimization process take too long and this way become counterproductive. In cases where they can give us simple best practices to follow, not including such logic in the optimizer itself allows a more efficient optimization process.

As another example, the following query (call it Query 1) is a sargable query (a query with a search argument) since the filter predicate doesn’t apply manipulation to the filtered column:

SELECT custid, country, regionFROM Sales.CustomersWHERE region = N'WA';

Before you run the query, create the following index to support the filter:

CREATE INDEX idx_rgn_i_cid_ctry ON Sales.Customers(region)  INCLUDE(custid, country);

The execution plan for Query 1 is shown in Figure 4.

Figure 4: Plan for Query 1

Figure 04 - Plan for Query 1.jpg

Figure 04 - Plan for Query 1

Notice that the filter predicate is applied as a seek predicate since it is considered a SARG.

The following query (call it Query 2) filters only customers with a region that starts with the letter W:

SELECT custid, country, regionFROM Sales.CustomersWHERE LEFT(region, 1) = N'W';

This query is not sargable since it applies manipulation to the filtered column. Figure 5 shows the plan for this query.

Figure 5: Plan for Query 2

Figure 05 - Plan for Query 2.jpg

Figure 05 - Plan for Query 2

As you can see, the plan scans the covering index instead of applying a seek, despite the fact that the qualifying rows appear in a consecutive range in the index. To fix this problem, use the LIKE predicate instead of the left function, like so:

SELECT custid, country, regionFROM Sales.CustomersWHERE region LIKE N'W%';

This time the query is sargable as you can see in the query plan shown in Figure 3.

Figure 6: Plan for Query 3

Figure 06 - Plan for Query 3.jpg

Figure 06 - Plan for Query 3

Suppose that you need to write a query in a stored procedure or a user defined function that filters only customers from a region that is provided as input. The following code uses a local variable to emulate the routine’s parameter:

DECLARE @region AS NVARCHAR(40) = N'WA'; -- also try with NULLSELECT custid, country, regionFROM Sales.CustomersWHERE region = @region;

This query is sargable, and it returns correct results as long as the input is not NULL, like N'WA' in this example. For some customers the region is inapplicable and therefore is set to NULL. The NULL is SQL’s marker for a missing value—whether it’s missing and applicable or missing and inapplicable (our case). SQL uses three-valued predicate logic, which means that any comparison involving a NULL, either in one or both operands, results in neither the logical value true nor false, but rather in the logical value unknown. This is the case with both equality and inequality-based comparisons. So what if the above query needs to support a NULL input, in which case you want it to return all customers that have a NULL region? Try it. You get an empty set back since a comparison between two NULLs with an equality operator yields unknown, and a query filter returns only rows for which the filter predicate evaluates to true. It discards rows for which the predicate evaluates to false and unknown. As written, the query has a bug. A common way for people to fix the bug is to use the ISNULL or COALESCE function in both operands of the comparison to replace a NULL with a value that cannot normally appear in the data, like so (call this Query 4):

DECLARE @region AS NVARCHAR(40) = N'WA'; -- also try with NULLSELECT custid, country, regionFROM Sales.CustomersWHERE ISNULL(region, N'') = ISNULL(@region, N'');

Unfortunately, though, since you apply manipulation to the filtered column, the query is not sargable, as you can see in the plan shown in Figure 7:

Figure 7: Plan for Query 4

Figure 07 - Plan for Query 4.jpg

Figure 07 - Plan for Query 4

One way to make the query sargable is to check for the special case where both operands of the comparison are NULL using the IS NULL operator instead of the equality operator, like so (call this Query 5):

DECLARE @region AS NVARCHAR(40) = NULL;SELECT custid, country, regionFROM Sales.CustomersWHERE region = @region   OR (region IS NULL AND @region IS NULL);

SQL Server knows to treat this form as a SARG as you can see in the plan for this query in Figure 8.

Figure 8: Plan for Query 5

Figure 08 - Plan for Query 5.jpg

Figure 08 - Plan for Query 5

Curiously, standard SQL supports a distinct predicate as an alternative to equality and inequality comparison. The form of the predicate is IS [NOT] DISTINCT FROM . The concept of distinctness differs from equality in the treatment of NULLs. The predicate IS NOT DISTINCT FROM evaluates to true when both sides are NULL (or not NULL and the same), and to false otherwise. The predicate IS DISTINCT FROM evaluates to false when both sides are NULL (or not NULL and the same), and to true otherwise. Had this predicate been supported in T-SQL, you could have used the following filter predicate in our query:

WHERE region IS NOT DISTINCT FROM @region

Alas, T-SQL doesn’t support the distinct predicate, but as illustrated by Paul White in his article Undocumented Query Plans: Equality Comparisons, hope is not lost. Following is an elegant alternative that is supported in T-SQL (call this Query 6):

DECLARE @region AS NVARCHAR(40) = NULL;SELECT custid, country, regionFROM Sales.CustomersWHERE EXISTS (SELECT region INTERSECT SELECT @region);

This technique relies on the fact that set operators (UNION, EXCPET and INTERSECT) use distinctness—not equality—when comparing rows. When region is not distinct from @region, the INTERSECT operator returns one row, EXISTS returns true, and the row in the outer query is returned. When region is distinct from @region, the INTERSECT operator returns an empty set, EXISTS returns false, and the row in the outer query is returned. This is exactly the desired behavior. Remarkably, SQL Server’s optimizer considers this form as a SARG and the plan that you get for this query (Query 6) is the same as the one shown earlier for Query 5 in Figure 8. Since the predicate is considered a SARG it enables an index seek.

A similar construction can be used in a join when you want a distinctness based comparison, like so:

SELECT ...FROM dbo.T1  INNER JOIN dbo.T2ON EXISTS (SELECT T1.col1 INTERSECT SELECT T2.col1);

As mentioned, the standard alternative to “not equal to,” but where you get true when one operand is NULL and the other isn’t, is IS DISTINCT FROM. With our region column and @region parameter you would use the following filter predicate:

WHERE region IS DISTINCT FROM @region

Again, this form is not supported in T-SQL, but the following elegant form is:

WHERE NOT EXISTS (SELECT region INTERSECT SELECT @region)

Or, instead of checking for an empty set intersection, you can check for a nonempty set difference, like so:

WHERE EXISTS (SELECT region EXCEPT SELECT @region)

Short circuit

One of the aspects of logical query processing that can be confusing is the fact that all expressions that appear in the same logical step are evaluated as a set, and since a set has no order, there’s no guarantee that SQL Server will evaluate the expressions in order of appearance. I’ll demonstrate this through an example.

Use the following code to create and populate a table called Properties:

SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Properties', N'U') IS NOT NULL DROP TABLE dbo.Properties;GOCREATE TABLE dbo.Properties(  name     VARCHAR(128) NOT NULLCONSTRAINT PK_Properties PRIMARY KEY,  datatype VARCHAR(128) NOT NULL,  val      VARCHAR(500) NOT NULL);INSERT INTO dbo.Properties(name, datatype, val) VALUES  ('property1', 'SMALLINT', '1759'    ),  ('property2', 'VARCHAR',  'abc'     ),  ('property3', 'INT',      '43112609'),  ('property4', 'DATE',     '20110212');

The table holds various properties, with columns for the property name, datatype name and value, with the last stored as a character string. Suppose that you want to filter only integer properties that are greater than 10. You use the following query attempting to achieve this:

SELECT name, datatype, valFROM dbo.PropertiesWHERE datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT')  AND CAST(val AS BIGINT) > 10;

In this query you assume that SQL Server will evaluate the predicates in order of appearance from left to right, and that it will short-circuit when the property is not an integer one. But since all expressions in the same logical step are treated as a set, you don’t have an assurance that SQL Server will process them in written order. When I ran this query in my system, it failed with the following error:

Msg 8114, Level 16, State 5, Line 138Error converting data type varchar to bigint.

Looking at the Predicate property of the Clustered Index Scan operator in the query execution plan reveals that SQL Server rearranged the filter predicates as follows:

    CONVERT(bigint,[tempdb].[dbo].[Properties].[val],0)>(10)AND (   [tempdb].[dbo].[Properties].[datatype]='BIGINT' OR [tempdb].[dbo].[Properties].[datatype]='INT' OR [tempdb].[dbo].[Properties].[datatype]='SMALLINT' OR [tempdb].[dbo].[Properties].[datatype]='TINYINT')

Even though SQL Server actually supports a concept of a short-circuit, it does not guarantee that it will evaluate your filter predicates in written order. This is in line with the logical query processing design so it shouldn’t be surprising. But here’s where things do get surprising. You might think that you can circumvent the problem by encapsulating a query that filters only integer properties in a table expression (CTE, derived table, view), and applying a query with a conversion against the table expression, like so:

WITH C AS(  SELECT name, datatype, val  FROM dbo.Properties  WHERE datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT'))SELECT *FROM CWHERE CAST(val AS BIGINT) > 10;

From a logical query processing perspective, such code should not fail. However, for performance reasons, the SQL Server parser unnests, or inlines, the inner query’s code in the outer query, resulting in code that is equivalent to the original query without the table expression. Consequently, the code fails with the same error.

One solution to this problem is to use the TRY_CAST function instead of the CAST function to handle the conversion. The former attempts the conversion, and if successful, returns the converted value; but if unsuccessful, instead of failing it returns a NULL. This way, even if SQL Server evaluates the conversion before the other predicate, the code will not fail. A similar TRY_CONVERT function is provided as an alternative to the CONVERT function. Here’s the code that implements this solution:

SELECT name, datatype, valFROM dbo.PropertiesWHERE datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT')  AND TRY_CAST(val AS BIGINT) > 10;

Currently SQL Server supports TRY_% functions only for conversion purposes. It would be nice to have a similar function such as TRY_EXPRESSION for any type of scalar expression, since similar problems can arise due to other errors (divide by zero, overflow, input out of domain, and so on).

Another solution is to use a CASE expression that proceeds with the conversion only if the datatype is an integer, like so:

SELECT name, datatype, valFROM dbo.PropertiesWHERE  CASEWHEN datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT')  THEN CAST(val AS BIGINT)  END > 10;

SQL Server’s documentation of the CASE expression says: "You should only depend on order of evaluation of the WHEN conditions for scalar expressions (including non-correlated sub-queries that return scalars), not for aggregate expressions."

In our case we do not use an aggregate expression so we can rely on evaluation order and a short-circuit.

For more details on the topic, see the article Unwanted T-SQL Expression Failures.

WHERE and column aliases

Often people want to use column aliases that they created in the SELECT list for columns that result from computations in the WHERE clause, like so:

SELECT custid, CONCAT(country, N'.' + region, N'.' + city) AS custlocationFROM Sales.CustomersWHERE custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle');

However, remember that in terms of logical query processing, the WHERE clause (step 2) is evaluated before the SELECT clause (step 5). Consequently, aliases created in the SELECT clause are not visible to expressions in the WHERE clause. This code generates the following errors:

Msg 207, Level 16, State 1, Line 204Invalid column name 'custlocation'.Msg 207, Level 16, State 1, Line 204Invalid column name 'custlocation'.Msg 207, Level 16, State 1, Line 204Invalid column name 'custlocation'.

The reason that you get three errors is that the predicate custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle') is internally converted to a conjunction of three predicates: custlocation = N'Spain.Madrid' OR  custlocation = N'France.Paris' OR custlocation = N'USA.WA.Seattle'.

An obvious workaround is to use a table expression such as a CTE or a derived table, where you create the alias in the inner query and use it anywhere you like in the outer query. A more elegant solution is to combine the use of the APPLY operator with the VALUES clause (table value constructor), and this way create the aliases that you need very early in logical processing, as part of the processing of the FROM clause. This will make the aliases available to clauses that are evaluated in subsequent phases, like the WHERE clause. Applied to our example, the solution code looks like this:

SELECT C.custid, A.custlocationFROM Sales.Customers AS C  CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city) ) )    AS A(custlocation)WHERE A.custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle');

For more details on this technique and further examples, see Logical Query Processing Part 3: The FROM Clause and APPLY.

Would that it WHERE so simple

You would think that the WHERE clause is just a basic filter and that there shouldn’t be much to say about it. As it turns out, there’s a lot involved. Logical query processing explains why you cannot refer to aliases that were defined in the SELECT clause in the WHERE clause, and why there’s no assurance for the order in which the expressions in the WHERE clause will be evaluated. You also want to make sure you understand the complexities of NULL treatment, such as the difference between equality-based and distinctness-based comparisons. A good understanding of this topic helps you write correct and robust code. It’s also important to understand physical query processing considerations, such as which predicate forms constitute a search argument and which don’t, so that you can write optimal queries.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like