Logical Query Processing: The FROM Clause and Joins
In this article I delve into the particulars of individual clauses, focusing on the FROM clause and the JOIN operator.
February 11, 2016
Last month I provided an overview of logical query processing. I explained that this term represents the logical interpretation of a query. I provided a sample database called TSQLV4, two sample queries that I referred to as simple sample query and complex sample query, and a graphical depiction of logical query processing as a flow chart. Starting with this article, I delve into the particulars of individual clauses--this time focusing on the FROM clause and the JOIN operator.
The FROM Clause
The FROM clause is the first major query clause that is logically processed, despite the fact that the SELECT clause is written first. There are two main things that you specify in this clause. One, this is where you specify the input tables of the query. Two, this is where you specify table operators like JOIN, APPLY, PIVOT and UNPIVOT, which operate on the input tables.
If you specify only one table in the FROM clause without any table operators, the output of this step is also the input of the step—the input table.
If you do have table operators involved, they are logically evaluated in written order—from left to right—with the output of one operator becoming the left input of the next. This logical order defines which elements are or aren’t available to any given step. What one step creates is visible to subsequent steps, but not the other way around.
Note that from a physical processing perspective, SQL Server can alter the logical processing order when it knows that the output will remain the same, and the plan with the altered order is estimated to have a lower cost than the one with the original order. In the next section about the JOIN operator I’ll get into more details.
The JOIN Operator
In the relational model, a join is not limited to being a binary operator but rather is an n-ary operator. In SQL, though, a join is defined as a binary operator with two inputs—the left and right inputs. SQL supports three fundamental join types: cross, inner and outer. Each join type implements a different subset of the following steps:
1-J1. Apply Cartesian product
1-J2. Identify matches based on the ON predicate
1-J3. Add outer rows
A cross join involves only step 1-J1, an inner join involves 1-J1 and 1-J2, an outer join involves 1-J1, 1-J2 and 1-J3.
To understand these, consider the FROM clause of the simple sample query from the overview article:
FROM Sales.Customers AS C
LEFT OUTER JOIN Sales.Orders AS O
ON C.custid = O.custid
1-J1. Apply Cartesian product
The first step in the join applies a Cartesian product between the two inputs of the join. Since there are 91 rows in the Customers table and 830 rows in the orders table, the result has 75,530 rows, shown here in abbreviated form:
C.custid C.companyname ... O.orderid O.custid O.orderdate ...--------- -------------- ---------- --------- ------------72 Customer AHPO 10248 85 2014-07-0472 Customer AHPO 10249 79 2014-07-0572 Customer AHPO 10250 34 2014-07-08...72 Customer AHPO 10471 11 2015-03-1172 Customer AHPO 10472 72 2015-03-1272 Customer AHPO 10473 38 2015-03-13...83 Customer ZRND 10993 24 2016-04-0183 Customer ZRND 10994 83 2016-04-0283 Customer ZRND 10995 58 2016-04-02...83 Customer ZRND 11075 68 2016-05-0683 Customer ZRND 11076 9 2016-05-0683 Customer ZRND 11077 65 2016-05-06
Note that the conceptual treatment of the join starts with a Cartesian product, but the SQL Server query optimizer will often manage to avoid it in practice in the plan that it builds. Remember that the physical processing doesn’t have to follow the exact steps defined by the logical processing, rather only produce the same result. The whole point of optimization in the physical layer is to apply shortcuts when possible.
Also observe in the result of this step that the column names are prefixed by the table name, or alias, if assigned. This is done to allow referring to a column name in an unambiguous manner if both inputs have the same column name--like custid, in our case.
The connection to the relational model has to do with the fact that the result of each step in logical query processing is a virtual table, representing a relation in the relational model. A relation has two parts: a header, which is a set of attributes (what SQL calls columns), and a body, which is a set of tuples (what SQL calls rows). A set in mathematics has no order. So an attribute is not supposed to be identified by position, but rather by name, and therefore attribute names in a relation must be unique. Since both source tables have a column called custid, you get unique names by adding the prefixes.
What’s interesting is that if you assign table aliases, you rename the tables for the duration of the query. In all subsequent logical query processing phases if you try to refer to the full table name, as in Customers.custid, you’ll get an error. You need to use the alias, as in C.custid.
1-J2. Identify matches based on the ON predicate
The purpose of the ON clause is to identify matches in the result of the previous step based on the specified predicate, which in our case is ON C.custid = O.custid. SQL uses three valued predicate logic, where the outcome of a predicate is one of the three logical values: true, false and unknown. With an equality operator, when both sides of the predicate are not NULL you get either true or false. You get a true when they are the same--for example, in this row:
C.custid C.companyname ... O.orderid O.custid O.orderdate ...--------- -------------- ---------- --------- ------------72 Customer AHPO 10472 72 2015-03-12
You get a false when they are different, for example in this row:
C.custid C.companyname ... O.orderid O.custid O.orderdate ...--------- -------------- ---------- --------- ------------72 Customer AHPO 10248 85 2014-07-04
If you have a NULL in either side or both, the result is the logical value unknown. That’s the case with both equality and inequality operators. In our sample data, we don’t have any NULLs in the C.custid or O.custid columns, so the predicates evaluate to true or false in all cases. Here’s the result of the ON predicate applied to the rows in the result of the previous step (again, abbreviated):
C.custid C.companyname O.orderid O.custid O.orderdate C.custid = O.custid?--------- -------------- ---------- --------- ------------ --------------------72 Customer AHPO 10248 85 2014-07-04 false72 Customer AHPO 10249 79 2014-07-05 false72 Customer AHPO 10250 34 2014-07-08 false...72 Customer AHPO 10471 11 2015-03-11 false72 Customer AHPO 10472 72 2015-03-12 true72 Customer AHPO 10473 38 2015-03-13 false...83 Customer ZRND 10993 24 2016-04-01 false83 Customer ZRND 10994 83 2016-04-02 true83 Customer ZRND 10995 58 2016-04-02 false...83 Customer ZRND 11075 68 2016-05-06 false83 Customer ZRND 11076 9 2016-05-06 false83 Customer ZRND 11077 65 2016-05-06 false
C.custid C.companyname O.orderid O.custid O.orderdate C.custid = O.custid?--------- -------------- ---------- --------- ------------ --------------------72 Customer AHPO 10248 85 2014-07-04 false72 Customer AHPO 10249 79 2014-07-05 false72 Customer AHPO 10250 34 2014-07-08 false...72 Customer AHPO 10471 11 2015-03-11 false72 Customer AHPO 10472 72 2015-03-12 true72 Customer AHPO 10473 38 2015-03-13 false...83 Customer ZRND 10993 24 2016-04-01 false83 Customer ZRND 10994 83 2016-04-02 true83 Customer ZRND 10995 58 2016-04-02 false...83 Customer ZRND 11075 68 2016-05-06 false83 Customer ZRND 11076 9 2016-05-06 false83 Customer ZRND 11077 65 2016-05-06 false
If the join is an inner join, the result of the second step is also the result of the join. But if the join is an outer one, as is the case in our simple sample query, there’s a third step involved.
1-J3. Add outer rows
The third step is only applicable to outer joins. In an outer join, you mark which table or tables you want to preserve: left, right or both (full outer join). In our example, the left table (Customers) is the preserved one. The join needs to preserve all rows from the preserved table irrespective of whether the ON clause identified any matches from the right table (Orders). This step identifies rows from the preserved side that don’t have any matches and adds those to the result with NULLs as placeholders in the nonpreserved side’s columns. In our example, the customers with IDs 22 and 57 don’t have matching orders and therefore this step adds those to the result of the previous step:
C.custid C.companyname ... O.orderid O.custid O.orderdate ...--------- --------------- ----------- --------- ------------85 Customer ENQZT 10248 85 2014-07-0479 Customer FAPSM 10249 79 2014-07-0534 Customer IBVRG 10250 34 2014-07-0884 Customer NRCSK 10251 84 2014-07-0876 Customer SFOGW 10252 76 2014-07-0934 Customer IBVRG 10253 34 2014-07-1014 Customer WNMAF 10254 14 2014-07-1168 Customer CCKOT 10255 68 2014-07-1288 Customer SRQVM 10256 88 2014-07-1535 Customer UMTLM 10257 35 2014-07-16...22 Customer DTDMN NULL NULL NULL57 Customer WVAXS NULL NULL NULL
Note that in SQL Server, the physical processing typically doesn’t split the work to the three steps I described here; like I said, its only responsibility is to return the same output.
If another JOIN operator exists in the FROM clause, the result of the first join becomes the left input of the second join, and the relevant subset of steps are then applied to that join depending on the join type. Figure 1 puts in all together as a flowchart.
Figure 1: Logical query processing flow chart - Joins
Figure 01 - Logical query processing flow chart - Joins
Controlling the Logical Processing Order of Joins
As mentioned, normally joins are logically processed in written order, from left to right. As an example, consider the following query:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM Sales.Customers AS C INNER JOIN Sales.Orders AS OON C.custid = O.custid INNER JOIN Sales.OrderDetails AS ODON O.orderid = OD.orderid;
This query returns a total of 2,155 rows.
The first join that takes place here is the one between Customers and Orders based on the ON predicate C.custid = O.custid. The second join takes place between the result of the first join and the OrderDetails table based on the ON predicate O.orderid = OD.orderid. The ON predicates are actually the ones that define which units are joined together. They create what I like to think of as implied parentheses. You can actually add explicit parentheses. The above query implicitly defines the following parentheses:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM ( Sales.Customers AS C INNER JOIN Sales.Orders AS O ON C.custid = O.custid ) INNER JOIN Sales.OrderDetails AS ODON O.orderid = OD.orderid;
Since all joins in the query are inner joins, any nonmatches from either side are discarded. For example, there are two customers (with IDs 22 and 57) that don’t have any matching orders, and therefore they don’t appear in the output. Suppose that you wanted to preserve all customers in the result. The natural inclination is to change the join between Customers and Orders to a left outer join, like so:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS OON C.custid = O.custid INNER JOIN Sales.OrderDetails AS ODON O.orderid = OD.orderid;
Surprisingly, when you run this query you still get the same output as before with 2,155 rows, without the two extra customers that you expected. Can you explain why?
Logical query processing starts with the left outer join between Customers and Orders, preserving the two customers that don’t have orders. The result rows for those two customers are populated with NULLs as placeholders in the columns from the Orders table. Then, the second join is an inner join between the result of the first join and OrderDetails. When evaluating the predicate O.orderid = OD.orderid for the outer rows, the result is the logical value unknown since O.orderid is NULL in those rows. Consequently, the outer rows are discarded. This basically nullifies the outer join, effectively turning it into an inner join. You can generalize this and say that any left outer join that is subsequently followed by an inner join or a right outer join that compares something from the nonpreserved side of the join with anything, nullifies the outer join.
Interestingly, when SQL Server optimizes this query it applies contradiction detection, realizes that the outer join is meaningless, and converts the outer join to an inner join, as you can see in the query execution plan shown in Figure 2.
Figure 2: Query plan with inner join
Figure 02 - Contradiction detection
So, how do you fix this? The answer is, you don’t want a left join between Customers and Orders and then an inner join between the result and OrderDetails. You want a left join between Customers and the result of the inner join between Orders an OrderDetails. Remember that you control which units are joined with the ON clause, which must follow the two units that it connects. To join Orders and OrderDetails, you place the ON predicate O.orderid = OD.orderid right after this join. Let’s call this result OOD. To join Customers and OOD, you need the ON predicate C.custid = O.custid to appear after this join. Putting this all together, here’s how you’re supposed to write the query:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderidON C.custid = O.custid;
Now the output contains 2,157 rows, correctly including the two customers that don’t have matching orders.
This query uses implied parentheses, but for clarity, it’s strongly recommended to specify them explicitly, like so:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM Sales.Customers AS C LEFT OUTER JOIN ( Sales.Orders AS O INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid )ON C.custid = O.custid;
The execution plan for this query now processes the join as an outer join as shown in Figure 3.
Figure 3: Query plan with outer join
Figure 03 - Plan with outer join
The above query represents part of the work involved in the complex sample query [BG: Link?]. In addition, the query is required to return only orders placed since the beginning of 2016. Be careful not to specify the predicate ordedate >= '20160101' in the WHERE clause, since doing so will filter out all customers who didn’t place orders during this period, and our query is supposed to preserve all customers. One way to achieve this is to add this predicate to the ON clause that connects Orders and OrderDetails, like so:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM Sales.Customers AS C LEFT OUTER JOIN ( Sales.Orders AS O INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid AND O.orderdate >= '20160101' )ON C.custid = O.custid;
In an inner join, the ON clause serves a filtering purpose, so the join between Orders and OrderDetails returns only the rows where the order date is on or after 2016. Then the outer join between Customers and the result of the join between Orders and OrderDetails preserves all customers.
Another way to achieve the same task is to add the predicate to the outer join’s ON clause, like so:
SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitpriceFROM Sales.Customers AS C LEFT OUTER JOIN ( Sales.Orders AS O INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid )ON C.custid = O.custidAND O.orderdate >= '20160101';
The two queries are logically equivalent, and in fact, the optimizer produces identical plans for both. To me, the first is more natural. The important thing, like I said, is not to specify the predicate in the WHERE clause because doing so will nullify the outer part of the outer join. I’ll elaborate on the difference between ON and WHERE in a future article in the series that focuses on the WHERE clause.
Controlling the physical processing order of joins
The previous section described how to control the logical processing order of joins. As mentioned, the implementation (SQL Server in our case) doesn’t have to physically process the joins in the same order as the logical processing does, rather it’s only required to produce the same result. So the optimizer explores different join orders and picks the plan with the lowest estimated cost between the ones it explored.
The types of rearrangements that the optimizer can apply depend on the join type that you’re using. For outer joins, the only rearrangement allowed is to change A LEFT JOIN B to B RIGHT JOIN A, and the other way around. But it cannot completely change the order of the tables in a multi join query, since that could change the meaning of the query. With inner and cross joins it’s open season for the optimizer to do any kind of reordering that it likes. This includes the order of appearance of the tables, e.g., rearranging the following joins:
A INNER JOIN B INNER JOIN C INNER JOIN D
To the following joins:
B INNER JOIN C INNER JOIN A INNER JOIN D
This also includes rearranging the packing of units. For example, consider the following arrangement:
A INNER JOIN B
ON A.x = B.x
INNER JOIN C
ON B.y = C.y
INNER JOIN D
ON C.z = D.z
Remember the ON clause determine which units get connected. Written in this manner you get the following implied parentheses:
( ( A INNER JOIN B
ON A.x = B.x )
INNER JOIN C
ON B.y = C.y )
INNER JOIN D
ON C.z = D.z
The optimizer can change this arrangement, for example, to the following:
A INNER JOIN
B INNER JOIN C
ON B.y = C.y
ON A.x = B.x
INNER JOIN D
ON C.z = D.z
Where the implied parentheses are:
( A INNER JOIN
( B INNER JOIN C
ON B.y = C.y )
ON A.x = B.x )
INNER JOIN D
ON C.z = D.z
According to Benjamin Nevarez in the article Optimizing Join Orders, in combinatorics, the number of possible arrangements for N tables is (2N − 2)! / (N − 1)!, where the exclamation point (!) is factorial. For example, with 5 tables you get 1,680 arrangements. With 10 tables you get 17,643,225,600 arrangements! (here the exclamation point is shock). Since the optimizer has time and cost thresholds for optimization, you realize that with more tables in the query, the less likely it is that it will find the truly optimal order. If you suspect that the optimizer chose a grossly inefficient order, you can force it to use an order that you think that is more optimal. You do so by adding the query hint OPTION(FORCE ORDER) at the end of the query. If you want to force join order in all queries in the session, you can set the session option FORCEPLAN to ON. Note that forcing order is recommended mainly as a troubleshooting step, and if used in production, then only as a short term temporary solution. The reason for this is that you want the optimizer to be able to change its choices over time due to changing conditions like changes in indexing and data characteristics.
As an example for forcing physical join ordering, consider the following query:
SELECT C.custid, C.country, O.orderid, OD.productid, P.productname, OD.qty, OD.unitpriceFROM Sales.Customers AS C INNER JOIN Sales.Orders AS OON C.custid = O.custid INNER JOIN Sales.OrderDetails AS ODON O.orderid = OD.orderid INNER JOIN Production.Products AS PON Od.productid = P.productid;
On my system the optimizer chose the plan shown in Figure 4 for this query.
Figure 4: Query plan for multi-join query
Figure 04 - Plan for multi-join query
The optimizer physically reordered the joins to the following:
SELECT C.custid, C.country, O.orderid, OD.productid, P.productname, OD.qty, OD.unitpriceFROM Production.Products AS P INNER JOIN Sales.Customers AS C INNER JOIN ( Sales.Orders AS O INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid ) ON C.custid = O.custidON Od.productid = P.productid;
If you think, for example, that the originally written order is the optimal, you can force it, like so:
SELECT C.custid, C.country, O.orderid, OD.productid, P.productname, OD.qty, OD.unitpriceFROM Sales.Customers AS C INNER JOIN Sales.Orders AS OON C.custid = O.custid INNER JOIN Sales.OrderDetails AS ODON O.orderid = OD.orderid INNER JOIN Production.Products AS PON Od.productid = P.productidOPTION (FORCE ORDER);
Remember, the implied parentheses here are the following:
SELECT C.custid, C.country, O.orderid, OD.productid, P.productname, OD.qty, OD.unitpriceFROM ( ( Sales.Customers AS C INNER JOIN Sales.Orders AS O ON C.custid = O.custid ) INNER JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid ) INNER JOIN Production.Products AS P ON Od.productid = P.productidOPTION (FORCE ORDER);
The plan for this query is shown in Figure 5.
Figure 5: Query plan with forced order
Figure 05 - Query plan with forced order
What’s next?
This article is the second part in the series about logical query processing. Here I focused on the FROM clause, and within it more specifically on the JOIN table operator. I described the logical steps involved in a join, how to control logical join ordering, and how to control physical join ordering. Next month I’ll continue the discussion on the FROM clause looking at additional table operators.
About the Author
You May Also Like