Get in the Loop with CTEs
Use common table expressions to recursively manipulate data
April 19, 2004
SQL Server 2005, formerly code-named Yukon, introduces a wealth of new T-SQL features and enhancements, which I briefly highlighted in my November 2003 article, "What's New in Yukon T-SQL." But in my opinion, common table expressions (CTEs) are by far the most important and exciting new SQL Server 2005 T-SQL feature. CTEs come in two forms: non-recursive and recursive. In this article, I delve into the details of CTEs. I first discuss non-recursive CTEs, then recursive CTEs in single-parent (tree) scenarios. After you've read the article, see the sidebar "Practice What You've Learned," to test your CTE skills.
Non-Recursive CTEs
A CTE is a named table expression that Microsoft implemented in SQL Server 2005 according to the ANSI SQL-99 standard. In its non-recursive form, a CTE is an alternative to derived tables, views, and inline user-defined functions (UDFs). A derived table is a named table expression that exists for the duration of a query. If you need to refer to a derived table multiple times in an outer query, you must define multiple derived tables by using the same query for each. This requirement can lead to lengthy code that's difficult to read and maintain and makes a derived table less advantageous than a view, as I discuss shortly. One advantage that derived tables have over views is that a derived table can refer to variables you defined in the same batch. Views are also named table expressions that persist in the database until you explicitly drop them. Unlike with derived tables, a query can refer to a view name multiple times; however, the view can't refer to variables. Inline UDFs have the same characteristics as views except that they can refer to the function's input arguments. CTEs have the best features of derived tables, views, and inline UDFs—they are named table expressions that exist only for the duration of the query, that an outer query can refer to multiple times, and that can refer to variables defined in the calling batch.
Let's jump right into an example that demonstrates all these advantages. Run Listing 1's code in SQL Server 2005's new AdventureWorks sample database to return current and previous yearly total sales values from the SalesOrderHeader and SalesOrderDetail tables. Listing 1's code first defines a variable called @CustID and sets its value to 3. The code then defines the CTE, using the WITH clause followed by the CTE name (YearlyCustOrdersCTE) and an optional result column list in parentheses (OrderYear, TotalValue). After the AS clause, the CTE's body contains a query that returns the order year and total sales value. The outer query following the CTE's body refers to the CTE name twice—for current and previous yearly sales—by left-joining the instance of the CTE representing current yearly sales (Cur) to another instance representing previous yearly sales (Prev).
Sometimes in SQL Server 2005, the purpose of the WITH clause is ambiguous because you can use it for other purposes (e.g., to specify a table hint). So, you should suffix the statement preceding the CTE's WITH clause with a semicolon to specify where the preceding statement ends. In T-SQL, semicolons are required only as a suffix to a statement preceding a CTE's WITH clause. However, ANSI supports the use of a semicolon to specify where any statement ends, so suffixing all statements with a semicolon is good programming practice.
You can use one WITH statement to define multiple CTEs. Each CTE's query expression can refer to any CTE you defined before it in the same WITH statement. The outer query can also refer to all CTEs you defined. As an example of defining multiple CTEs, run the code in Listing 2 to return, for each order in the SalesOrderHeader table, the first and last dates of the order month.
Listing 2's first CTE (OrdersCTE) returns the sales order ID and order date from the SalesOrderHeader table. The second CTE (OrdersBOMCTE) calculates the beginning of the order month by referring to the first CTE. The third CTE (OrdersBOMEOMCTE) calculates the end of the order month by referring to the second CTE. The outer query simply returns all columns from the third CTE.
You can even modify data through CTEs. Before I show you an example of how to modify data, run the code that Listing 3 shows to create the SalesSummary table and populate it with sample data. The table contains current yearly sales quantities and values in the Qty and Value columns, respectively, but all new rows contain NULLs in the previous year's PrevQty and PrevValue columns.
Suppose you want to update the PrevQty and PrevValue columns by querying the SalesSummary table. If you want to stick to ANSI-compliant solutions, you have to use two subqueries, one for each column, which requires scanning the table twice. If you don't care about ANSI compliance, you can use T-SQL-specific syntax that allows joins in an UPDATE statement. With CTEs, you can have the best of both worlds—the ability to scan the table once and remain ANSI compliant. Run the code that Listing 4 shows to update the SalesSummary table by using a CTE called SalesCTE. This CTE performs a left outer join to match current yearly sales with previous year's sales. The outer UPDATE query modifies the previous-year columns with data from the current-year columns.
Recursive CTEs: Single Parents
TEs can contain references to themselves and, by doing so, apply recursion. Recursive CTEs are useful in manipulating single-parent hierarchies (trees) such as employee organizational charts and forum discussions. They're also useful for multiparent hierarchies (graphs) such as bills of materials (BOMs).
The use of recursive CTEs in T-SQL has several important advantages over other ways of manipulating hierarchical data both in SQL Server and in other database systems. Other methods for manipulating hierarchical data in SQL Server include the use of UDFs or stored procedures that apply recursion or loops to data from temporary tables or table variables. The code of such UDFs and stored procedures is usually lengthy, hard to follow, and hard to maintain—and it's not ANSI compliant. Recursive CTEs keep your code short and ANSI compliant. For hierarchical data manipulation, some other database systems use proprietary code that's non-portable; others apply cursors internally, making the solution inefficient. Recursive CTEs use a series of set-based queries, making for highly efficient solutions. Figure 1 shows the syntax of a basic recursive CTE.
The CTE header functions the same way in recursive and non-recursive queries, so let's focus on the recursive CTE's body and outer query. The body includes at least two queries separated by a UNION ALL operator. (Later, I discuss options you have when specifying more than two queries.) The Anchor Member is a query that SQL Server invokes only once and that doesn't reference the CTE's name. The Recursive Member—a query that contains a reference to the CTE's name—uses the query result in its first iteration. SQL Server invokes the Recursive Member repeatedly until it returns an empty set. In the first invocation, the reference to the CTE's name represents the result the Anchor Member returned. In subsequent invocations, the reference to the CTE's name represents the result the previous invocation of the Recursive Member returned. The reference to the CTE's name in the outer query represents the UNION ALL of all the invocations of both members.
Let's walk through an example to better understand recursive CTEs. Run the code that Listing 5 shows to return Employee 12 and all subordinates at all levels. The Anchor Member at callout A returns details about Employee 12. Then, the Recursive Member at callout B returns the direct subordinates of the employees that the previous step returned. Because the first step is the invocation of the Anchor Member, the Recursive Member returns direct subordinates of Employee 12 (Employee 3). The next invocation returns direct subordinates of Employee 3 (4, 9, 11, 158, 263, 267, 270). The next invocation returns direct subordinates of employees 158 (79, 114, 217) and 263 (5, 265). The final invocation of the Recursive Member returns no subordinates, so SQL Server doesn't invoke the Recursive Member again. The outer query returns Employee 12 and all that employee's subordinates at all levels.
Set-Operation Rules
As I mentioned, a recursive CTE can contain more than two members and has rules governing which set operations (UNION, UNION ALL) you can use between the members. UNION ALL takes two inputs and generates a result containing the rows from both inputs, and UNION does the same with the addition of performing a DISTINCT operation over the unified results to remove duplicates. (For details about the UNION and UNION ALL set operations, see my December 2003 column, "Set-Operation Alternatives") When one of the CTE members is a Recursive Member, SQL Server repeatedly performs the UNION ALL set operation between members of a recursive CTE until the Recursive Member returns an empty set. SQL Server 2005 doesn't support recursive CTEs where a Recursive Member participates in a UNION operation, directly or indirectly.
Table 1 shows valid and invalid set operations in a recursive CTE. Notice that any scenario where a Recursive Member participates in a UNION operation is invalid. For example, in the scenario AM U RM, the Recursive Member participates in a UNION operation directly, which is invalid. In the scenario RM UA AM U AM, the Recursive Member participates in a UNION operation indirectly because SQL Server performs a UNION between the result of RM UA AM and AM, also an invalid operation.
Listing 6 contains code examples of two recursive CTEs. The first is a valid CTE that applies the AM U AM UA RM scenario; the second CTE is invalid because it applies the AM UA RM U AM scenario.
Limiting Recursion
When you have recursion in any environment, you risk causing infinite recursive calls. Such a situation might happen as a result of a logical bug in your code—for example, if your Recursive Member returns the same rows as the Anchor Member. Infinite recursive calls can also result from invalid data entered into your database. For example, suppose your table contains data specifying that Employee 10 manages Employee 11 and Employee 11 manages Employee 10. Such a scenario is called a cyclic relationship. SQL Server 2005 doesn't provide a tool that detects cyclic relationships, but it does let you limit the number of recursive iterations to prevent infinite recursive calls. You can add the MAXRECURSION option in an OPTION clause you specify at the end of the outer query and provide a number that determines the maximum number of invocations for the Recursive Member. SQL Server 2005 defaults to a maximum of 100 recursion calls if you don't specify a value. If you specify 0, you remove any limitation on the number of recursive calls.
As an example of using the MAXRECURSION option, run the code in Listing 7 to return Employee 109 and direct and indirect subordinates, with a limit of three recursive calls. Note that if the number of recursive invocations reaches the limit you specify in the MAXRECURSION option before the recursive member returns an empty set (normal recursion termination), your code fails and SQL Server returns the following error message:
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 3 has been exhausted before statement completion.
In such a case, you're not guaranteed to get any result set. Therefore, don't use the MAXRECURSION option if you want to logically limit the number of returned levels (as opposed to using it as a safety measure against infinite recursive calls). For example, if you want to return only three levels of subordinates below Employee 109, you should devise a custom solution instead of using MAXRECURSION.
An efficient way to logically limit the number of levels to return is to generate a pseudo level column (call it lvl). You return a 0 value in the pseudo lvl column in the Anchor Member, return lvl + 1 in the same column in the Recursive Member, and specify lvl < level_limit in the Recursive Member's filter. To see how to use the pseudo lvl column, run the code in Listing 8, which returns Employee 109 and subordinates three levels down.
The real power of CTEs lies in their recursive manipulation of data. CTEs let you write short, efficient, ANSI-compliant code to manipulate hierarchies. In my next column, I'll show you how to manipulate multiparent hierarchies, then I'll explore advanced concepts related to recursive CTEs.
About the Author
You May Also Like