The Zen of Recursion
Use SQL Server stored procedures recursively to traverse hierarchies
December 19, 2000
Recursion is the property or quality of self-reference. In programming terms, recursion refers to a routine that repeatedly calls itself until a certain condition is fulfilled. One powerful way that SQL Server database developers can use recursion is to traverse hierarchies—collections of items that have parent-child relationships. People deal with hierarchies all the time: Every day, you have a series of tasks you must prioritize and then perform. Also, most people work within a hierarchy of employees. The classic way to represent a hierarchy within a database is to add a parent key to the object. A parent key is simply the primary key of an object that has a parent relationship to the current object. To see an example, let's define a management structure. Table 1 defines a typical employee, Chris; the record for Chris's boss appears in Table 2. Note that the boss's record has an ISupervisorID of 0 and that Chris's record has an ISupervisorID of 212, the IEmployeeID value in the boss's record. These values specify that the boss record is a parent of Chris's record. The matching values of ISupervisorID in Chris's record and IEmployeeID in the boss's record form a data link—or parent key—between the two records.
Now, as a simple example of stored-procedure recursion in a hierarchy, let's say that you're tasked with generating a list of all employees and their subordinates. You need to write a script to retrieve all the records at the top level (i.e., the employees whose ISupervisorIDs are 0), then grab the child records for each record in that set, and so on in a recursive process. Listing 1 contains pseudocode for such a process, and Table 3 is an example of output that meets your needs.
This approach is logically sound, but when you turn the pseudocode into the SQL code in Listing 2 and run it, you get the execution error that Figure 1 shows. The code worked for a level, then failed because SQL Server declared the cursor named cur_Level with a connection-level scope. This means that during its lifetime, the cursor can be accessed across every query that occurs on the same connection. Because the recursive calls occur on this connection, they inherit the cursor, causing an error whenever there's an attempt to create another cursor with the same name. You have two ways of fixing this problem.
First, you could change the Default to local cursor database option setting (support for local cursors is a feature of SQL Server 7.0). By default, this option is set to GLOBAL to conform to the requirements of earlier SQL Server releases, but you can set the option to LOCAL to bypass this problem if you think you'll be using locally scoped cursors often.
A second option for getting rid of the error is to simply add the LOCAL keyword to the declaration, as in Listing 3. After you add the LOCAL keyword, the code performs perfectly, giving you the results that Figure 2 shows.
Now let's look at a more common and useful example: a threaded discussion list. Recursion is a good tool to use in creating a discussion list because messages are presented in a parent-child format, with base-level messages first, followed by replies to that base-level message.
A threaded discussion list is a typical hierarchy: Messages are either top-level subjects or responses (children) to other messages. The message structure looks like Table 4. Let's say that this message board becomes enormously popular, receiving thousands of hits per second. An effective scaling strategy for a high-use application is forward caching, which means that you pregenerate the content to display later. Inserting the content into a cache table ahead of time conserves the server resources that the board's high-traffic conditions demand. Because performance is so important, forward caching is a better option than simply selecting the content through stored procedures.
When you use ADO to access the output of these procedures, the results are represented as a series of recordsets that are accessible through ADO's NextRecordset method (for more information about NextRecordset, see MSDN online documentation). The process of generating messages for users of the discussion list comprises two steps. First, when the user selects the main page, SQL Server generates the message board from a forward cache table called tblForumCache. Then, when the user creates a new message or responds to one, SQL Server uses the recursive stored procedure to regenerate the cache table.
Besides simply displaying the message, you'll want to display how many replies a particular message has, as well as the date and time that each message was created. The final output should look something like Table 5.
You can improve the user experience with this application. Let's say that for expanding and collapsing these elements, you're going to use a Dynamic HTML (DHTML) function that relies on each element's ancestry (i.e., the hierarchy of parent and child relationships up to the top of the data structure). Here's an example message that has two levels of replies:
- SQL Server - entered 6/28/00 3:15 PM by cbehrens
- Stored Procedures - entered 6/28/00 4:15 PM by mjohnson
- Recursion - entered 6/28/00 5:15 PM by hsmith
For each message, you need to generate an ancestry code consisting of the IMessageID values of all the messages that came before. This ancestry code, which the DHTML functions will use, allows for the locating and sorting of messages that belong to each top-level message.
(100) - SQL Server - entered 6/28/00 3:15 PM by cbehrens
(100_101) - Stored Procedures - entered 6/28/00 4:15 PM by mjohnson
(100_101_102) - Recursion - entered 6/28/00 5:15 PM by hsmith
Each line begins with an ancestry code; the message titled "SQL Server" has nothing but its own ID (100) because it's a base-level message. The message titled "Stored Procedures" is a child of the first message, so it carries a code of 100_101: the message ID of its parent plus its own message ID. And, the message titled "Recursion" has a code of 100_101_102—namely, the IDs of each message in its ancestry plus its own message ID.
Table 6 shows the structure of the cache table that you use to pregenerate content, as I discussed earlier. Table 7 lists each of the query elements and their definitions.
The first thing you need to do in this process is to clear the old cache. To do this, simply delete the old data:
DELETE FROM tblForumCache
Next, you need to use a separate procedure to perform the recursion. Let's name this procedure sp_RecurseMessages and give it two parameters: IMessageID and IDepthLevel. To call the procedure, run this command:
EXEC sp_RecurseMessages 0, 0
These parameters tell the procedure to start with a depth level and message ID of 0. You need to specify a 0 value so that the procedure can pass the information back and forth between the levels of recursion. The pseudocode for this procedure is similar to that for the previous procedure, with a few additions, as Listing 4 shows.
The finished recursive code, which Listing 5 shows, opens the cursor initially by selecting the children of the designated parent (initially the parent will be the top-level message because the parent ID is 0). After the procedure enters the record loop, it must update the parent record with an additional reply count that corresponds to the parent message. The code then runs a second procedure to obtain the ancestry of the message and inserts the ancestry values into the cache table.
Next, the stored procedure searches for children of the current message, after incrementing the depth variable to show that the code is proceeding down a level. Then the original procedure calls itself, proceeding through the same steps and perhaps down to the children's children. After the children recordsets are exhausted, the procedure that was called recursively goes up a level, returning control to the calling procedure, and decrements the depth count to reflect that change in level.
Note that the code initializes the @szAncestry variable before passing it to the sp_GetAncestry function. You could optimize the sp_GetAncestry procedure in various ways, such as getting the reply count all at once rather than running a query each time, but this method is a good and simple way to present this stored procedure.
Finally, you need to create the function that actually yields the ancestry. In this case, you don't want a recordset result; a simple string will suffice. To get the string value of the procedure, you declare the @szAncestry variable that you're passing in as an OUTPUT parameter:
CREATE PROCEDURE sp_GetAncestry@szPath varchar(500) OUTPUT,@lMessageID int
The rest of the stored procedure, which Listing 6 shows, essentially repeats the simple recursion function. Notice that the procedure again declares the cursor as local to avoid scoping problems. The code then processes down the hierarchy and back up, picking up the message ID with an underscore along the way, to form the ancestry code that the procedure needs.
You can implement this procedure by placing a trigger on UPDATE, INSERT, and DELETE events in the tblMessage table. Then, when a message is updated, inserted, or deleted, the trigger clears the cache and executes the cache-rebuilding function. Although this approach will cause a slight hiccup when the user creates a message, 90 percent of the users, who won't post a message anyway, will enjoy greatly improved performance because of the forward caching.
To optimize this flow even further, you could try the following three tricks:
Generate ancestry gradually. You could use a series of stored procedures to build a history string as the procedure traverses the hierarchy, trimming and adding on where necessary. Doing so would save the overhead of the sp_GetAncestry call.
Generate ReplyCount all at once. You could run one query to determine how many children a message has, thus avoiding the multiple calls to update the record.
Pregenerate the forum content. You could further develop the trigger so that after it creates the cache table, it transforms the table's contents into an HTML file that would serve as the forum. Figure 3 shows a finished forum screen. This approach would save the Web server the overhead of dynamically generating the file, and it would save the database server an open connection and the overhead of querying the cache table.
Recursion can be a powerful tool for dealing with hierarchical data. With the new LOCAL keyword and a little cleverness, you can leverage recursion in SQL Server 7.0 with terrific results.
About the Author
You May Also Like