Auxiliary Table-of-Numbers Tricks
Use a table-valued function to generate a virtual table of integers
December 17, 2006
Using an auxiliary table of numbers (call it Nums) can be helpful in many T-SQL querying problems. The table is very simple—it’s just one integer column (call it n) containing a sequence of integers from 1 and on (as many as you need). You can use an auxiliary table of numbers to perform tasks such as splitting strings containing arrays of elements, generating histograms, unpivoting data, generating copies, and generating test data. Since this type of table can be so useful, you’d probably want to create one in your database and use it in your queries. However, suppose that you aren’t allowed to create new tables in your database. Or even if you are allowed, you’re looking for a faster way to obtain a table of numbers than querying a real table. So here’s the challenge, which is the focus of this article: Write a table-valued function (call it fn_nums) that accepts as an input parameter the number of rows you want to produce (call it @ max) and returns a table of integers in the range 1 through @max.
The function should return the result as quickly as possible. Here’s an example of how you’d query the function to return 10 numbers:
SELECT n FROM dbo.fn_nums(10)AS Nums;
(Some of the code in the article and listings wraps to multiple lines because of space constraints.) This query should produce the output that Table 1 shows. To test performance, run the code with the input 10000000. Make sure that when you test performance, you select the SQL Server Management Studio (SSMS) Discard results after execution check box—to do so, click Tools, Options, Query Results, SQL Server, Results to Grid (or Text). Doing so lets you test server-processing time, excluding the time it takes to generate the output. Try to come up with an implementation that runs at around 10 seconds or less for 10 million rows. See if you can work out your own solution before looking at the three solutions I provide. Good luck!
Solution 1: Simple Recursive Query
My first solution is based on a simple recursive query. Listing 1 shows the solution code, including the function’s definition. The code defines a recursive common table expression (CTE) called Nums. The anchor member returns a single row with the value 1 in column n. The recursive member returns a row with the value n + 1 from the previous row if n is smaller than the input parameter @max; otherwise, it returns an empty set (recursion termination check). In short, the recursive member stops as soon as it generates the requested number of rows. The outer query returns the unified set combining the result set returned by the anchor member and all the result sets returned by the recursive member. The recursive member is invoked @max times (the last returning an empty set).
To test the function, run the following query:
SELECT n FROM dbo.fn_nums(10) AS Nums;
You’ll get the output that Table 1 shows. Before you try it with 10 million rows, though, I’d like to point out a logical problem with this implementation. To prevent infinite invocations of a recursive member in a recursive CTE, SQL Server limits the number of invocations of the recursive member to 100 by default. The 101st invocation would cause the query to fail at runtime and generate an error. So if you try running this code:
SELECT n FROM dbo.fn_nums(1000)AS Nums;
You’ll get the error message Msg 530, Level 16, State 1, Line 1, The statement terminated. The maximum recursion 100 has been exhausted before statement completion. SQL Server lets you change the default limit of 100 by specifying a MAXRECURSION hint. Setting MAXRECURSION to 0 means no limit. However, if you try incorporating the MAXRECURSION hint in the query within the function’s definition, as Listing 2 shows, you’ll get the following error: Msg 156, Level 15, State 1, Procedure fn_nums, Line 10, Incorrect syntax near the keyword 'OPTION'.
SQL Server doesn’t let you incorporate query hints within a function’s definition. You’re left with no choice but to specify the hint in the outer query against the function, like this:
SELECT n FROM dbo.fn_nums(1000) AS Nums OPTION (MAXRECURSION 0);
Of course, doing so is undesirable because you have to know how the function is implemented (i.e., that it uses a recursive query) to realize that you need to specify the hint in the outer query, so you lose some of the important features that encapsulation is supposed to give you. But that’s the reality.
To test the performance of this implementation, first make sure that you select the Discard results after execution check box in SSMS. Then in a new query window, run the following code:
SELECT n FROM dbo.fn_nums(10000000) AS NumsOPTION (MAXRECURSION 0);
In my test system, this query ran for 253 seconds. That’s very slow, not to mention that you also have to specify the hint in the outer query. The reason for the bad performance is mainly the overhead involved with each individual invocation of the recursive member (e.g., spooling and indexing of the intermediate sets). In short, this solution is far from satisfactory, so let’s explore further alternatives.
Solution 2: Sophisticated Recursive Query
My next solution is more sophisticated than the previous one and is also based on a recursive query. Listing 3 shows the solution code, including the function’s definition. The code defines a recursive CTE called SQR that generates as many rows as the ceiling of the square root of @max. Then a second CTE called Product produces a Cartesian product (cross join) of two instances of SQR. This product will have at least @max rows—or a bit more, since I used the expression CEILING(SQRT(@ max)). Without calculating the ceiling, you might get fewer rows in the product. So it’s better to return more rows than you might need, then use the TOP option in the outer query to obtain exactly the number of rows you need.
The outer query uses the ROW_ NUMBER function to actually generate the sequence of numbers. Notice that the ROW_NUMBER function uses a column called const in the ORDER BY clause. This column is nothing more than a constant that I produced in the SQR CTE. The optimizer is smart enough to figure out that it’s a constant, thus avoiding an explicit sort operation altogether. This solution is very efficient.
Run the following code to test the function (again, with results discarded):
SELECT n FROM dbo.fn_nums(10000000) AS NumsOPTION (MAXRECURSION 0);
The query ran for eight seconds in my test system. That’s very fast—however, you need to specify the MAXRECURSION hint here as well.
Solution 3: Nonrecursive Query
My last solution doesn’t rely on recursive CTEs; therefore, its benefit is that you don’t need to specify any hints in the query against the function. Listing 4 shows the solution code.
The code defines a CTE called C0, which returns two rows with a single column called const that holds a constant value. Here, the CTE’s purpose is simply to generate two rows; it doesn’t really matter what those rows contain. Then a series of CTEs (C1, C2, C3, C4, C5, C6) double the rows of the CTE defined before them by performing a cross join between two instances of the previous CTE. The number of rows you can potentially get via a CTE Cn is 2^(2^n). For example, having six CTEs besides C0 can potentially yield 18, 446,744,073,709,552,000 rows. Having five CTEs besides C0 can potentially yield about four billion rows. In other words, six CTEs would be sufficient in practical terms for any number of rows that you might need.
The outer query then uses the TOP option to return the number of requested rows (@max), and a ROW_NUMBER function actually generates the sequence of integers (again, by using a column holding a constant in the ORDER BY clause). The nice thing about the optimization of this code is that the optimizer is smart enough to figure that it needs to generate only @max rows. It doesn’t bother to generate any more rows, although logically it might seem as if the code is doing so. Therefore, this solution is very fast.
To test the function, run the following code (again, with results discarded):
SELECT n FROM dbo.fn_nums(10000000) AS Nums;
This code runs for nine seconds for 10 million rows. It’s slightly slower than the previous solution, but its important advantage is that you don’t need to specify any hints in the query against the function.
Fast and encapsulated
I’ve examined three different solutions for generating a virtual auxiliary table of numbers via a table-valued function. The first solution uses a simple recursive query that iterates once per row. This solution is the slowest and requires you to specify a hint in the query against the function. The second solution produces a Cartesian product between two instances of a recursive CTE containing as many rows as the square root of @max. This solution is very efficient, but it also requires you to specify a hint in the query against the function. The third solution uses a series of CTEs, each doubling the number of rows from the previous CTE. This solution doesn’t rely on recursive CTEs—rather, nonrecursive ones. It’s both very fast and has the advantage that you don’t need to specify a hint in the query against the function, so we can safely announce it as the winner!
About the Author
You May Also Like