Set-Operation Alternatives
How to mimic INTERSECT and EXCEPT in T-SQL
November 19, 2003
ANSI SQL defines three operations—UNION, INTERSECT, and EXCEPT—that you can apply to two sets. These operations let you combine rows from different sets, locate which rows exist in both input sets, or find which rows exist in one set but not the other. Each of these operations has many uses and practical implementations. For example, you can use UNION to combine sales data from partitioned tables, INTERSECT to find which customers are also employees, and EXCEPT to find which customers aren't employees. However, T-SQL supports only the UNION operation. Fortunately, you have alternatives in T-SQL that you can use to provide the functionality of the missing operations.
Set Operations
In ANSI SQL terms, the database engine performs a set operation on the sets that two queries return, and the set operation results in a new set. Both queries must have the same number of columns. Matching columns in the two sets must have the same data types, or the column with the lower data-type precedence must be implicitly convertible to the column with the higher data-type precedence. (For example, a character string might contain a number that's convertible to an integer.) The number of columns in the result set is the same as the number of columns in each input set, and the names of the result columns are the same as those in the first set. ANSI set operations take the following form:
Query1{UNION | INTERSECT | EXCEPT} [ALL | DISTINCT]Query2
Note that each of the operations has two variations: ALL and DISTINCT, with DISTINCT being the default if you don't specify either operation. I discuss the purpose of each variation in a moment when I look at each set operation. As I mentioned, T-SQL supports only the UNION operation. You can't explicitly specify the DISTINCT keyword in T-SQL for any of the three operations.
Before delving into the operations, I should mention a couple of their characteristics. Set operations work on entire rows; that is, they compare all columns of each input row—unlike JOIN operations, for example, which match rows based on a partial set of columns. Also, set operations treat two NULLs as duplicates, as the GROUP BY and ORDER BY operations do.
My examples use two sets, U and V, represented by the U and V tables, which you create and populate by running the code that Listing 1 shows. I created the tables with two columns each to demonstrate that the operations work on whole rows. However, when listing set members throughout the article, I include only col2's values for brevity. So instead of specifying the tuples U = {('R1', 1), ('R2', 2), ('R2', 2), ('R2', 2), ('R2', 2), (NULL, NULL)}, V = {('R2', 2), ('R2', 2), ('R3', 3), ('R3', 3),('R4', 4), (NULL, NULL)}, I say: U = {1, 2, 2, 2, 2, NULL}, V = {2, 2, 3, 3, 4, NULL}.
Now, let's explore the set operations and their variations and see what alternatives T-SQL provides for the unsupported INTERSECT and EXCEPT operations. You're probably familiar with UNION and UNION ALL because they're common in T-SQL, but I also briefly discuss those for the sake of completeness.
UNION
UNION returns a set combining both inputs and performs a DISTINCT operation on the combined result. Thus, the result of U UNION V is {1, 2, 3, 4, NULL}. The following T-SQL code performs a UNION between U and V:
SELECT * FROM U UNION SELECT * FROM V
UNION ALL is similar to UNION, except that it returns the combined set of both inputs without performing the DISTINCT operation. The result of U UNION ALL V is {1, 2, 2, 2, 2, 2, 2, 3, 3, 4, NULL, NULL}. You use the following T-SQL code to perform a UNION ALL between U and V:
SELECT * FROM U UNION ALL SELECT * FROM V
If U and V are disjoint—meaning that they have no common rows—and neither has duplicates, UNION and UNION ALL return the same result. However, you should always choose UNION ALL when you're trying to combine disjoint sets because SQL Server will try to eliminate duplicates from the result even if none exist; UNION ALL prevents SQL Server from doing the extra work of trying to remove nonexistent duplicates.
Note that when you use UNION or UNION ALL, you can't specify an ORDER BY clause for each of the input queries; you can specify only one ORDER BY clause at the end of the query. The query then applies that ORDER BY to the result of the UNION ALL operation.
But you might want to sort one input after the other or maybe have each input sorted in a different secondary order. For example, you could sort U's rows before V's rows, sort rows within U by col1, and sort within V by col2 in descending order. You can get this varied sorting by using a derived table and the CASE expression, as Listing 2 shows.
INTERSECT
INTERSECT returns only rows that exist in both inputs and eliminates duplicates. The result of U INTERSECT V is {2, NULL}. For a T-SQL alternative to the unsupported INTERSECT option, most programmers use an INNER JOIN based on all columns along with a DISTINCT operation, as the following query shows:
SELECT DISTINCT U.col1, U.col2 FROM U JOIN V ON U.col1 = V.col1 AND U.col2 = V.col2
However, when NULLs exist in both inputs, this query returns incorrect results. In this case, it returns {2}.
An INNER JOIN operation eliminates NULLs, but the INTERSECT operation should treat a NULL the same way it treats any other value. You can use the query that Listing 3 shows to achieve the correct results, accounting for any NULLs and duplicates that appear in either input. The query eliminates duplicates from each of the inputs and performs a UNION ALL between the results, generating the derived table D1. The query then groups the rows from D1 by all columns and filters out all rows that appear only once.
INTERSECT ALL adds a small twist to INTERSECT. It also returns rows that appear in both inputs, but if a row exists x times in U and y times in V, it should appear MIN(x,y)—that is, the lesser of x and y— times in the result. The expected result is {2, 2, NULL}. Note that 2 appears twice in the result because it appears twice in U and four times in V.
To find a T-SQL substitute for the INTERSECT ALL operation, first use the code that Listing 4 shows to create an auxiliary table of consecutive numbers in the range 1 through the maximum number of duplicates you expect in one table. I populated Nums with values from 1 through 1000, but if you expect more duplicates, you should use a larger maximum value.
Listing 5 shows the rest of the solution to the INTERSECT ALL substitution. This time, within the queries forming derived table D1, I eliminated duplicates by using GROUP BY instead of DISTINCT. Using GROUP BY lets the code calculate the number of duplicates within each input (the result column is called cnt).
The code in Listing 5 groups the rows from D1 by col1 and col2 and returns the minimum count of duplicates between the inputs (in the result column mincnt), producing the derived table D2. Finally, the code joins D2 to Nums, using the JOIN condition n <= mincnt. This JOIN causes each row to appear in the result as many times as the value of mincnt, which is the requirement of INTERSECT ALL.
EXCEPT
EXCEPT (sometimes called MINUS) returns rows that appear in one input but not in the other, so it also eliminates duplicates. Unlike the preceding set operations—for which U V produces the same result as V <operation> U—with EXCEPT, the order in which you specify the set names affects the results. The expected result from U EXCEPT V is {1}, and the result from V EXCEPT U is {2, 3}.
The following query shows a T-SQL way of getting the results for U EXCEPT V:
SELECT col1, col2FROM (SELECT DISTINCT 'U' AS setname, col1, col2 FROM U UNION ALL SELECT DISTINCT NULL, col1, col2 FROM V) AS D1GROUP BY col1, col2HAVING COUNT(*) = 1 AND MAX(setname) = 'U'
The derived table D1 contains distinct rows from each input and a pseudo column called setname, which contains the literal 'U' for U's rows and NULL for V's rows. The code groups the rows from D1 by col1, col2. The COUNT(*) = 1 expression in the HAVING clause ensures that the query returns only the rows that appear in one of the inputs, and MAX(setname) = 'U' ensures that it returns only the rows that appear in U.
The trick to using this code is that the MAX() function doesn't account for NULLs; it considers only the rows in D1 that contain 'U'. To get V EXCEPT U, you can revise the preceding query so that it returns NULL in setname for U's rows and 'V' for V's rows.
EXCEPT ALL adds an interesting twist to EXCEPT. For U EXCEPT ALL V, if a row appears x times in U and y times in V, it appears the greater of x minus y or 0 times in the result. The expected result from U EXCEPT ALL V is {1, 2, 2}. Because 2 appears four times in U (x=4) and 2 times in V (y=2), it appears twice (4-2) in the result. Similarly, the expected result from V EXCEPT ALL U is {3, 3, 4). The EXCEPT ALL operation eliminates the 2 from the result because the greater of (2-4) and 0 is 0. Furthermore, 3 appears twice in the result because it appears twice in V and doesn't appear in U at all.
Listing 6 shows the query that implements U EXCEPT ALL V in T-SQL. The derived table D1 has the positive count of duplicates for U and the negative count for V. D2, another derived table, summarizes U's positive cnt value and V's negative cnt value into the result column dups and makes sure that the query returns the row only if the value of dups is greater than 0. In effect, the value of dups is MAX(x-y, 0).
Finally, the code joins the derived table D2 to the Nums auxiliary table, duplicating each row as many times as the value in dups. To get V EXCEPT ALL U, you can modify Listing 6's query by swapping the positive and negative signs of the counts in U and V.
T-SQL Alternatives
Even when your options are limited, you can usually find T-SQL alternatives to the ANSI features that T-SQL doesn't support. Always be aware of the subtleties of the ANSI requirements, and make sure that your solution considers all possibilities. NULLs have always been troublemakers, partly because ANSI SQL treats NULLs inconsistently in different elements of the query. Knowing exactly how each query element works and how it treats NULLs is important so that you can provide accurate solutions.
INTERSECT and EXCEPT are handy operators, and I hope that someday Microsoft will add T-SQL support for them. But until then, you can use the tricks I've shown here as alternatives.
About the Author
You May Also Like