Set Operations

SQL Server 2005 Tips

Itzik Ben-Gan

March 27, 2008

9 Min Read
ITPro Today logo in a gray background | ITPro Today



Set operations are fundamental operations in set theory that let you unify sets, figure out the intersection of sets, and determine which elements appear in one set but not the other. SQL Server 2005 supports three set operations, called UNION, INTERSECT, and EXCEPT. A set operation operates on the result sets of two queries (call them query1 and query2), comparing complete rows. The general form of a set operation is

 query1  set operation: {UNION | INTERSECT |  EXCEPT}  query2  [ORDER BY ...]

The set operation you use and the result of the row comparisons determines whether a source row appears in the result of the set operation. UNION will return a row if it appears in either input set, INTERSECT will return a row if it appears in both input sets, and EXCEPT will return a row if it appears in the first set but not the second.

In this article I provide a few set operation tips that you might not be aware of, including circumventing unsupported logical phases, INTERSECT ALL and EXCEPT ALL, and treatment of NULLs. I assume you’re already familiar with the fundamentals of set operations; for more information about set operations, see the Learning Path, page 28.

Circumventing Unsupported Logical Phases


As you can see from the general form of a set operation that I provided, only one logical query processing phase is allowed on the result of a set operation—ORDER BY. The syntax of a set operation doesn’t allow applying any other logical phases to the result (e.g., table operators, WHERE, GROUP BY, HAVING, TOP). However, you can use a simple trick to get around this restriction—define a table expression (e.g., common table expression—CTE—or derived table) based on the query with the set operation, then have the outer query against the table expression apply any logical phases/query clauses that you need. For example, if you run the query that Web Listing 1 shows in the Northwind database, it returns customer-employee pairs that had order activity in both 1996 and 1997.

Suppose you need to group the result by employee, and count the number of customers each employee handled. This can be achieved easily with a table expression such as the one that Web Listing 2 shows. In a similar manner, you can apply any other logical phase to the table expression C.

Going back to the general form of the set operation, query1 and query2 aren’t allowed to have ORDER BY clauses because by definition a set operation operates on unordered sets, and an ORDER BY clause would impose order. This restriction applies even when you use the TOP option. Suppose you need to apply a set operation to the result sets of two queries defined with a TOP option and an ORDER BY clause. To accomplish this task, you must define two table expressions based on the TOP queries, and have the set operation applied to two queries against the table expression, as Web Listing 3 shows. This query returns shipping destinations (country, region, city) that were in the top five shipping destinations in 1997 (based on number of orders) but weren’t in the top five in 1996.

INTERSECT ALL and EXCEPT ALL


ANSI SQL supports two versions of each set operation— a DISTINCT version and an ALL version. SQL Server 2005 implements both versions of the UNION set operation, but only the DISTINCT version of INTERSECT and EXCEPT. In this section I explain what the two versions mean with each set operation, and I present alternatives you can use for the missing ALL versions.

The DISTINCT version of INTERSECT means that as long as a row appears at least once in each set, the row will appear once in the output. For example, the following query returns one occurrence of each location (country, region, city) that has at least one customer and one employee:

 SELECT Country, Region, City FROM  dbo.Customers  INTERSECT  SELECT Country, Region, City FROM  dbo.Employees;

This query returns three rows in the output, where each qualifying location appears only once.

ANSI SQL also supports an INTERSECT ALL set operation that wasn’t implemented in SQL Server 2005. INTERSECT ALL returns as many occurrences of a row as the minimum number of times that it appears in any set. If row R appears m times in the result set of query1, and n times in the result set of query2, R will appear a minimum number of times (m, n) in the result of the set operation. For example, if R appears three times in one set and five times in the other, R will appear three times in the result of the set operation. You typically need to use INTERSECT ALL if you want to pair each instance of a row from one set with a different instance of the row in the other. Using our last query as an example, say that each employee is supposed to handle only one customer from the same location, and you need to check how many customers can be matched with an employee from the same location. In this case, you’d need INTERSECT ALL, as in the following code (don’t run this query):

 SELECT Country, Region, City FROM  dbo.Customers  INTERSECT ALL  SELECT Country, Region, City FROM  dbo.Employees;

Unfortunately, SQL Server 2005 doesn’t support this standard INTERSECT ALL syntax. A simple alternative is to assign row numbers to number the occurrences of each location, and issue the INTERSECT operation between the two sets that include the row numbers, as Web Listing 4 shows.

As an example, the location UK, NULL, London appears six times in Customers and four times in Employees. The query against Customers assigns row numbers 1 through 6 to the occurrences of this location, and the query against Employees assigns row numbers 1 through 4 to the occurrences of this location. Four occurrences of this location (with row numbers 1 through 4) intersect; therefore this location will appear four times in the output.

Similarly, ANSI SQL defines a set operation called EXCEPT ALL that isn’t implemented in SQL Server 2005. If row R appears m times in the result of query1 and n times in the result of query2, R will appear a maximum number of times (m-n, 0) in the result of query1 EXCEPT ALL query2. For example, a location L that appears six times in the result of query1 and four times in the result of query2 will appear two times in the result of query1 EXCEPT ALL query2, and zero times in the result of query2 EXCEPT ALL query1.

You typically need EXCEPT ALL when you want to figure out how many instances of a row from one set can’t be paired with a different instance of the row in the other set. Using our last query as an example, say that each employee is supposed to handle only one customer from the same location, and you need to check how many customers can’t be matched with an employee from the same location.

To achieve the logical equivalent of the missing EXCEPT ALL operation, you can use the same trick with the row numbers I showed earlier, only this time you use the EXCEPT operation instead of INTERSECT, as Web Listing 5 shows.

As I mentioned earlier, the location UK, NULL, London appears six times in Customers and four times in Employees. The occurrences of this location are numbered 1 through 6 in the first set and 1 through 4 in the second. Only occurrences 5 and 6 from the first set don’t find a match in the second; therefore in the 85 output rows you’ll find only two occurrences of UK, NULL, London.

Continued on page 2

Treatment of NULLs


One of the key benefits of using set operations over alternative methods is that a set operation treats NULLs as equal when comparing rows. Take the following query as an example:

 SELECT Country, Region, City FROM  dbo.Customers  INTERSECT  SELECT Country, Region, City FROM  dbo.Employees;

This query returns distinct locations where both employees and customers exist. Web Table 1 shows the outputof the query. Notice that the location UK, NULL, London appears in the output because both customers and employees exist in this location. The set operation considers the NULL region in the customer location (UK, NULL, London) as being equal to the NULL region in the employee location (UK, NULL, London).

Other language elements such as ON and WHERE filters have a different treatment when comparing two NULLs. For example, an attempt to write a solution that uses an INNER JOIN statement, as Web Listing 6 shows, would return the output shown in Web Table 2. Notice that the location UK, NULL, London doesn’t appear in the output even though it does appear in both Customers and Employees. This location was eliminated by the INNER JOIN statement. When the JOIN predicate in the ON clause compared the NULL region in both sides, the comparison yielded UNKNOWN. An ON filter (as well as WHERE and HAVING) eliminates rows for which the predicate yields UNKNOWN. To get the exact logical equivalent of the INTERSECT query, you’d need to explicitly handle NULLs as in Web Listing 7. At this point, the solution is so convoluted that you can really appreciate the brevity and simplicity of the INTERSECT set operation.

In a similar manner, the EXCEPT operation will also treat two NULLs as equal. For example, the following query returns the output that Web Table 3 shows, with the distinct locations where employees exist but not customers:

 SELECT Country, Region, City FROM  dbo.Employees  EXCEPT  SELECT Country, Region, City FROM  dbo.Customers;

Notice that the location UK, NULL, London doesn’t appear in the output because it does appear in both Employees and Customers. Now run the alternative solution in Web Listing 8, using the NOT EXISTS predicate. You’ll get the output that Web Table 4 shows.

This time the location UK, NULL, London was returned even though it appears in both Employees and Customers. When the inner query’s filter compared the NULL region in the customer row and the NULL region in the employee row, the predicate evaluated to UNKNOWN, and the row was filtered out as if London, NULL, UK doesn’t exist in the Customers table. Hence, the location was returned from the Employees table. To get the true logical equivalent of the EXCEPT query, you’d need to handle NULLs explicitly, as in Web Listing 9. But again, this alternative solution is quite convoluted, emphasizing the advantage of using the shorter, simpler, and more elegant EXCEPT version.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like