Survive the (Relational) Divide
Sometimes a hybrid solution is the answer
July 23, 2003
Relational division is a useful T-SQL programmer's tool for solving problems users encounter as they try to retrieve information from their database tables. In my previous two columns, I discussed different techniques for handling relational division, which involves matching sets that share some or all members. In "Set Members and Relationships," June 2003, InstantDoc ID 38515, I discussed solutions based on aggregations. In "A Different Setup," July 2003, InstantDoc ID 38812, I reviewed solutions based on correlated subqueries. This month, I present a relational-division puzzle, invite you to solve it by applying those earlier techniques, then show you an additional solution and some variations.
The Puzzle
In a fictitious training company called Luminata, the training staff stores information about courses and course materials in a SQL Server database. A table called Courses holds a row for each course that contains course attributes such as the course ID, title, and duration. Information about materials such as student textbooks and software CD-ROMs resides in a table called Materials, which contains a row of identifiers such as material ID, material name, supplier ID, and shipper ID for each material. A table called CourseMaterials tracks the specific materials required for each course in columns called materialid and courseid. Note that each material can appear in Materials only once but can appear in CourseMaterials multiple times because the same material might be required for multiple courses. Run the script that Listing 1 shows to create the Courses, Materials, and CourseMaterials tables and populate them with sample data.
Although courses usually require their own unique set of course materials, on rare occasions at Luminata, different courses share the same set of materials. The training staff wants to reduce the maintenance overhead involved with ordering and organizing class materials, so it has devised a streamlined way to identify courses that share the same materials: The staff assigns every course a minimum ID. For all courses that require unique material sets, the course ID is also the minimum ID. However, for courses that share the same course materials, the minimum ID assigned is the lowest course ID among them. For example, suppose courses with IDs 2 and 3 both require materials 1, 5, and 6. The minimum course IDs that identify the set of required materials would be 2 for both courses.
Here's your puzzle: Write code that for each course returns an ID and a minimum ID from the courses that share the same course materials. Before looking at my solutions, try to solve the puzzle on your own by using the aggregations and correlated subqueries techniques I showed you in my previous two articles. If you use Listing 1's sample data for your solution, you should obtain the results that Figure 1 shows. Courses 2 and 3 share the same set of course materials (i.e., the set {1, 5, 6}), so the staff assigns 2 as the minimum course ID for both courses.
Pure Set-Based Solutions
Now, compare your solutions with a couple of pure set-based solutions that I devised by using the aggregations and correlated subqueries techniques. My solutions are based on the following pseudo query:
SELECT courseid, COALESCE( (SELECT MIN(courseid) FROM Courses AS C2 WHERE C2.courseid < C1.courseid AND ), courseid) AS mincourseidFROM Courses AS C1
For each row in the instance of Courses called C1, a subquery calculates the minimum course ID from the courses that share the same set of materials in the instance called C2. Because the minimum course ID is required, the subquery examines only courses in C2 that carry a lower course ID than they have in C1. When the course in C1 is already the minimum course ID, the subquery returns NULL; in that case, the COALESCE() function returns the course ID from C1. This pseudo code is fairly straightforward; the tricky part is to write the expression that evaluates to true when the sets are equal.
Using aggregations. The first solution, which Listing 2 shows, is based on aggregations. First, you perform a UNION ALL between the set of materials belonging to the course in C1 and the set of materials belonging to the course in C2. Then, you group the result by materialid and see whether any of the groups (materials) contains only one member. A material appears once in the UNION ALL result only if it exists in one course and not in the other; otherwise, the material would appear twice. The NOT EXISTS() predicate verifies that no material appears only once in the result of the UNION ALL between C1's and C2's materials.
Using correlated subqueries. Listing 3 shows the second form of the expression, which is based only on correlated subqueries. The expression uses two nested NOT EXISTS() predicates. One predicate verifies that no material exists in C1 that doesn't exist in C2, meaning that the set of materials in C1 is a subset of the set of materials in C2. The other predicate verifies that no material exists in C2 that doesn't exist in C1, meaning that the set of materials in C2 is a subset of the set of materials in C1. If both predicates are true, C1 and C2 have exactly the same set of materials.
Congratulations if you successfully devised pure set-based solutions to the puzzle. Should you use these solutions in all cases when you want to use relational division to solve a problem? The performance of the pure set-based solutions is reasonable (less than a second) when you work with a small number of courses—say, 100 or fewer—and each course requires only about a half dozen materials. However, when your target data consists of many courses—say, more than 1000—performance degrades badly, and you might need to look for a faster solution. (For details about the performance of various solutions, see the sidebar "Benchmarking Relational-Division Solutions," page 20.) Let's look at alternative solutions for working with progressively larger populations.
Hybrid Set-Based and Iterative Solutions
Think how easy it would be to compare the set of course materials in one course to the set of course materials in another course if each set were represented as a binary string that contained a sorted array of the course material IDs. Even though the T-SQL language doesn't support the concept of arrays, you can use T-SQL code to mimic them in several ways. One approach is to concatenate fixed-length binary values representing the values you want an array to represent.
Using arrays. The first step in concatenating fixed-length binary values is to write a user-defined function (UDF)—the iterative part of the solution. The UDF accepts a course ID as an argument and returns a varbinary string that contains a sorted array of the material IDs associated with that course. Because materialid is of an integer data type, this solution limits you to 2000 material IDs per course, which works fine for this puzzle because you probably need fewer than 100 material IDs. Listing 4 shows you how to implement such a UDF, which I've named dbo.fn_generatearr().
The dbo.fn_generatearr() UDF uses a special form of an assignment SELECT operation, which might assign variables more than once during its run. The query (the set-based part of the solution) scans all rows in CourseMaterials belonging to the course ID you provided as an argument and assigns values to variables for each row that the query is accessing. By invoking another UDF called dbo.fn_insertpos(), which I describe and create in a moment, the query assigns the target position of the current material ID in the intermediate state of the array you've built to a variable called @insertpos. The query then inserts the binary representation of the current material ID into its appropriate position within the array inside of @arr by concatenating the parts before the target position, the material ID, and the part after the target position. For example, the binary string generated for a course that requires materials 1, 5, and 6 would be 0x0000000100000005000000000006 in hex notation.
Run the code that Listing 5 shows to create the dbo.fn_insertpos() UDF. This UDF accepts as arguments a material ID and a binary string that represents an intermediate state of an array; it returns the sorted position into which the query should insert the material ID.
You can use any sort algorithm to implement the function. For example, you can choose an algorithm that instructs the UDF to iterate through the array values from left to right, stopping when the new value is greater than the value to the left of the current insert position and less than the value to the right. I implemented a binary algorithm that requires far fewer iterations than a simple scan of the array from left to right. The UDF positions the @insertpos variable in the middle of the array and initializes the @jumplen variable to half the number of elements in the array, rounded up.
The dbo.fn_insertpos() code then enters a loop in which it divides @jumplen by two in each iteration and compares the new material ID to the values to the left and right of the current insert position. If the new value is greater than the value to the right of the current insert position, the new insert position moves to the right in a @jumplen interval. If the new value is less than the value to the left of the current insert position, the new insert position moves to the left in a @jumplen interval. If neither of these conditions is true, the new value is already in the correct position for the function to return the value in @insertpos. The loop terminates when @jumplen is no longer greater than 1.
The loop performs one comparison less than the required number of comparisons so that it can use simple comparison logic to find the final position of the new value. Within the loop, an intermediate insert position is assumed to lie between two values and not at an edge of the array. After the loop ends, the function's code determines whether @insertpos should move another step to the left or to the right and finally returns the current value in @insertpos.
After you've written and implemented a UDF that generates an array from the material IDs of a given course, the solution is fairly simple. First, you create a view that for each course returns an array of materials:
CREATE VIEW VArrASSELECT courseid, dbo.fn_generatearr(courseid) AS arrFROM Courses
Then, you use the following query to match courses with the same array, grouping the results by course ID from one instance of the view and calculating the minimum course ID from the other:
SELECT A.courseid, MIN(B.courseid) AS mincourseidFROM VArr AS A JOIN VArr AS B ON A.arr = B.arrGROUP BY A.Courseid
Unlike the set-based solutions I presented earlier, which didn't scale, the VArr-view solution scales well. In my benchmark tests, the VArr solution provided reasonable performance (less then 7 seconds) for up to 3000 courses, each requiring about half a dozen materials. However, performance declined when I added more than 3000 courses, which seemed unsatisfactory to me. To try to speed up the query, I tested another array-based approach.
Using arrays and checksum values. The first modification I tried was to use the CHECKSUM_AGG() function, which returns an integer checksum value for a set of values in a group. I created a view that for each course returns an array and a checksum value for the set of materials:
CREATE VIEW VCSArrASSELECT courseid, CHECKSUM_AGG(materialid) AS CS, dbo.fn_generatearr(courseid) AS arrFROM CourseMaterialsGROUP BY courseid
Hoping that the checksum values would help the query optimizer partition the data, I also used both values in the JOIN condition:
SELECT A.courseid, MIN(B.courseid) AS mincourseidFROM VCSArr AS A JOIN VCSArr AS B ON A.cs = B.cs AND A.arr = B.arrGROUP BY A.courseid
When the query processed up to a few thousand courses, the VCSArr-view solution performed in half the time of the VArr-view solution, but when the query attempted to process 10,000 courses or more, both solutions performed almost the same. So, the search for a faster solution continued.
Using arrays and temporary tables. I knew that temporary tables can help store intermediate results of calculations, which you can index and access several times. So, I decided to see how well temporary tables would work with an array. Similar to the VArr view, the following code stores course IDs and arrays of their material IDs in a temporary table called #TArr and creates an index on arr and courseid:
SELECT courseid, dbo.fn_generatearr(courseid) AS arrINTO #TArrFROM CoursesCREATE UNIQUE CLUSTERED INDEX idx_uc_arr_courseid ON #TArr(arr, courseid)
Note that an index row can't exceed 900 bytes, so arr is limited to 896 bytes because the courseid column is also included in the index. This restriction means that you're limited to 249 materials per course. If you need more materials, you can use the result of a CHECKSUM() function on the array in an index instead of using the array itself. As in the VArr-view solution, you use a query to match courses that have the same array, as follows:
SELECT A.courseid, MIN(B.courseid) AS mincourseidFROM #TArr AS A JOIN #TArr AS B ON A.arr = B.arrGROUP BY A.courseid
This solution performs well and scales well. In my tests, it ran in less than 6 seconds for up to 10,000 courses and in about a minute for 100,000 courses (about 550,000 rows in CourseMaterials). The following code generates the temporary table, including for each course the array and a checksum of the course materials:
SELECT courseid, CHECKSUM_AGG(materialid) AS cs, dbo.fn_generatearr(courseid) AS arrINTO #TCSArrFROM CourseMaterialsGROUP BY courseid
If the array length exceeds 892 bytes (900 bytes minus 8 bytes for the checksum and courseid values), exclude it from the index; if the array is less than 892 bytes long, include it:
CREATE UNIQUE CLUSTERED INDEXidx_uc_cs_arr_courseid ON #TCSArr(cs, arr, courseid)
Finally, the following query (which is similar to the query against the VCSArr view) returns the desired results:
SELECT A.courseid, MIN(B.courseid) AS mincourseidFROM #TCSArr AS A JOIN #TCSArr AS B ON A.cs = B.csAND A.arr = B.arrGROUP BY A.courseid
Note that when arr fits in the index, this solution provides no performance improvement over the VArr-view solution.
Array or Not?
Each successive array-based solution to the relational-division puzzle performed better than the previous one. Tuning is an ongoing process: When you're not satisfied with your solution's performance, you can probably find a faster way to achieve the desired results. Pure set-based solutions should be your first choice because they're usually short, portable, and easy to maintain. But hybrid arrays that combine iterative and set-based logic might perform better.
About the Author
You May Also Like