Hashing for Performance
By organizing your data, hashing helps SQL Server process JOIN queries more efficiently without resorting to sorting.
March 26, 2002
A little organization goes a long way
In "Indexes and Joins," March 2002, InstantDoc ID 23731, I explained two of the techniques that SQL Server can use to process join operations. Nested-loop joins, probably the most common type of join, use the algorithm that most people think about when writing join operations. That is, for each qualifying row in one table, SQL Server will find all the rows in another table that match that row on the join column (or columns). Merge joins are a particularly efficient join method that requires only one pass through each input, but SQL Server can use it only when both data inputs are already sorted by the join column.
Now let's look at the third algorithm—hashing—that SQL Server can use to process your join operations. Remember that in most cases, letting the SQL Server query optimizer determine the type of join to use is the best practice. But when you're familiar with the ways SQL Server can process joins, you can understand what you're seeing when you examine a query plan.
Hash Joins
Hash joins, like merge joins, were introduced in SQL Server 7.0. SQL Server can use this join technique when no useful index exists on the join column in either table. (This lack of indexes also implies that the data isn't in a useful sorted order, because you need an index on a table to ensure that it's already internally sorted.) Hashing, a well-understood algorithm in computer science, is commonly used for searching algorithms on many kinds of systems. I present a simplified definition of hashing in the following paragraphs.
In "Indexes and Joins," I showed that if the data is already sorted, a merge join is a highly efficient method for finding matching rows. If the data has no predefined order and no set limit on the number of times a value can occur, any search operation has to look at every row—which, for a huge table, can be very time-consuming. I also demonstrated that for a small input, SQL Server might decide to sort the input before the join so that it could then perform a merge operation. But for a large table, what are the options? Sorting millions of rows before joining could end up taking more time than a brute-force nested-loop join on completely unordered rows. Hashing provides a middle ground. It lets SQL Server organize the data into categories to make finding matching rows more efficient, but this organization isn't a complete sort. Hashing lets you determine whether a particular data item matches an existing value by dividing the existing data into groups based on some property. You can put all the data with the same value for that property into a hash bucket, which is like a temporary table. Then, to see if a new value has a match in the data already examined, you simply look in the bucket for the right value.
For example, suppose you need to keep all the wrenches in your workshop organized so that you can find the right size and type when you need it. You can organize them by their size property and put all half-inch wrenches (of any type) together, all 10mm wrenches together, and so on. When you need to find a 15mm box wrench, you simply go to the 15mm drawer to find one (or to see whether the last person who borrowed it has returned it). You don't have to look through all the wrenches, only the 15mm ones, which greatly reduces your searching time. Alternatively, you can organize your wrenches by type and have all the box wrenches in one drawer, all the socket wrenches in another drawer, and so on. So although your wrenches aren't stored in a completely sorted sequence by both type and size, just putting like wrenches in the same drawer (or bucket) lets you limit your search to that one drawer. Some organization is better than none.
Hashing data in a table isn't quite so straightforward. You need a property by which to divide the data into sets of manageable size with a relatively equal membership. For integers, the hashing property might be something as simple as applying the modulo operator to each existing value. If you decide to hash on the operation modulo 13, every value is put in a hash bucket based on its remainder after dividing by 13. You need 13 buckets because the possible remainders are the integers 0 through 12. In real systems, the hash functions are substantially more complex and the number of buckets can be in the tens of thousands or more; coming up with a good hash function for an application is an art as well as a science. When SQL Server uses hashing to join two inputs, SQL Server uses one input—the build input—to build the hash buckets. Then, one row at a time, it inspects the second input—the probe input—and tries to find a match in the appropriate existing bucket. If it finds a match, SQL Server returns the required columns from the probe input row and the matching build input row.
Just as with merge join, hash join requires only one pass through each table. If you're used to measuring query performance by looking at the logical I/O required, you might be surprised. One pass through each table requires very little I/O, but that doesn't mean queries that use hash join are as fast as queries that use merge join. You need to consider the computational effort of building the hash buckets from the build input as well as the memory required to store the hash buckets. SQL Server generally chooses hash join as the join strategy for queries that have no useful indexes on either join column.
Let's look at an example of when SQL Server chooses hashing. The code in Listing 1 makes a copy of the Orders and Order Details tables from the Northwind database but doesn't build any indexes on the tables. The plan for the SELECT query in Listing 2 shows that SQL Server performs a table scan on each table and uses a hash join to join the tables together. If you turn on STATISTICS IO and execute Listing 2's query, you'll see that the number of reads isn't high. The output shows only 21 reads for the o2 table and 10 reads for the od2 table, which correspond to the number of pages in each table. SQL Server reads each page only once, and for large tables, the number of reads might be very large. However, most of the work for processing a hash join occurs in building the hash buckets, and the cost of executing the query is high because of memory resources required to store the hash buckets.
Although using hash joins is a big improvement over using nested-loop joins without any good indexes, Microsoft intended hash joins primarily for use in ad hoc queries submitted while an end user is running an application. If you're testing an application and you find that certain queries always use hash joins, take that as a hint that you need to do more tuning. You might consider building indexes on the join columns in one or both of the tables, for example. Hash joins are good for times when you can't tune adequately because the actual queries are unknown during application development.
Finding the Best Indexes
Now that we've looked at the different join algorithms and discussed how indexes can help the performance of JOIN queries, you might wonder whether a particular method exists for figuring out which indexes to put on your tables for optimum join performance. Unfortunately, I haven't found one. Knowing how SQL Server processes joins internally can help you get started choosing acceptable indexes, or you can use the Index Tuning Wizard, but there's no substitute for testing. Try out as many different combinations of possibly useful indexes on the tables being joined as you can.
For example, you can start with an index on your primary keys (which you should already have if you defined your primary keys as a constraint on the table). You'll have to decide whether to make the index on the primary key column (or columns) clustered or nonclustered. I won't discuss that topic in detail now, but you can try each one and compare performance. Sometimes a clustered index performs better, and sometimes a nonclustered index does. Before you make the final decision about whether the primary key should have a clustered or a nonclustered index, you need to test all the possible queries that will use the table, including all your data-modification operations. You'll probably also want to consider indexes on your foreign key columns and columns used for search arguments, grouping, and sorting.
As you can see, configuring indexes to get the best performance is a complex task. Next month, I'll look at some specific examples that show how slight modifications in the choice of indexes can make a big difference in the performance of JOIN queries.
About the Author
You May Also Like