New Features for Query ProcessingNew Features for Query Processing
The query processor is one of the most important database system components that influence the management of large databases. SQL Server 7.0's focus is on new techniques for maximizing your query performance.
June 30, 1999
SQL Server 7.0 maximizes query performance
Besides ease of use and scalability, one of SQL Server 7.0's key features is that it fulfills the role of an enterprise database system. Although people have used SQL Server 6.0 and 6.5 as an enterprisewide data management system for a long time, SQL Server 7.0 shows great improvements in this area and is much more competitive in the large database/data warehouse market. The query processor is one of the most important database system components that influence the management of large databases. Microsoft has significantly improved the SQL Server 7.0 query processor in three areas: indexes, join techniques, and disk access. Also Microsoft added new features to the Query Analyzer (formerly ISQL/w). But before we describe these improvements and new features, you need to know what the query processor does.
What the Query Processor Does
The query processor has two main tasks: query optimization and query execution. Query optimization is the process of choosing the best solution for query execution. Often there are several different ways to perform a query, all leading to the same correct result. The job of the query processor is to create one or more access plans for a given query. If several possible solutions exist that give the correct query result, the query processor must select the optimal access plan for the query. For instance, if an index exists that you can use for a given query, the query optimizer chooses between a table scan and an index access. A table scan involves sequential data access, in which the system fetches each row according to its physical memory location; index access is direct access, in which the system fetches each row using additional structures built for the given index. For a small number of selected rows, an index scan is more effective than a table scan, because the index scan fetches each row directly and therefore quickly. However, the table scan is superior to the index scan for queries with a large number of selected rows, because a table scan reads pages of large memory units only once (instead of separately accessing each row that satisfies the selection criteria). The general tasks of the optimizer are maintaining statistics about the volume of data, choosing the indexes for the query, and (in the case of a query containing a join operation), deciding the appropriate table-join order.
Index Intersection and the Covering Index
The previous releases of SQL Server use indexes with one important restriction: Generally, they can use only one index per table, even if multiple criteria are present in the WHERE clause of a SELECT statement. Listing 1 shows an example in which the WHERE clause of the SELECT statement includes two predicates with columns dept_no and enter_date. The query processor in previous releases of SQL Server used only one index, even if both columns were indexed. In contrast, SQL Server 7.0 uses both existing indexes by selecting subsets of data and performing an intersection between them. This feature is called index intersection.
A covering index is a nonclustered index built on all the columns required to satisfy a query (both in the SELECT list and in the WHERE clause). A covering index is therefore an extension of index intersection and is related to index-only retrieval. If all the columns in a query are indexed, the index-only retrieval technique fetches only index values. SQL Server 6.5 already exploited this technique, which gives significant performance improvements because the index structures give all the values needed for the query. However, the earlier SQL Server query processors cannot find a covering index for a query and therefore do not apply the index-only retrieval if the indexes are separately defined.
Listing 2 shows an example with the two columns enter_date and emp_no in the SELECT list. (The column enter_date also appears in the WHERE clause.) If there are two separate indexes for the columns enter_date and emp_no, SQL Server 6.5 cannot use the index-only retrieval. The only way to force the use of this technique is to build a compound index structure on columns enter_date and emp_no in this exact order.
SQL Server 7.0 takes index-only retrieval a step further by considering the join of multiple indexes over the same table to obtain a column set that covers the given query with two separately indexed columns in the SELECT list. SQL Server 7.0 joins both existing indexes on the columns enter_date and emp_no using the Row Identifier (RID). The RID uniquely identifies a row within a table; SQL Server uses the RID to place a row lock on a row.
Additional Join Techniques
The optimizer applies different heuristic rules to improve the performance of query execution. The most important heuristic rule is that the query applies each unary operation (i.e., an operation on only one table) before any binary operation such as join or union. The reason for this rule is that the size of the input tables determines the size of the binary operation; therefore, in many cases, SQL Server determines this size by multiplying the rows of both tables.
A join operation is the most time-consuming operation in query processing because it requires a lot of disk I/O. For this reason, it is preferable to have different techniques to apply in specific cases to give optimal query execution performance. SQL Server 7.0 supports three processing techniques for joining: nested-loop join, sort-merge join, and hash join.
The nested-loop join is the only processing technique that previous releases of SQL Server fully supported. If no indexes are on the join columns, the nested-loop join works by brute force. In other words, for each row of the outer table, the optimizer retrieves and compares each row from the inner table. The nested-loop join is slow if one of the join columns has no index. Without indexes, the optimizer has to scan the outer table once and the inner table n times, where n is the number of rows of the outer table. Therefore, the optimizer generally chooses this method only if the join column of the inner table is indexed, so the inner table does not have to be scanned for each row in the outer table. (Also, the optimizer selects this technique if joined tables have a small number of rows or if the join yields a temporary result, which is then used for a subsequent computation.)
The sort-merge join provides a cost-effective alternative to constructing an index for a nested-loop join. The rows of the joined tables must be physically sorted using the values of the join column. (If the rows of the joined tables are not sorted, SQL Server 7.0 enforces the sort operation for both tables.) The optimizer then scans the tables in order of the join columns, matching the rows with the same value for the join columns. No index is necessary when the optimizer performs a sort-merge join. The sort-merge join processing technique has a high overhead if the rows from both tables are not sorted. However, this method is preferable when join columns for both tables are not indexed. For instance, the sort-merge join is especially useful when each joined table has the clustered index on the join column. (SQL Server 6.5 supports only a rudimentary form of this join technique.)
A hash join uses a technique called hashing. Hashing provides fast access to rows on certain search conditions. The search condition must be an equality condition on one column called the hash field. Usually the hash field is also a key field of the hash file, in which case it is called the hash key. The idea behind hashing is to provide a hash function, which is a function that the optimizer applies to the hash key value of a row. This function yields the physical address of the disk block that stores the record.
The optimizer typically uses the hash join when the rows of the joined tables are not sorted and the join columns are not indexed. The reason for this usage is that hash-based algorithms efficiently process large unsorted and nonindexed inputs. The optimizer hashes the rows of one table to the hash file and compares the values from the other table against the stored values. The optimizer uses the same hashing function on the join columns and the hash keys. The hash join gives the best performance if you have one small table and one large table. In that case, the optimizer computes the hash value for each row of the smaller table and inserts each row into a hash bucket (depending on the hash value). Then, the optimizer scans the larger table one row at a time, computes the hash key values, and produces matches.
Because the hash-join method does not require an index, it is highly applicable for ad hoc queries, in which you don't expect indexes. Therefore, this technique is valuable for retrieval operations in data warehousing systems, especially if the system groups rows together.
Disk Scan Improvements
The main unit of data storage in SQL Server is the page. The page size in SQL Server 7.0 is 8KB, in contrast to previous releases' 2KB page size. The unit of a read operation by SQL Server is usually an extent, which has also increased in size (from 16KB to 64KB). Both changes increase the system performance by letting SQL Server 7.0 read eight times more data in one I/O operation than previous releases allowed. (For more information on SQL Server 7.0 storage units, see Inside SQL Server, "Page Structures in SQL Server 7.0," Premiere issue.)
Another disk scan improvement in SQL Server 7.0 is the existence of the Index Allocation Map (IAM). The IAM is a bitmap that maps data pages for a table in physical order. During the read operation, SQL Server sequentially reads the IAM to specify which pages need to be read. If a specified operation reads several pages, the IAM lets the system execute read-ahead—namely, to place the needed pages into the buffer in advance. The previous releases of SQL Server use a page chain, where each page has a pointer that points forward to the next page. If the pages are read randomly, the page chain doesn't let the system use read-ahead in most cases.
SQL Server Query Analyzer
The SQL Server Query Analyzer is a tool that generates, executes, and stores Transact SQL (T-SQL) statements. Besides these features, you can use Query Analyzer to show a query's execution plan in a text or graphics format. Either format lets you examine the plan and make corrections to a query's indexes if the performance of the query is inefficient.
To run the Query Analyzer, click the Start menu, Programs, and then Query Analyzer in the SQL Server 7.0 menu, or select Tools, Query Analyzer from the Enterprise Manager main menu. To see a textual execution plan for a statement, activate the SHOWPLAN_TEXT or SHOWPLAN_ALL option of the SET statement in the upper part of the Query Analyzer window by setting one of these two options to ON. (The SQL Server 6.5 option for SHOWPLAN is no longer supported, so you must use one of these options instead, in SQL Server 7.0.) Next, enter the T-SQL statement that you want to analyze. Both options display detailed information about the selected execution for the query. Figure 1 contains an example of an execution plan in the text format.
To obtain a graphical execution plan for a statement, enter the statement and click the Display Estimated Execution Plan icon in the toolbar of Query Analyzer. Screen 1 contains an example of a graphical execution plan.
If you start the execution plan of a statement in the graphics mode, the statement does not execute and the flow diagram of the estimated plan appears in the lower part of the Query Analyzer window. Similarly, if you start the execution plan of a statement in the text mode, the statement does not execute because the SHOWPLAN_TEXT and the SHOWPLAN_ALL options disable the execution of the statement.
Viewing a query's execution plan can help you improve that query's performance. For example, you can use the text or graphics mode to view and modify a query's index. To understand this example, you first need to know the basics about SQL Server 7.0's data-organization forms. SQL Server 7.0 uses two different data-organization forms to store a table's data on disk: heaps and clustered tables. A heap is a form of data organization in which SQL Server stores table data in no particular order, and the sequence of the data pages is in no order. SQL Server stores tables with no clustered indexes, or without any indexes, in heaps.
SQL Server stores the data rows of a clustered table in an order based on the clustered index. SQL Server uses the linked list to link the data pages. Which physical data structure SQL Server chooses depends on the CREATE INDEX statement. Listing 3 shows the query to use to demonstrate possible performance issues. If you create the indexes in Listing 4, SQL Server stores both tables in heaps.
Suppose table order_tab contains 3000 rows, and the table position contains 20 rows per row of the table order_tab on average. A 1:n relationship exists between the order_tab table and the position table; that is, for each row of the order_tab table, there are n rows in the position table. Listing 5 contains the code for these two tables.
Figure 1 and Screen 1 show the textual and graphical execution plans, respectively, for the query in Listing 4. SQL Server reads the query execution plan in Figure 1 from the bottom up. Therefore, the SQL Server optimizer first uses both indexes i_pos_orderno and i_order_orderno for the index search on the column order_no in the position and order_tab table, respectively.
Next, the optimizer executes the Bookmark Lookup. The Bookmark Lookup value (e.g., [Bmk1002]) represents the RID, which the optimizer uses to identify the corresponding row. Both Bookmark Lookups correspond to the position and order_tab tables that appear in the SELECT statement. Finally, the optimizer applies the nested-loop method for the join operation in Listing 4.
You can modify the physical structure of the order_tab and position tables if you create clustered (instead of nonclustered) indexes. Listing 6 creates two clustered indexes for these tables. These clustered indexes prompt SQL Server to store both tables as clustered tables. Figure 2 and Screen 2, page 34, show the textual and graphical execution plans, respectively, for the query in Listing 4, after the creation of both clustered indexes.
By using clustered indexes, you reduce the number of nodes from five to three. Additionally, the optimizer chooses the hash-join method (instead of the nested-loop method) to execute the join operation. Thus, you eliminate the row access using RID. These changes can result in significant improvements of the elapsed time for the query in Listing 4.
Improved Querying
We have shown you quite a few, but not all, of the significant improvements in query processing that SQL Server 7.0 offers. SQL Server 7.0 supports new index techniques that give you significant performance gains for many queries, usually using just one table.
New join techniques let the optimizer choose the most appropriate method for performing a join and thereby improve performance of queries over two or more tables. Finally, SQL Server 7.0's Query Analyzer lets you use new index and join techniques by examining the execution plan of a query and making corrections to indexes.
About the Author
You May Also Like