Inside SQL Server: Parse, Compile, and Optimize
Introduction to the query processor
October 23, 1999
SQL Server 7.0's query processor isn't a specific piece of SQL Server; it's the part of the relational engine that parses, optimizes, and compiles. The relational engine and the storage engine contain most of the SQL server components. The relational engine parses the submitted statements, optimizes the SQL statements, compiles the code, and manages the query execution. During execution, programs call the storage engine to retrieve or manipulate the data stored on disk. (See also, "Optimizing Parameterization for Query Plan Reuse" and "Optimizing Variables and Parameters").
The query processor goes through three phases before producing a plan from a query that you submit. First, it parses and normalizes the statements. Then it compiles the Transact SQL (T-SQL) code. Finally, it optimizes the SQL statement. This month, I look at each phase of query processing and how SQL Server manages cached query plans.
Parse and Normalize
I won't go into detail about the query-parsing process. In short, the parser checks for correct syntax, including the correct spelling and use of keywords. The normalizer's main function is to perform binding, which means checking whether tables and columns exist, loading meta data about the tables and columns, and adding information about required (implicit) conversions to the sequence tree (an internal data structure that stores information about the query). Normalization also replaces references to a view with the definition of that view and performs a few syntax-based optimizations. These syntax optimizations include converting right outer joins to equivalent left outer joins, flattening subqueries to an equivalent join syntax, and simplifying some negative expressions.
Compilation
At the second phase, compilation, the query processor looks for statements that the optimization process won't include and runs the statements through a T-SQL language compiler. These statements (e.g., variable declarations and assignments, conditional processing with IF, and iteration with WHILE) aren't part of the basic data-manipulation language (SELECT, INSERT, UPDATE, and DELETE); they're the extra features that add functionality to your SQL code. These statements don't need optimizing, but they need compiling before execution.
Although true compilation affects only these special T-SQL statements, people frequently use the term compile to refer to the whole process of compilation and optimization. When the documentation refers to recompiling a stored procedure or the SQL Server Profiler records a compilation event, compilation means the process of compiling the special T-SQL statements and optimizing the SELECT, INSERT, UPDATE, and DELETE statements.
Optimization
The third and most complex phase of query processing is optimization. The SQL Server query optimizer determines the ideal execution plan for each SQL statement in a query batch or stored procedure. Unlike the optimizer in earlier releases of SQL Server, the SQL Server 7.0 optimizer doesn't always perform an exhaustive search to find the best query plan. With so many possibilities, including three kinds of joins and any number of indexes on each table, evaluating all possible execution plans could take longer than executing the query.
To find a good plan in a reasonable amount of time, the optimizer goes through several phases. In the first phase, the optimizer determines whether the query's plan would be trivial. A trivial plan has a known, constant cost because it involves no parameters or uses predefined system functions. If a query has a trivial plan, the optimizer returns that plan to the query processor, and no more work is necessary.
If the query doesn't have a trivial plan, the optimizer retrieves the available statistics for all columns and indexes involved in the query, and starts looking for a plan that is good enough. (The optimizer has an internal definition of good enough based on table size and query complexity.) The optimizer first looks for simple plans, which are similar to the plans earlier releases of SQL Server use. Such simple plans usually use nested-loop joins and one index per table. If the optimizer finds no plan that's good enough, it investigates more complex plans—using multiple indexes on a table, for instance. If a table involved in a join has no good indexes, the optimizer evaluates the cost of using a hash join and then tries more complex possibilities. (For descriptions of nested-loop and hash joins, see Dusan Petkovic and Christian Unterreitmeier, "New Features for Query Processing," July 1999.)
After the optimizer tries an internally determined number of plans and finds no plan that's good enough, the final phase of optimization begins. Now the optimizer decides whether to try to create a plan for parallel processing. It considers this option only if the machine is a multiprocessor and you've appropriately set the configuration options Cost threshold for parallelism and Max degree of parallelism. If a parallel plan isn't possible, the optimizer enters a full optimization phase and evaluates every possible way to process the query.
Theoretically, the query optimizer always finds a query plan for any valid SQL query. If nothing else, the plan can specify table scans on every table, or forcible nested-loop joins. However, the optimizer is a constantly evolving piece of code, and it doesn't always do what theory predicts.
Procedure Cache Management
After the optimizer develops a plan and returns it to the query processor for execution, the plan goes into SQL Server's cache. As I demonstrated in my discussion of reusing cached plans (see "SQL Server 7.0 Plan Caching," September 1999, and "Stored Procedure Plans," October 1999), caching can be a good thing. Not needing to reoptimize a query or stored procedure can give you big performance improvements. But how long does a plan stay in cache? With luck, it stays as long as it's useful.
SQL Server 7.0 has no separate area of memory for storing procedure or query plans. It has one unified cache, where it stores data pages and query plans (and a few other things). A component of SQL Server called the lazywriter is responsible for determining whether to free up memory the cache is using or keep the plan in cache. When the optimizer produces a plan and places it in cache, SQL Server assigns a cost to it, using an internal unit of measurement based on I/O requirements and CPU time. The cost reflects how expensive recreating the plan would be. The lazywriter is an internal SQL Server process that regularly cycles through all the buffers in SQL Server's memory cache and looks at the cost of each. It places any buffer with a cost of zero on the free list, where it's available for reuse by another plan or new data pages brought into cache. If the cost of a buffer is greater than zero when the lazywriter inspects it, SQL Server decrements the cost by one. Every time a plan in cache is reused, its cost reverts to its original value. Thus, if a plan is reused before the lazywriter reduces its cost to zero, the plan stays in cache. Plans that are more expensive to produce stay in cache longer, even if they aren't reused as often. Plans that are cheap to produce must be reused more frequently to stay in cache.
Preparing for Execution
When a program wants to execute a query, SQL Server checks whether a plan is already in cache. Sometimes a long gap occurs between the time a query is placed in cache and the time SQL Server executes it—in some cases, minutes or hours after the optimizer creates the plan. It's easy to find out whether the plan for a stored procedure is in cache because you can reference it by name. But how can SQL Server find out whether an ad hoc or autoparameterized query is in cache and available for reuse? When the plan goes into cache, SQL Server generates a cache key for the query, combining information relating to the query's text and to the SET options in effect at the time of plan creation. When searching for a plan, SQL Server uses the same internal algorithm used to generate the key originally, and can use the key to determine whether any matching plans are in cache.
A query might need recompilation and reoptimization if a long time passes between execution and initial plan caching. This process occurs primarily based on meta data changes (adding or removing columns or indexes) or statistics changes. Recompilation doesn't take place because of changes in parameters, system memory available, or data in the cache.
The Final Phase
The last step is to execute the query, which involves communication between the query processor and the data storage engine. But that, as you might guess, is a topic for another time.
About the Author
You May Also Like