Essential Aspects of Database Design
Important considerations for indexing, maintenance, and statistics
January 23, 2007
The key to great database performance lies in finding ways to help SQL Server more quickly save or retrieve data and reduce its response time. In my previous article, "Database Design for Performance," (November 2006), I explained how database design affects performance. Building on that foundation, let's look now at three essential areas related to database design: indexing, maintenance, and statistics. All three are interrelated: When you have the right indexes, database maintenance will be effective, and effective database maintenance includes updating statistics, which helps SQL Server better use the indexes when responding to a query.
Indexing for Performance: General Guidelines
Without going through all of the details about how to best index, I can recommend some general guidelines you should follow. First, choose the clustering key. Next, add your primary and unique constraints. Manually index your foreign keys. Finally, leverage tools such as Database Engine Tuning Advisor to help you do some workload tuning. For more information about this tool, see the SQL Server 2005 Books Online topic "Database Engine Tuning Advisor Overview."
However, this is not to say that indexing for performance is simple—you need to consider many factors. I'd like to focus on two areas: clustered indexes and secondary, or nonclustered, indexes. Questions to consider include the following: Should every table have a clustered index? Is just the existence of a clustered index good enough? Are all clustered indexes equally effective? With nonclustered indexes, does the selectivity of the high-order element matter? Should nonclustered indexes be wider—that is, should they include more columns? How much can these indexes really help SQL Server to retrieve data more quickly?
Clustered Indexes
When you create the key to a clustered index, you need to be aware of several factors. Because of how the storage engine has changed from the way that SQL Server 6.5 handled base tables, in SQL Server 7.0 and later releases the important things that I look for in a clustering key are that it's unique, narrow, and static.
Unique. A clustering key should be unique because it's used as the lookup key for all nonclustered indexes. For example, imagine an index in the back of a book. To enable you to find the data that an index entry points to, that index entry must be unique. Otherwise, you wouldn't find the exact item you're looking for. Similarly, the clustered index needs to be unique. However, SQL Server doesn't require that your clustering key be created on a unique column. You can create it on any column. If the clustering key isn't unique, SQL Server will make it unique by adding a four-byte integer to the data. Just realize that creating the clustered index on a nonunique clustering key causes additional overhead at index creation, wastes disk space, creates additional costs during INSERTs and UPDATEs, and in SQL Server 2000, adds the cost of a clustered index rebuild, which is more likely to be needed because of the poor choice of clustering key.
Narrow. A clustering key should be narrow (i.e., composed of as few columns as possible), for some of the same reasons it should be unique. Because the clustering key is used as the lookup key for all nonclustered indexes, the clustering key is duplicated in all nonclustered indexes. If the clustering key is too wide (i.e., composed of more columns than it needs to be), then all of the nonclustered indexes will be unnecessarily wide, which wastes disk space, creates additional costs during INSERTs and UPDATEs, and requires more time (because of the greater size) to rebuild these index structures. So, what does narrow mean? It means using as few bytes as possible to help uniquely define your rows—a narrow numeric when possible, as SQL Server handles numeric data types more efficiently than it does most other data types.
Static. A clustering key should be static for some of the same reasons it should be unique and narrow. If the clustering key is used as the lookup key for all nonclustered indexes, then the clustering key is duplicated in all nonclustered indexes. In fact, for a given table, the clustering key will be the most-duplicated data. If this data changes, then the value will have to be updated in the base table as well as in every nonclustered index. If the key changes, it will cause the record to move. When a record moves, it creates fragmentation. Like unnecessarily wide, nonclustered indexes, fragmentation wastes disk space and creates additional costs during INSERTs and UPDATEs. Fragmentation also increases the time required to perform data updates (because of record relocation and the likely need for subsequent splits) and requires more maintenance.
Typically, I recommend using a numeric IDENTITY column as the clustering key, but I'm frequently asked about using a globally unique identifier (GUID). A GUID does meet the criteria fairly well: It's unique, usually static, and relatively narrow. So, what's wrong with using GUID? Apart from being large and somewhat incomprehensible, in SQL Server 2000, the function that generates GUID values—newid()—doesn't create an ever-increasing pattern as an IDENTITY column would. The ability to create an ever-increasing pattern isn't one of the primary criteria I look for in a clustering key, but it does add many benefits. Without an ever-increasing pattern, your table is likely to become fragmented and perform poorly. For more information about GUID, see the sidebar "Using the newsequentialid() GUID Function."
If the clustering key is ever increasing, then new rows have a specific location where they can be placed. If that location is at the end of the table, then the new row needs space allocated to it, but SQL Server doesn't have to make space in the middle of the table. If a row is inserted in a location that doesn't have any room, then SQL Server needs to make room. So, if you insert a row based on last name, then as rows come in, space will need to be made where that name should be placed. To make room, SQL Server makes a split.
Splits in SQL Server are 50/50 splits. Simply put, 50 percent of the data stays and 50 percent of the data is moved. This keeps the index logically intact even though it's not physically intact. When an index has a lot of splits, we call that index fragmented.
Good examples of an index that is ever increasing are IDENTITY columns, which are also naturally unique, static, and narrow. Another example is something that meets as many of these criteria as possible, such as a datetime column, or even better than that, a composite key of datetime + identity.
Nonclustered Indexes
The general strategy to take is that it's better to have fewer, but wider composite indexes than to have many narrow indexes. Narrow indexes, such as single-column indexes, have fewer uses than wider indexes. I've seen a lot of debates over the selectivity of the first column (known as the high-order element) of the index, but it's not really important whether the first column is the one that's most frequently used, and that's not my primary focus around selecting nonclustered indexes. When you're choosing the right nonclustered indexes, it's more important to "know the system" and understand how SQL Server uses indexes rather than assume general criteria about the first column.
The high-order element of the indexes is important; however, the element doesn't need to be highly selective for the index to be useful. (Note that the selectivity of an index depends upon the percentage of rows in a table having the same value for the indexed key. An index's selectivity is optimal if few rows have the same value.) Let's consider a hypothetical table that's very wide and that covers personal data similar to that of the census data for the United States. This table is tracking a lot of items, and the table width averages 1200 bytes per row. Using our knowledge of page densities and record format, we can see that we can fit only an average of six rows per page. If the table stores 300 million rows at six rows per page, the table will be 381GB in size. Generally, a table of this size would be difficult (and potentially slow) to analyze. However, if a common aggregate query for the table is one that returns the number of households by "HeadofHousehold Gender," then "Age," then "Dependents," we're looking at a query that needs only a small subset of columns from the table.
Additionally, based on the column order, this index is also useful for counting rows by HeadofHouseholdGender alone or HeadofHousehold Gender and Age together. You can also use this index for any left-based subset of columns, that is, columns that comprise the left portion of the index. The point is, the high-order element (HeadOfHouseholdGender) is not very selective, yet this index is useful for a variety of aggregates. Another thing that makes this index so effective is its size. With three columns that are likely to be stored as a bit (Gender), a tinyint (Dependents), and another tinyint (Age), the total for this index key—including an index row header (four bytes) and the clustering key (about eight bytes)—is 15 bytes per row. Because this row is so narrow, SQL Server can fit 539 rows per page. The personal data table stores 300 million records, yet the index is only 4GB in size. A 4GB structure is significantly more optimal to read than if you have to read the entire table. Also, if you have the data ordered appropriately for the Group By command that you need to retrieve, then SQL Server can achieve a stream aggregate rather than a hash aggregate, which helps the query perform. Additionally, if this data is read-only, as census data would be, you can also consider using more interesting forms of indexes such as those created by Indexed Views. I'm not saying you can't create an indexed view on transaction processing systems, but there are more possibilities of blocking on a transaction processing system.
Maintenance
When data changes, we need to perform maintenance. Maintenance helps keep your tables contiguous and your pages denser; it also helps improve resource utilization and therefore improves overall performance. But how do you maintain your tables, and what does it really mean to defragment a table? This topic is fairly complex, but here are a few things to think about and review.
In SQL Server 2005, you can use the sys.dm_db_index_physical_stats dynamic management function to determine how fragmented your table is. Use this function to get a feel for how much and how fast fragmentation is occurring. Also, consider rebuilding your table if you have 25 percent or greater fragmentation, you have a dedicated maintenance window and ample disk and log space, and you can rebuild online.
Consider defragmenting your table if you can't rebuild online, you don't have more than 25 percent fragmentation, and you can't afford the disk or log space to rebuild.
Statistics
Because statistics are the optimizer's source for estimating how much data will be processed, they're crucial for better performance. Statistics are used with indexes, but standalone statistics (statistics that are created in addition to indexes) can also help to improve a variety of optimization strategies as well as improve the usefulness of a composite nonclustered index. This is also complex, but here are a few things to think about and do: First, make sure that the database option "auto create statistics" is on. (It's on by default.) Second, make sure that the database option "auto update statistics" is on. (It's on by default.) Third, consider using the sp_createstats stored procedure to create additional statistics on the secondary columns of your nonclustered indexes. Use the following command:
sp_createstats ‘indexonly', ‘fullscan'
Finally, realize that certain correlations between columns can't be perfectly described by statistics alone. Sometimes the only way to answer a query is with the right index.
Results Match Effort
If you design with performance in mind, the rewards to your business can be huge. With careful thought and attention to indexing, maintenance, and statistics, as well as database design, you'll be the hero who designed for performance—and achieved it.
About the Author
You May Also Like