Database Harmony
Normalization best practices that can give an existing database new life
February 20, 2002
When SQL Server Magazine asked me to write an article about how to normalize a database, I thought writing the article would be easy. However, I soon realized that the task of explaining all the things that I do when I evaluate an existing database was going to be harder than I thought. Many of the steps in my process are automatic for me—how could I put them into a logical order that would make sense to someone else?
Typically, a client doesn't ask me specifically to evaluate a database's level of normalization. Clients ask me to perform a database performance and integrity audit, which can include examining an apparent loss of data integrity. In a performance and integrity audit, I review and analyze many details, and reviewing the database architecture is only one part of the process.
When I analyze database performance, tuning, and integrity, I first look for obvious problems, resolve them, and then investigate the finer details. I look across the three layers of a database environment: the physical layer, the external layer, and the logical layer. The physical layer consists of hardware and networking (i.e., the physical components that support the database server). The external layer consists of the applications that are running against the database (i.e., batch jobs, ad hoc queries, and Web accesses). The logical layer consists of the user schema (i.e., the user-table architecture) and database-configuration parameters.
I start my audit by reviewing the physical layer—the hardware and network components—to eliminate operational bottlenecks, if possible, by implementing simple fixes such as increasing memory, hard disk, or network capacity. While I'm examining the physical layer, I also gather information about SQL Server file groups to learn how the database files are arranged on disk. This information shows me what the database's physical environment looks like. From the client, I learn the purpose of the database. I also get the client's description of the apparent loss of integrity. Are queries returning fewer records in the result sets than expected? More records than expected? Different values than expected? Or is the problem something such as corrupted data in the tables after a database operation?
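In SQL Server 2000, two system stored procedures report this arrangement when you run them in the database you're auditing:

EXEC sp_helpfilegroup  -- the filegroups defined in the current database
EXEC sp_helpfile       -- each file's logical name, physical path, size, and growth settings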
Next, I review the external layer, which includes the applications running against the database, to make sure that resource contention or locking conflicts aren't causing the performance and integrity problems. I use this sequence because rearranging the job mix or rewriting just one application so that it runs more efficiently is easier than restructuring the underlying files. I gain insight into the database's primary purpose by reviewing the programs that run against the database. I learn whether the database captures transactions (as it would in e-commerce), acts as a data warehouse, performs trend analysis, or is a hybrid that both captures transactions and produces information for decision support.
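A quick first look at contention, again in SQL Server 2000, comes from two more system procedures:

EXEC sp_who   -- active connections; a nonzero blk column shows who is blocking whom
EXEC sp_lock  -- current locks by spid, database, object, and lock type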
Finally, I look at the logical layer—the user-table schema and the configuration parameters. Based on the size and purpose of the database, I check the configuration parameters (e.g., buffer sizes, number of worker threads, tempdb, cache hit ratio) to be sure that nothing is grossly out of alignment. Then, I need to analyze only the user-table schema to determine whether the performance or integrity problems are associated with improper levels of data normalization.
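To review the server-level settings, sp_configure lists each parameter's configured and running values. (Enable the show advanced options setting first so that parameters such as the number of worker threads are visible.)

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure  -- name, minimum, maximum, config_value, and run_value for each setting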
Diagnosing Normalization Problems
Either undernormalizing or overnormalizing user tables can cause database performance and integrity problems. An undernormalized database typically has one table that contains many columns and data about many topics. Figure 1 is an example of such a table in a financial-analysis database. Each record in the table contains information about a financial product that an analyst is working on, the category into which the product falls, the contact person for the product, and a checklist for managing each product. Four subjects are stored in this table; this kind of overcrowding makes the data difficult to manage. The rule of thumb to follow is one table, one topic.
Conversely, an overnormalized database contains many tables, each of which has only a few columns. So a query that generates even a simple report must join many tables. Figure 2 is an extreme example of an overnormalized lookup table for regions. In this table, the regions don't have names—they have unique numbers. The table in Figure 2 has a region identifier, RegionID, which is also a number data type. Any query that includes a reference to region has to join to this Region table on the RegionID, even though RegionID and Region will likely have the same values.
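A hypothetical sketch of the pattern Figure 2 describes makes the cost plain. (The Sales table and its columns here are invented for illustration; only the Region table's shape comes from the figure.)

CREATE TABLE Region (
  RegionID int NOT NULL PRIMARY KEY,
  Region   int NOT NULL  -- not a name, just another number
)

SELECT s.SaleID, r.Region
FROM Sales s
JOIN Region r ON r.RegionID = s.RegionID  -- a join that adds almost no information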
Somewhere between these two extreme examples is a middle ground that you need to find for your production database environment. If you undernormalize, you'll encounter data-integrity problems when you try to insert, update, or delete records. An insertion anomaly happens when you don't know all the values of a multicolumn primary key, so you can't insert a record into the table. An update anomaly happens when you change data and you have to change more than one record in the table because the data is repeated many times in the same table. A deletion anomaly happens when you remove a record from a table and you lose unassociated or loosely associated facts in the process. For more information about modification anomalies, see SQL by Design, "The Four Integrities," November 1999.
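For example, an update anomaly in an undernormalized table might look like this hypothetical statement, which must touch every row that repeats the producer's address:

UPDATE WineList                            -- hypothetical undernormalized table
SET ProducerAddress = 'Beaune, France'
WHERE Producer = 'Maison Verte'
-- miss one qualifying row, and the table now contradicts itself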
A properly normalized database (at least third normal form—3NF) won't produce these insertion, update, and deletion anomalies. (For information about normalization levels, see SQL by Design, "Why You Need Database Normalization," March 1999. For a brief discussion of when denormalization is the best choice, see the Web sidebar "When Not to Normalize Your Database.") As you reduce the level of normalization from third to second or even to first normal form (1NF), you'll have to account for the data-integrity violations that happen as a result of the denormalization process. To retain the data integrity, you'll have to create compensating routines in every program that modifies the data in these tables.
Normalizing the Cellar Database
When I perform a user-schema evaluation, I use a checklist to guide me through the detailed steps that I have to remember. You might want to consider adapting this checklist for your own use. (For a copy of the checklist, see the sidebar "User-Schema Evaluation Checklist.") To look at the way I proceed through the checklist, let's use an example database called Cellar. Despite its small size and limited user base, this example is useful for demonstrating how to normalize a database. I explain the process I use to audit this database. You can follow along with me as I analyze Cellar by executing Web Listing 1 and Web Listing 2, which create and load the Cellar database. To download the Web listings in this article, click on the "Download the Code" icon at the top of the page.
Set the context. Imagine that the client is a wealthy person who lives in a large home with a climate-controlled wine cellar in the basement. As an aficionado of fine wines, the client has friends around the world with whom he shares his love of wine, and when he travels, he meets with some of those friends to sample a wine. The client uses the Cellar database to keep track of these wine tastings and to track how to get in touch with his friends. He records in the Cellar database the wines served at each wine tasting and who came to each wine tasting. The client is the only person who uses the Cellar database.
This brief description tells me what the Cellar database is, who uses it, and why it exists. This context helps me move on to the next step in the analysis: diagramming the database's physical design.
Create a physical design. Figure 3 shows a physical diagram that I created in Enterprise Manager. The diagram shows five tables—CELLAR, TASTEWINE, TASTING, TASTEGROUP, and VISITOR—the table relationships, the fields and data types, and each table's primary key. I can infer from the physical design that the CELLAR and TASTING tables are in a many-to-many (M:N) relationship (you can serve a wine that you store in Cellar at more than one tasting, and a tasting can feature more than one wine from Cellar). I can also infer that TASTING and VISITOR are involved in an M:N relationship (a tasting can include one or more visitors, and a visitor can attend one or more tastings). TASTEWINE and TASTEGROUP are associative tables. At first glance, this database looks well normalized.
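An associative table such as TASTEWINE resolves one of those M:N relationships. A minimal sketch of its shape (the column data types are my assumption; the column names come from Figure 3):

CREATE TABLE TASTEWINE (
  twdate datetime NOT NULL,  -- when the tasting took place
  Bin    int      NOT NULL,  -- which wine from the cellar was served
  PRIMARY KEY (twdate, Bin),
  FOREIGN KEY (twdate) REFERENCES TASTING (tdate),
  FOREIGN KEY (Bin)    REFERENCES CELLAR (Bin)
)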
List ambiguous tables. The physical diagram and the client's description of the use and purpose of the Cellar database help me understand most of the tables. The TASTING table contains a list of dates and places where the wine tastings occurred (or will occur). The VISITOR table is a list of wine-tasting group members and their phone numbers. TASTEWINE and TASTEGROUP are associative tables. The only table I don't quite understand is the CELLAR table; it contains a list of wine types, names, producers, years, and costs. But how do these fields relate to the primary key, Bin? After questioning the client further, I find out that the numbers in Bin represent storage containers in the wine cellar. Each container holds one type of wine. A type of wine is defined as one wine made by one producer in one year. Figure 4, which shows part of the CELLAR table, illustrates this organization (one type of wine listed in each row of the table).
I add the CELLAR table to the list of objects that I need to review more closely. Whenever I see a table that includes more than one type of data, I get suspicious. The CELLAR table lists wine types as well as where the wines are stored. I think that CELLAR is a candidate for further decomposition.
Find hidden arrays. A hidden array is a list or a collection of values that's stored as a series of simple attributes in a single table. The members of the list or collection are separate columns in the table, each with its own column name. Examples of hidden arrays are days of the week (i.e., Sunday through Saturday), different phone numbers (e.g., work, home, other), and different addresses (e.g., billing, shipping, other). Another term for a hidden array is multivalued attribute. (For a full explanation of multivalued attributes, see Solutions by Design, "Multivalued Attributes," page 51.)
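The VISITOR table in the Cellar database shows the pattern: the three phone columns form a hidden array. (In this sketch, only the phone column names come from the actual schema; the other columns and the data types are assumptions.)

CREATE TABLE VISITOR (
  VID       int NOT NULL PRIMARY KEY,
  Name      varchar(50),  -- assumed
  HomePhone varchar(12),  -- one "array element" per column
  WorkPhone varchar(12),
  CellPhone varchar(12)
)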
List ambiguous field names. In the CELLAR table, YearProduced is the four-digit year in which each wine was made. To fully understand this field, you have to know a few facts about wine making. Wine is pressed in the fall of the year, immediately following the grape harvest. In the Northern Hemisphere, the grape harvest might run from September to November, depending on the vineyard's location and the type of grape. In the Southern Hemisphere, the grape harvest is in March, April, or May. Knowing this information helps you calculate the age and drinkability of a wine—a Northern Hemisphere wine produced in 2000 will be 1.5 years old by the time you read this article, but a Southern Hemisphere wine will be 2 years old.
The field Ready is also a little ambiguous. You have to know that a wine is generally not drinkable when it's first made. You have to wait a period of time, from 6 months for the Beaujolais Nouveau to many years for the finest French reds. So, the field named Ready is another four-digit year that denotes when the wine will be ready to drink.
So far, all the fields seem straightforward. I have to depend on the client's memory and verbal explanations for the definitions of the fields. Such a situation isn't ideal, but it's the most common situation. An analysis will be more precise if the client has a metadata repository and has kept it updated. The metadata repository can act as a dictionary for the entire database—it can contain definitions and properties of each of the fields. As part of my analysis, I usually create a glossary that can later be converted or input to a metadata repository tool, if the client wants to do so.
Look for dirty data. I always visually inspect samples of a client's data to determine whether the data is "dirty"—that is, whether the client has incomplete or incorrect data in the database and, if so, where it is. Dirty data occurs when tables don't have primary keys or uniquely indexed natural keys, so users can easily insert duplicate rows. Dirty data also occurs when dependent one-to-many (1:M) relationships aren't enforced, allowing rows in the child table that don't have an associated row in the parent table. Sometimes I use queries (specifically outer joins) to determine whether data is dirty; then I log the results to show to the client.
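A typical orphan check is an outer join from the child table back to its parent. For example, this query (using the column names from the Cellar schema) returns any TASTEGROUP row whose date has no matching tasting:

SELECT tg.*
FROM TASTEGROUP tg
LEFT JOIN TASTING t ON t.tdate = tg.tgdate
WHERE t.tdate IS NULL  -- any rows returned here are orphans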
The Cellar database's tables are pretty clean. No dates exist in the TASTEGROUP table or the TASTEWINE table that don't appear in the TASTING table. Each Visitor ID (VID) in the TASTEGROUP table is already in the VISITOR table, and every Bin number in TASTEWINE is already in the CELLAR table. All the YearProduced and Ready values in the CELLAR table fall within a valid range, from 1900 to 2005. Each phone number in the VISITOR table is at least 12 characters long, which implies that the area code and user-friendly formatting (nnn-nnn-nnnn) have been included for each record in which the phone number isn't NULL. I routinely run these kinds of queries to substantiate my suspicions if I believe that a lack of entity or referential integrity controls is causing the loss of data integrity. (For more information about database integrities, see "The Four Integrities.")
Locate redundancy across tables. You have data redundancy across tables when the same non-key data is in two or more tables in the database. For example, if I found a field called Name instead of VID in the TASTEGROUP table, I'd assume that visitor names were redundant in this database. A follow-up visual inspection of the data would confirm my suspicions. If a field is non-key—that is, if a field isn't part of a primary key or part of a foreign key—then the field should appear only one time and in only one table.
If you find what you think is a redundant data condition across tables, you need to investigate more closely. In Figure 3, a datetime type of field is present in three tables: TASTING, TASTEGROUP, and TASTEWINE. All three occurrences of the date field denote when a tasting happened. You might mistake this repetition for cross-table redundancy. However, tdate in the TASTING table is a primary key. Tgdate in TASTEGROUP and twdate in TASTEWINE are part of the primary key for each of their respective tables. Tgdate and twdate are also foreign keys and are part of enforced relationships.
Find transitive dependencies within tables. A transitive dependency is a situation in which one non-key field identifies or determines the value of another non-key field in the same table. You can't find a transitive dependency unless you understand each of the fields, what they mean, and how they relate to one another. If you don't have a metadata repository available, you have to depend on the client's memory and understanding of the data to make sense of it.
In the CELLAR table, which is already on my change list, I believe that I've found a transitive dependency. Comments that describe how a wine tastes aren't dependent on where the wine is stored. When a wine will be ready to drink is also not dependent on where the wine is stored. A combination of Wine plus Producer plus YearProduced determines when a wine will be ready (Ready). Many things determine how a wine tastes: the type of wine, who produced it, and when it was produced. The wine taster records his evaluation in Comments. Thus, Wine, Producer, and YearProduced combine to define the values of Ready and Comments. This combination is the transitive dependency in the CELLAR table. When you find a transitive dependency, you need to break the dependent fields into a separate table and relate the two tables to each other. I add this suggestion to my change list.
Identify less-than-full functional dependency. Full functional dependency means that each non-key field in a table is totally dependent on the entire primary key for its meaning. If the primary key is multicolumn, then each non-key field must be dependent on the entire primary key, not just one part of the primary key. Any table that contains fields that have less-than-full functional dependency on the primary key is in 1NF. Figure 1 is an example of a 1NF table with an embedded array (the checklist). This table doesn't have a primary key. You'd decompose this table into second normal form (2NF), then 3NF, following the process outlined in "Why You Need Database Normalization."
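A classic illustration of less-than-full functional dependency (the table and columns here are hypothetical): in an order-line table keyed on OrderID plus ProductID, the product's name depends on ProductID alone.

CREATE TABLE OrderLine (        -- hypothetical 1NF table
  OrderID     int NOT NULL,
  ProductID   int NOT NULL,
  Qty         int,
  ProductName varchar(50),      -- depends only on ProductID, not the full key
  PRIMARY KEY (OrderID, ProductID)
)
-- For 2NF, move ProductID and ProductName into their own PRODUCT table.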
Propose database changes. I've converted my list of changes for the Cellar database to the code that Web Listing 3 shows. I propose to break the CELLAR table into two tables: WINETYPE and WINESTORAGE. The WINETYPE table contains a list of the wines in the cellar, the wine producers, the years the wines were produced, each wine's ready date, and comments about each wine. I uniquely identify each entry in the WINETYPE table with a WineID (the primary key). The associated WINESTORAGE table contains an entry for each storage bin number; the entry includes the WineID of the wine that's stored in the bin, the number of bottles, and the cost per bottle. WINESTORAGE is an enforced reference to WINETYPE because I want to restrict storage of wines in the WINESTORAGE table to only those wines identified in the WINETYPE table.
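Web Listing 3 contains the actual DDL; the shape of the two tables is roughly this (the data types, the IDENTITY property, and the bottle-count and cost column names are my assumptions):

CREATE TABLE WINETYPE (
  WineID       int IDENTITY NOT NULL PRIMARY KEY,
  Wine         varchar(50) NOT NULL,
  Producer     varchar(50) NOT NULL,
  YearProduced smallint    NOT NULL,
  Ready        smallint,     -- year the wine will be drinkable
  Comments     varchar(255)
)

CREATE TABLE WINESTORAGE (
  Bin           int NOT NULL PRIMARY KEY,
  WineID        int NOT NULL REFERENCES WINETYPE (WineID),  -- enforced reference
  Bottles       int,
  CostPerBottle money
)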
I also believe that I need to decompose the VISITOR table into two tables: PERSON and PERSONPHONE. The PERSON table contains a list of all the friends who come to the wine tastings; I uniquely identify each person with a VID. The PERSONPHONE table contains phone numbers and phone types related to each VID. If a person has three phones (a home phone, a work phone, and a cell phone) in the VISITOR table, then that person will have three entries (three rows) in the PERSONPHONE table. The relationship between PERSON and PERSONPHONE is 1:M and is also an enforced relationship because I don't want a phone number in PERSONPHONE that doesn't have an associated record in the PERSON table.
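The same caveat applies here; a sketch of the decomposed tables (the data types and the Name and PhoneType column names are assumptions):

CREATE TABLE PERSON (
  VID  int NOT NULL PRIMARY KEY,
  Name varchar(50) NOT NULL
)

CREATE TABLE PERSONPHONE (
  VID       int NOT NULL REFERENCES PERSON (VID),  -- enforced 1:M relationship
  PhoneType varchar(10) NOT NULL,                  -- 'Home', 'Work', or 'Cell'
  Phone     varchar(12) NOT NULL,
  PRIMARY KEY (VID, PhoneType)
)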
The code in Web Listing 3 also contains SQL statements that I can use to transfer data from CELLAR to WINETYPE and WINESTORAGE, and from VISITOR to PERSON and PERSONPHONE. I can create the new tables while production programs are accessing the old tables. But I need to load the new tables while the old tables are offline to avoid losing data modifications to the old tables while the new tables are loading.
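For the phone data, the transfer has to unpivot the three phone columns into rows; one way to write it (again assuming the sketched schemas above) is a UNION ALL of three SELECTs:

INSERT INTO PERSONPHONE (VID, PhoneType, Phone)
SELECT VID, 'Home', HomePhone FROM VISITOR WHERE HomePhone IS NOT NULL
UNION ALL
SELECT VID, 'Work', WorkPhone FROM VISITOR WHERE WorkPhone IS NOT NULL
UNION ALL
SELECT VID, 'Cell', CellPhone FROM VISITOR WHERE CellPhone IS NOT NULL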
Design views that support existing applications. The code that Web Listing 3 shows contains the Cellar and Visitor views that will recreate virtual copies of the CELLAR and VISITOR tables. Before I can create these new views, I need to rename the original CELLAR and VISITOR tables, so I have to take them offline.
In the Cellar view, WINETYPE and WINESTORAGE are combined with an inner join to recreate the layout of the old CELLAR table. However, in the Visitor view, to recreate the phone array that was in the VISITOR table, the join operation has to pivot the data to display HomePhone, WorkPhone, and CellPhone in one record with their associated person.
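In SQL Server 2000, that pivot is typically written as three joins to the phone table, one per phone type. A sketch of the view under the assumed schema:

CREATE VIEW Visitor AS
SELECT p.VID, p.Name,
       h.Phone AS HomePhone,
       w.Phone AS WorkPhone,
       c.Phone AS CellPhone
FROM PERSON p
LEFT JOIN PERSONPHONE h ON h.VID = p.VID AND h.PhoneType = 'Home'
LEFT JOIN PERSONPHONE w ON w.VID = p.VID AND w.PhoneType = 'Work'
LEFT JOIN PERSONPHONE c ON c.VID = p.VID AND c.PhoneType = 'Cell'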
Implement the changes. If the client agrees with my suggestions for normalizing the Cellar database, and the DBA gives me time to work during a maintenance window when no production programs will be accessing the tables, I implement the code in Web Listing 3. After I make the changes, I can diagram the database as Figure 5 shows.
Monitor performance. After all the restructuring is finished, I ask the DBA to track performance to determine whether any appreciable degradation has occurred. The DBA needs to pay special attention to the Visitor view because the number of queries required to recreate the old VISITOR table has increased from one to three. If performance has dropped, the client has two options: He can enhance performance with more powerful hardware and a higher-bandwidth network, if necessary, or he can have his development programmers rewrite the production programs to take advantage of the new user-table architecture.
New Life
Normalizing a database is easiest at the very beginning of development, before the database goes into production. However, such an ideal situation doesn't happen often. To normalize an existing production database, you need to know what the client uses the database for, what each table means, and how each field in a table relates to the other fields in that table so that the changes you make won't adversely affect existing production. The changes that you make to a production database can make it scalable and extendable and can reduce the instances of data corruption. By properly normalizing a production database, you can give it a new lease on life.