Seven Deadly Sins
Avoid these common data-modeling pitfalls
October 22, 2002
Several readers have asked me to list the typical pitfalls that beginning database designers encounter when they undertake a conceptual model for the first time. The conceptual model is so named because the entity-relationship model that you create is a concept of what the database will look like when you implement it. Even though novice database designers can avoid some pitfalls by reading about them, they're bound to make mistakes. Let's look at some of the most serious errors.
1. Misunderstanding Requirements
Before you begin to model a database, you need to understand the business and system requirements. Data modelers must interview people within the organization who understand what kind of data needs to be captured and how, where, and when it will be used. Without this information, data modelers can't create a plan for a database that will work for their organization. When modelers encounter conflicting requirements (e.g., to one department, a customer is a person who buys goods, but to another, a customer uses services), they must resolve those conflicts so that the project can proceed. Then, modelers have to match system specifications to the requirements.
2. Using Inconsistent Naming Conventions
Using inconsistent naming conventions (or using no naming convention at all) for database objects can lead to confusion and errors in data retrieval. Databases evolve over time, and the names of the database objects (tables, columns, keys, indexes) evolve with the database. Too often, naming inconsistencies become so prevalent that they cause confusion when users attempt to access data, and they eventually inhibit productivity. Here are some of the rules for creating useful names for database objects:
Use descriptive names that are meaningful to the entire organization; don't incorporate jargon that's familiar to only a small section of the company.
Use a name that clearly identifies the database object. Consider using a variation of the notation scheme that Microsoft introduced in Access 1.1, in which you prefix objects with three-letter identifiers (e.g., you designate the Employee table as tblEmployee).
Use the minimum number of words necessary to convey the meaning of the database object. In SQL Server, you're limited to 32 bytes for any object name—which is still too long to repeat in your SQL Server queries and stored procedures.
Don't confuse the meaning of the name by adding redundant terms to it (e.g., tblRedundantTable).
Don't use acronyms, and use abbreviations judiciously. You can't count on the availability of a business metadata repository to document and decode the meaning of acronyms and abbreviations.
Don't use names that implicitly or explicitly identify more than one topic (for tables) or characteristic (for columns).
Use the singular form of a name wherever possible, especially for entities and tables; it helps you correctly distinguish the entity relationships (1:1, 1:M, M:N).
Don't include spaces in database object names (such as Employee ID). No other database management system (DBMS) supports these spaces, and you might have to integrate another DBMS with your SQL Server environment someday.
As you choose naming conventions, keep in mind that no convention is perfect and that any naming convention is better than no convention. Pick one convention and use it.
3. Duplicating Attributes
Creating the same attribute in more than one entity is a temptation that many novice data modelers can't resist. Neophytes are often slow to accept data normalization or to understand that SQL joins can successfully reunite data fields that need to be included together on a report or form. The only attributes you have to duplicate are primary and foreign keys.
4. Using a Printed Report as a Table Template
Novices often try to use a printed report as a template for a table by transferring the report's fields into a table or entity. Perhaps the most obvious example is the invoice. I've seen entities and tables named Invoice containing many different arrangements of attributes that belong in three separate tables—Sale, Product, and SaleItem. A printed report is both a business requirement for and one of the outcomes of your database design. To produce the report, you need to store the attributes that compose the report in the database. A printed report is rarely a template for a table layout.
5. Failing to Use Supertypes and Subtypes
Data modelers use supertype and subtype entities to distinguish the different categories of an entity, such as the PERSON entity and its subtypes EMPLOYEE and AUTHOR, as Web Figure 1 (available online) shows. When you're modeling, you need to demonstrate the differences between an employee and an author, and you have to show which attributes modify each of the three entities (PERSON, EMPLOYEE, AUTHOR). The supertype/subtype paradigm forces you to identify the attributes and relationships that interact with both the supertype and the subtypes. The common attributes—which in Web Figure 1 are PersonID, FirstName, MInit, LastName, Address, Phone, and Email—modify the supertype entity PERSON. You need to record these attribute values for all people—whether the person is an employee or an author. Some relationships involve only the supertype, rather than the subtypes. For example, the PERSON entity is involved in the relationship PERSON_PUBLISHER. The relationship shows you that all people—employees and authors alike—work for a publisher.
For each subtype entity, you have to identify the attributes or relationships that are specific to that one subtype. In Web Figure 1, an employee is involved in publication activities and also has specific skill sets. The author writes articles and receives royalties.
If you need to represent categories of an entity in your design but you fail to use the supertype/subtype structure to help analyze the data requirements for your database, you risk incomplete understanding of your data. You also risk building modification anomalies into your database. For instance, in Web Figure 1, a person can be an employee of a publishing house and an author for that firm at the same time. If you didn't use the supertype/subtype structure in this design (i.e., if you included only an EMPLOYEE entity and an AUTHOR entity), duplicate data about one person who is both employee and author would be stored in the EMPLOYEE table as well as the AUTHOR table. This duplication would lead to non-key data redundancy within the database, which can result in insert, update, and delete anomalies that cause unsynchronized data and loss of data integrity.
6. Failing to Recognize a Multivalued Attribute
Failing to recognize and resolve a multivalued attribute might be one of the most common mistakes a data modeler can make. As I've said before, a multivalued attribute contains a list or a collection of values for each instance of that attribute. Typically, neophyte designers treat these lists or collections as separate attributes of an entity or separate columns in a table, as the SERVER table (entity) in Web Figure 2 shows. This flattened design might initially be easy to program against (it's especially compatible with the detail report that displays the general system configuration), but this same design is difficult to manipulate for summary data and is counterproductive when you manage or update data in the database. A multivalued attribute needs to become a child table in which each of the multiple values becomes a row in the child table and each row of the child table contains a foreign key that links to the parent table's primary key. In Web Figure 3, the data model for the transitions from multivalued attributes to child tables, the multivalued attributes in Web Figure 2 are broken out into the DISK_DEVICES and INSTALLED_COMPONENT child tables.
7. Failing to Model to 3NF
Failing to model to third normal form (3NF) at the conceptual design stage might be a controversial point in this article, especially to working database professionals who are pulled into database design without background in data-modeling theory. To understand a data model's business requirements and the data items' relationships, you must model to 3NF at the conceptual stage. Then, when you implement the design, you can denormalize the design back to 2NF for production efficiencies or for programming convenience. When you denormalize the implementation of a 3NF data model, you know where to expect the synchronization problems caused by insert, update, and delete anomalies that the denormalizing operation introduces. And you can more easily create the compensatory programs that keep the redundant data in check and synchronized. For more information about data normalization, see "Why You Need Database Normalization," March 1999.
Avoidable Pitfalls
All new data modelers, including me, committed one or another of these seven deadly data-modeling sins when they were learning the skill. If you avoid making these mistakes, you'll discover that your projects go faster and more smoothly. You'll also find that your database designs will endure for years, sometimes outlasting the company for which they were created and providing good service to the querying public.
About the Author
You May Also Like