Smart Data Validation
Improve data validity by applying core database modeling and design concepts
September 20, 2006
When someone asks you to describe the data in your organization, a given system, or a particular database, how often do you answer in terms of storage size, growth rate, the number of tables, or the number of rows? These characteristics are certainly important for designing and maintaining an efficient and supportable solution, but I propose that instead of answering only in terms of quantity or volume, your response should also include a characterization of the data's quality. After all, it's not the amount of data that matters: it's the data's quality that's important. An organization can have the largest and fastest growing database in the world, but if it can't trust or quantify the quality of its data, the data is useless.
The term "smart data validation" describes the concept of setting up the database and processing environment to enforce validation automatically. When you're designing, implementing, and maintaining your own databases, you should consider several important aspects of database design: data ownership and stewardship; data definitions; data modeling; data normalization; nulls and data types; and data cleanup and integrity. As you read through this article's overview of these aspects, one caveat you should remember is that there are no absolutes when talking about design and implementation options. Instead, you should always consider the features, advantages, and tradeoffs of each aspect as they relate to your particular environment. If you keep this rule in mind as you read, you'll get lots of ideas for increasing your data's quality by implementing smart data-validation strategies.
Basic Concepts: Definition, Ownership, and Stewardship
Before digging into the technical details of data validation, we must begin by discussing three crucial factors that affect data quality: data definition, data ownership, and data stewardship. Failure to address these factors can lead to an inability to maintain the data's consistency and validity over time. Definition describes, in business terms, the meaning of the data. Your definitions should include not only a brief textual description of the data, but also any limiting constraints such as valid values or relationships to other data. These definitions form the basis of any rules used during data validation. (I discuss data definitions in more detail in a moment.)
Ownership refers to the people responsible for making decisions affecting the data. Some of the most important decisions that database owners make include validating the data's definition and approving the data's acceptable validity level (which I define later in the article). Stewardship refers to the group responsible for maintaining the data. This group includes the application developers responsible for any systems that acquire and maintain the data, the DBA responsible for storing and ensuring the data's availability, and any users who consume, create, or enter the data.
This article primarily focuses on validation as it relates to data definition and stewardship, but if you haven't established ownership, you have no place to turn for answers when you're trying to establish what data is valid or invalid. So as you begin to implement smart data-validation strategies in your own organization, start first by establishing database ownership and stewardship because they lay the foundation for all the design and implementation topics that follow.
When you keep the concepts of data definition, ownership, and stewardship in mind, defining data validation isn't as straightforward as you might expect. The definition of what constitutes "valid data" varies from one organization to another. For example, a financial organization such as a bank, credit card company, or brokerage might have stricter validation rules for its data than a government agency such as the United States Census Bureau. Furthermore, the definition of "valid data" often varies within an organization from one system to another. In a bank, the transactional checking and savings account systems might require a state of "absolute correctness," in which account balances are accurate to the penny and available to customers 24 × 7. Within the same banking organization, analytical warehousing systems might be more tolerant because account-balance information for sales pipeline analysis is required only weekly and reflects average daily balances rounded to the nearest $1,000.
Defining data validity is often a case of defining a reasonable tolerance level rather than an absolute rule for correctness. Therefore, validity levels aren't one-size-fits-all. Instead, each organization must create a validity definition that meets its particular needs and the demands of the system under consideration. I give some examples of determining "absolute correctness" in a discussion about maintaining domain integrity later in this article. Unfortunately, examples of "validation by trending" are beyond this article's scope. Now, however, we must dig deeper into the idea of data definition.
The First Step: Data Definition
As I mentioned, the data's definition forms the basis for any rules you use for subsequent validation. In other words, you can't validate what you can't define. So the first step in establishing smart data validation is to acquire and verify definitions for the data.
In a perfect world, the definitions are already captured and published as part of your data repository's dictionary or as part of the system's requirements and specification documentation. However, in many organizations, the definitions are often undocumented, buried deep within the code base, and long forgotten by a staff that has since moved on in support of other systems. If you're in such a situation, consider starting a dictionary by defining the following attributes for the data entities and the items within each entity: Name, Type (Entity or Attribute), Description, Owner, Example Values, Valid Values, Uniqueness, an indicator of whether the value is optional or required, and a description of any known constraints. In addition, try to capture any data relationships. For examples of what the definitions of a Customer, Account, and an attribute of each in a banking organization's account management system might look like, see Web Figure 1.
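If your organization doesn't yet have a formal repository, even a simple table can get the dictionary started. The following sketch shows one possible shape for such a table; the table name, column names, data types, and sizes are illustrative assumptions rather than the structure shown in Web Figure 1.

-- A hypothetical dictionary table for capturing the attributes listed above.
-- Names, sizes, and the entry_type check are illustrative assumptions.
CREATE TABLE Data_Dictionary (
    entry_id          int IDENTITY(1,1) NOT NULL
        CONSTRAINT data_dictionary_pk PRIMARY KEY,
    name              varchar(128)  NOT NULL,   -- e.g., 'Customer' or 'Customer.last_name'
    entry_type        varchar(10)   NOT NULL
        CONSTRAINT data_dictionary_type_ck CHECK (entry_type IN ('Entity', 'Attribute')),
    description       varchar(1000) NOT NULL,
    owner_name        varchar(128)  NOT NULL,   -- the business owner accountable for the data
    example_values    varchar(500)  NULL,
    valid_values      varchar(500)  NULL,
    is_unique         bit           NOT NULL DEFAULT 0,
    is_required       bit           NOT NULL DEFAULT 0,
    known_constraints varchar(1000) NULL        -- constraints and relationships to other data
);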
After establishing data definitions, the next step in applying smart data-validation strategies is moving from requirements to design. You need to use data-modeling techniques to maintain the validity of data and to incorporate appropriate metric-gathering functionality into the system's various operational processes. These processes might include Extraction, Transformation, and Loading (ETL) designs that measure changes in data volume, as well as transactional operations that capture metrics to assist with validation through trending.
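As a simple illustration of the metric-gathering idea, the following sketch logs the row count of each load step to a metrics table so that unexpected swings in volume can be spotted through trending. The table name, its columns, and the use of @@ROWCOUNT are assumptions for illustration only.

-- Hypothetical metric-gathering table populated by ETL processes; names are assumptions.
CREATE TABLE Load_Metrics (
    load_id     int IDENTITY(1,1) NOT NULL
        CONSTRAINT load_metrics_pk PRIMARY KEY,
    table_name  sysname  NOT NULL,
    load_date   datetime NOT NULL DEFAULT (GETDATE()),
    rows_loaded int      NOT NULL
);

-- Placed immediately after the statement that performs a load, this records how
-- many rows that statement affected.
INSERT INTO Load_Metrics (table_name, rows_loaded)
SELECT 'Account', @@ROWCOUNT;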
Leveraging the Data Model
You can't ignore the effect a data model has on maintaining valid data. Poor modeling techniques and the failure to apply best practices to a model's design are the root cause of most validation and integrity problems. In her SQL by Design column, Michelle Poolet explores many concepts that are central to leveraging the data model to maintain data integrity and validity. In "The Four Integrities" (November 1999, http://www.sqlmag.com, InstantDoc ID 6200), Poolet explains how to maintain entity, referential, domain, and business integrity.
Maintaining entity integrity means validating the data within a row. Primary keys and unique constraints let developers uniquely identify each row in a table, thereby eliminating duplicates. Take a look at Figure 1, which shows the data model for three tables: Account_Type, Customer, and Account. The complete DDL code that creates these tables is available in Web Listing 1. The code snippet in Listing 1 shows that the entity Account_Type contains a primary key, account_type_id, which I implemented by using a primary key constraint. I also defined an alternate key on the description column by using a unique constraint. This design ensures unique rows in two ways: Rows are differentiated by account_type_id and by description. This implementation provides the advantage of ensuring that account types are unique by name, and it improves performance and minimizes storage by using the account_type_id to enforce any potential foreign key relationships.
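Because the complete DDL appears only in Web Listing 1, the following is a minimal sketch of the Account_Type entity as just described; the column sizes and any names beyond those mentioned in the text are assumptions.

-- Entity integrity on Account_Type: a primary key on account_type_id and an
-- alternate key (unique constraint) on description. Column sizes are assumed.
CREATE TABLE Account_Type (
    account_type_id tinyint     NOT NULL
        CONSTRAINT account_type_pk PRIMARY KEY,
    description     varchar(50) NOT NULL
        CONSTRAINT account_type_description_uq UNIQUE
);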
Another example of entity integrity involves using a table-level check constraint to validate or compare multiple columns in a row. For example, at callout C in Listing 2, the account_balance_ck constraint in the Account table ensures that for a row to be valid, the sum of opening_balance plus current_balance must be greater than or equal to zero.
Foreign key constraints aid developers in enforcing referential integrity. As you can see at callout D in Listing 2, a foreign key constraint enforces referential integrity on the Account.account_type_id column. This constraint ensures that all values in the child column have a single corresponding value in the parent column.
Several options exist for enforcing domain integrity. Column-level check constraints enforce domain integrity by limiting a column's value to a certain range. For example, the account_opening_balance_ck constraint on the Account table at callout B enforces that the opening_balance is always greater than or equal to zero. Likewise, combining a User Defined Type (UDT) with check constraints limits any column defined by the UDT to the given domain of values. In addition, you can use foreign key constraints to enforce domain integrity by limiting a column to the defined set of values in the parent table.
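The following sketch pulls these constraints together on an Account table consistent with the descriptions of callouts B, C, and D. The customer_id column, the money data type, and the omission of the foreign key to Customer are simplifying assumptions; this is not a reproduction of the article's Listing 2.

-- A sketch of the Account table consistent with the constraints described above.
CREATE TABLE Account (
    account_id      int     NOT NULL
        CONSTRAINT account_pk PRIMARY KEY,
    customer_id     int     NOT NULL,          -- FK to Customer omitted in this sketch
    account_type_id tinyint NOT NULL
        CONSTRAINT account_account_type_fk     -- referential/domain integrity (callout D)
            REFERENCES Account_Type (account_type_id),
    opening_balance money   NOT NULL
        CONSTRAINT account_opening_balance_ck  -- column-level domain check (callout B)
            CHECK (opening_balance >= 0),
    current_balance money   NOT NULL,
    CONSTRAINT account_balance_ck              -- table-level entity check (callout C)
        CHECK (opening_balance + current_balance >= 0)
);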
Choosing between a check constraint and a foreign key constraint to enforce domain integrity can be tough. Although both move you closer to smart data validation, both impose performance and maintenance overhead related to their implementations. Carefully weigh which might be best for your particular situation. As a general rule of thumb, use a check constraint when the number of values is small and static, and use foreign key constraints through referential integrity when the number of values is larger and the values tend to be dynamic.
Within the realm of smart data validation, triggers can offer a means of enforcing business integrity (the idea of enforcing complex rules across rows and tables) and any of the other three integrity types already discussed. Although poorly written triggers are often at the heart of many performance problems, triggers can be a good tool for validating data. Just use them with care.
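As an illustration only, the following trigger sketch enforces a hypothetical business rule: a customer's combined current balance across all accounts can't drop below zero. The rule, the trigger name, and the reliance on a customer_id column are assumptions, not part of the article's listings.

-- Hypothetical business-integrity rule: a customer's combined current balance
-- across all accounts must not be negative. Rule and names are assumptions.
CREATE TRIGGER account_customer_balance_tr
ON Account
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS (
        SELECT a.customer_id
        FROM Account AS a
        WHERE a.customer_id IN (SELECT customer_id FROM inserted)
        GROUP BY a.customer_id
        HAVING SUM(a.current_balance) < 0
    )
    BEGIN
        RAISERROR('A customer''s combined account balance cannot be negative.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;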
Normalization
In considering the data model's effects on smart data validation, we can't overlook the role of normalization—or more accurately, the effect of an incorrect normalization level on maintaining data integrity. For insight into this idea, we again look to Michelle Poolet for advice. Poolet's article "Why You Need Data Normalization" (March 1999, InstantDoc ID 4887) explains four goals of normalization, two of which relate directly to our concept of smart data validation. The concepts Poolet covers include "... minimizing the amount of duplicate data stored in a database" and "organizing the data such that, when you modify it, you make the change only in one place."
Similar to the importance of enforcing data integrity through modeling, correctly normalizing data also has a dramatic effect on data quality. For example, data duplication increases the cost of maintenance and the corresponding risk of introducing inconsistencies. Not all data models must be in third or fifth normal form, but as with the other concepts in data validation, you need to carefully consider the effect of a particular normalization level on maintaining data integrity and performance.
Nulls and Data Types
A data model assists in ensuring smart data validation in two additional ways. First, a data model can help you decide whether an attribute is required. If the attribute is required, then you can force the schema to reject NULL values by setting the attribute to NOT NULL. Likewise, you can improve data validation by using constrained data types, which force data to be within a set range of values. For example, instead of using an integer to represent a two- or three-state value (e.g., "null", "yes", "no"), consider using a bit because, by definition, a bit allows only three possible values: 1, 0, and NULL. As another example, given that SQL Server represents all date values as date + time by means of the datetime data type, in situations in which you need to guarantee that users are accessing "date-only" values, you can expose a standardized date, rounded to 12:00 A.M., by using computed columns instead of requiring users to remember that each query referencing the date value must manually account for the time portion as well. The opening_date and opening_datetime attributes in the Account table illustrate this technique, as the code at callout A in Listing 2 shows. Using computed columns lets you capture exactly when an account was opened (including the time of day) while exposing a constrained version of the same value, opening_date, that lets users write simple queries without having to worry about inadvertently excluding accounts that don't happen to have been opened at exact day boundaries.
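The listing itself isn't reproduced here, but the technique looks roughly like the following sketch; the default constraint and the exact rounding expression are assumptions.

-- Sketch of the computed-column technique: store the full timestamp in
-- opening_datetime and expose a date-only value rounded down to 12:00 A.M.
ALTER TABLE Account ADD
    opening_datetime datetime NOT NULL
        CONSTRAINT account_opening_datetime_df DEFAULT (GETDATE());

ALTER TABLE Account ADD
    opening_date AS DATEADD(day, DATEDIFF(day, 0, opening_datetime), 0);

-- Users can then filter on whole days without accounting for the time portion:
-- SELECT account_id FROM Account WHERE opening_date = '20060901';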
Sometimes, even with valid data, you can use small data model changes to increase reporting consistency and accuracy. This strategy focuses on incorporating a "single version of the truth" into our data models by moving complex, frequently used calculations out of our reports' code and into a persisted state within the data model. Leveraging a "single version of the truth" is most often associated with business intelligence solutions such as data warehouses and data marts. However, the technique is also applicable in transactional systems that use calculations such as the derivation of age from a date of birth or a more complex calculation such as the quarter-to-date average net assets within a financial system. Applying this technique often comes at a high price because of the increased performance cost of maintaining the calculated value. However, avoiding the frustration that occurs when multiple people arrive at different results for a calculation involving the same set of values often far outweighs any added performance effect.
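As a hedged sketch of the idea, assuming a simplified and entirely hypothetical available_balance calculation, a persisted computed column keeps the calculation in one place so that every report reads the same value.

-- Hypothetical "single version of the truth": persist a frequently used
-- calculation in the model rather than repeating it in report code. The
-- available_balance formula is a placeholder assumption, not a rule from the article.
ALTER TABLE Account ADD
    available_balance AS (current_balance - 25.00) PERSISTED;  -- e.g., balance less a fixed hold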
Handling the Trash
A final rule to remember as you explore smart data validation is the often-overused but ever-true phrase "Garbage in, garbage out." Systems today do a much better job than they used to of minimizing free-form text through the use of lookup tables and edit controls for validating data that users enter in GUIs. However, users are increasingly demanding the ability to accept and capture data through direct interfaces with systems outside the database (e.g., through Web services or third-party applications). The idea of applying validation rules to each of these new entry points has not been as widely accepted. Blindly accepting data from an external system potentially reduces data quality by putting control of your data's integrity in someone else's hands. Taking appropriate precautions, such as staging and screening incoming data to ensure it adheres to the same validation rules used by the rest of the system, is essential before integrating new data with existing data sets. Similarly, interface validation must also ensure a complete data transmission, meaning the receipt of all expected rows. Failing to validate completeness introduces the risk of consistency problems as well as the potential understatement of values.
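The following sketch illustrates the staging-and-screening idea under assumed names: an Account_Stage table receives the raw transmission, a sender-supplied row count is checked for completeness, and only rows that satisfy the base table's rules are loaded.

-- Staging and screening externally supplied data (names and checks are assumptions).
CREATE TABLE Account_Stage (
    account_id      int     NULL,   -- accept raw values as-is; screen before loading
    customer_id     int     NULL,
    account_type_id tinyint NULL,
    opening_balance money   NULL,
    current_balance money   NULL
);

-- 1. Completeness: confirm that every expected row arrived before loading anything.
DECLARE @expected_rows int;
SET @expected_rows = 1000;   -- row count reported by the sender (assumed)
IF (SELECT COUNT(*) FROM Account_Stage) <> @expected_rows
BEGIN
    RAISERROR('Incomplete transmission: row count mismatch.', 16, 1);
    RETURN;
END

-- 2. Screening: load only rows that satisfy the same rules the Account table enforces.
INSERT INTO Account (account_id, customer_id, account_type_id, opening_balance, current_balance)
SELECT s.account_id, s.customer_id, s.account_type_id, s.opening_balance, s.current_balance
FROM Account_Stage AS s
WHERE s.account_id IS NOT NULL
  AND s.opening_balance >= 0
  AND s.opening_balance + s.current_balance >= 0
  AND EXISTS (SELECT 1 FROM Account_Type AS t
              WHERE t.account_type_id = s.account_type_id);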
The Challenge
By looking at the topic of smart data validation, we've explored several ways a system leverages many of its own physical constructs to increase the validity and value of data. So, are your organization's systems doing all they can to maintain the highest degree of data quality? If so, then congratulations! If not, then I challenge you to take the next step to increase your data's validity by applying some of the concepts presented here. Then, the next time you describe the data in your organization, your description might even include a characterization of its quality as well.