Data Warehousing: Horizontally Partitioning the Fact Table
A horizontally partitioned fact table’s rows are divided into discreet, non-overlapping sections. These sections are defined by a range of values in the FOR VALUES clause. To horizontally partition a fact table you need to create a partition function, a partition scheme, and a partitioned table.
March 27, 2008
Fact tables tend to grow very large, very fast. Sometimes, a fact table can become so large that it can be difficult to manage, and SQL queries can be adversely affected by the sheer size of the data set involved. However, you can horizontally partition a fact table to make it easier to work with.
Related: Optimizing Queries for Fact Tables
A horizontally partitioned table is one in which the rows are divided into discreet, non-overlapping sections. Each section in a horizontally partitioned table is defined by a range of values, such as by date, geographic area, or customers’ last names. (Note: You might consider storing each section on a different hard disk to enhance query performance.) Although it isn’t difficult to horizontally partition a table, it does require some advance planning because you’re dealing with large amounts of data. Let’s explore when and why you’d want to horizontally partition a fact table, and how to create a partition function, a partition scheme, and a partitioned table using SQL Server 2005’s built-in functionality.
Why Partition a Fact Table?
Large tables (i.e., tables with hundreds of millions of rows) can be difficult to manage because of their size and the amount of time it takes to do anything with them (e.g., rebuild an index). In a transactional database, the associative tables (i.e., those tables that involve the many to many—M:N—relationship) are often the tables with the most rows in the database. In dimensional modeling, a fact table is the equivalent of an associative table. Like an associative table in a transactional database, a fact table often has many more rows than its related dimensions, perhaps even as many as (# of rows in dimension 1) x (# of rows in dimension 2) … x (# of rows in dimension n) rows.
Related: Data Warehousing: Measuring the Facts
Partitioning breaks these monster tables into manageable chunks. If your maintenance time windows are shrinking or the amount of data to be processed is growing, you can partition the table and perform tasks, such as backup and restore operations or use the Database Contingency Checker (DBCC) to update table statistics, by partition instead of for the entire table. SQL Server 2005 treats the many sections of a partitioned table as a single logical entity, and the multiple partitions appear as a single table to end users.
The following are some reasons to horizontally partition a table:
You can better control where each partition is placed in storage and leverage multiple read/write heads for fast query resolution.
You can back up and restore by partition, indexes can be rebuilt and reorganized by partition, and the indexes themselves can be partitioned.
You can direct queries that include a WHERE clause that contains either the partitioning column or an indexed column to the appropriate partition for resolution.
You can reduce lock escalations and lock-management overhead because locking is limited to partitions.
You can merge or split partitions fairly easily if multiple partitions are in the same file group.
So which tables are the best candidates for horizontal partitioning? Very large tables, tables that you expect to grow very large in the near future, and tables that can be intuitively partitioned based on their business value (e.g., by fiscal year). These tables must include a column whose values are NOT NULL and that can be used to divide the rows into discreet, non-overlapping sections such as a column containing sales dates.
If your database contains a large table against which queries and updates aren’t performing the way you think they should, consider testing how partitioning might affect query performance in your environment. SQL Server 2005 is partition-aware, meaning that if slow-running queries include a WHERE clause that contains the partitioning column or the indexed column (and the index is also partitioned), only the relevant partition is accessed for query resolution. This functionality can significantly help performance.
Creating a Partition Function
To partition a table, you need to use a function that’s composed of a partitioning column and a set of boundaries. To create the partition function for the SALES fact table shown in Figure 1, you’d run the command
CREATE PARTITION FUNCTION MyPartitionFunctionLeft (datetime) AS RANGE LEFT FOR VALUES (‘1/01/2003’, ‘1/01/2005’, ‘1/01/2007)
MyPartitionFunctionLeft is the name of the partitioning function, (datetime) is the data type of the partitioning column, and RANGE LEFT stipulates how to divide up the data values that are bound by the FOR VALUES dates.
There’s two ways to set the boundaries: RANGE LEFT or RANGE RIGHT. The RANGE LEFT clause divides the data from the lowest value to the highest value (i.e., in ascending order). The RANGE RIGHT clause divides the data from the highest value to the lowest value (i.e., in descending order).
The partitioning column is often a datetime data type such as the Date_of_Event column (shown in Figure 1). Separating datetime records into non-overlapping groups is straightforward. For example, if your business rules and known operational queries indicate that partitioning the table on the date of a sales event is reasonable, then you could partition the data into two-year groupings, as I did in the previous partition function command. Partitioning RANGE LEFT divides the data into the value ranges shown in Figure 2.
If the date of a sales event was June 23, 2004, you’d find that record in partition 2 (P2). If you want to create the partition function with the RANGE RIGHT clause, you’d run the command
CREATE PARTITION FUNCTION MyPartitionFunction datetime AS RANGE RIGHT FOR VALUES (‘1/01/2003’, ‘1/01/2005’, ‘1/01/2007)
Partioning RANGE RIGHT divides the data into the valve ranges shown in Figure 3. I recommend, for the sake of consistency and for the ease of querying, that you choose one range declaration (i.e., RANGE LEFT or RANGE RIGHT) and stick with it throughout all the partitioned tables that you create in your environment.
Each range of values in a partition is restricted by boundaries that are specified in the FOR VALUES clause. Note that if you’re using datetime data types for boundary values and your company has an office in Europe, you’ll have to decide on an international standard for datetime, so that it’s uniform across your company. SQL Server assumes that us_english is the default language for the session, so if that’s not the case, you’ll want to create a user-defined function (UDF) that will convert various date formats into us_english, and reference that UDF in the FOR VALUES clause. You don’t have to use literals in the FOR VALUES clause; you can reference variables, functions, and UDFs.
Creating a Partition Scheme
Now that you’ve created a partition function, you need to create a partition scheme. The partition scheme maps partitions to various file groups, as shown in the following command:
CREATE PARTITION SCHEME MyPartitionScheme AS MyPartitionFunction TO (MyFilegroup1, MyFilegroup2, MyFilegroup3, MyFilegroup4, MyFilegroup5)
MyPartitionScheme is the name of the partitioning scheme, and MyPartitionFunction refers to the partition function. This command maps the boundary values into partitions that are then assigned to one or more file groups. Data rows with Date_of_Event datetime values prior to 1/01/03 are assigned to MyFilegroup1. Rows with datetime values greater than or equal to 1/01/03 and prior to 1/01/05 are assigned to MyFilegroup2. Rows with datetime values greater than or equal to 1/01/05 and prior to 1/01/07 are assigned to MyFilegroup3. All other rows with datetime values greater than or equal to 1/01/07 are assigned to MyFilegroup4.
For each set of boundary values (remember, boundary values are in the FOR VALUES clause of the partition function statement), there will be (# of boundary values + 1) partitions. The previous CREATE PARTITION SCHEME statement includes three boundary values and four partitions. No matter whether the partitions are created RANGE RIGHT or RANGE LEFT, there will always be (# of boundary values + 1) partitions, up to 1000 partitions per table. So why are there five file groups instead of four in this example? (Remember, the partition function statement had only three boundary values.) The fifth file group is the optional “next-used” file group. Let’s look at how this file group is used and how important it is in the partition scheme.
In the CREATE PARTITION FUNCTION example, the last boundary value is 1/01/2007, so records with a Date_of_Event value greater than or equal to the 1/01/2007 will be stored in partition 4 (P4). When 1/01/2009 rolls around, you’ll have to create a new partition to maintain the storage scheme that you’ve established. If your original CREATE PARTITION SCHEME statement didn’t include a next-used file group, you wouldn’t be able to split P4 into P4, which will hold data from 2007 through the end of 2008, and partition 5 (P5), which will hold data from 1/01/2009 and beyond. If your partitioning plan involves periodically creating new partitions to hold new data, as in this example, make sure you include the next-used file group in your CREATE PARTITION SCHEME statement.
You don’t have to assign one partition to one file group, as I did in Figure 4. Instead, you can map multiple partitions to a single file group or you can map all the partitions to one file group. However, you can’t map one partition to multiple file groups.
Creating the partition scheme might be the most important step in the partitioning process. In the future, you might want to combine the data from two adjacent partitions into one partition, add a boundary value to an existing partition, or move data from a populated partition into an empty partition. To perform these operations, you’ll want to do some advance planning and create the partition scheme so that it will support these activities. SQL Server 2005 Books Online (BOL) provides some good planning resources, including “Planning Guidelines for Partitioned Tables and Indexes."
Creating a Partitioned Table
Creating a partitioned table isn’t all that different from creating a regular table; you just have to reference the name of the partition scheme in the ON clause. To do so, run the command
CREATE TABLE SALESFact (SalesFact_KEY bigint identity (1, 1) primary not clustered NOT NULL, … all the other columns in the table, ending up with the partitioning column… , Date_of_Event datetime NOT NULL) ON MyPartitionScheme (Date_of_Event)
By specifying the name of a partition scheme, you’re indicating that this table is a partitioned table. Obviously, the partition scheme and the partition function have to exist in the database before you can create the table.
You can combine the data from multiple partitions into one partition. However, you can only merge two adjacent partitions at a time, so if you’re trying to “unpartition” a table that has multiple partitions, you’ll have to repeat this step many times. To merge two partitions, run the command
ALTER PARTITION FUNCTION MyPartitionFunction() MERGE RANGE (‘1/01/2003’)
This command will merge the partitions as follows: partition 1 (P1) will merge into P2, meaning that P2 will then contain all rows with a Date_of_Event value prior to 1/01/05. Internally (i.e., in the sys.partitions system table), the partitions will be renumbered, starting with one (not zero). P1 and P2 will become P1, P3 will become P2, and P4 will become P3. I recommend sketching out the DDL before you merge any partitions because you could potentially cripple your operations for sustained periods of time if you’re not careful when you perform these merges. In fact, Eric Hanson, Microsoft’s lead program manager for query processing on the SQL Server Relational Engine Team, recommends performing splits and merges only on empty partitions, even if that means you have to temporarily empty a partition.
Easily Manage Monster Tables
SQL Server 2005’s ability to horizontally partition a table into non-overlapping sections and place each section on a separate partition makes it easy to manage large fact tables. Because the SQL Server 2005 Enterprise and Developer Editions are partitionaware, properly-written queries will access only the rows within the relevant partition, thus running much faster than if they had to traverse the contents of the entire table.
About the Author
You May Also Like