Leveling the Load
Five million records can take a long time to load into a data warehouse’s Orders table. But our readers knew what steps Frank should take to streamline load time, speed up query execution, and simplify archiving the data.
March 11, 2002
Congratulations to Kavil S. Subramanian, chief architect at Alpharetta, Georgia–based Alogent, and to Daniele Pasian, engineer and technical manager at Info Studi Vicenza in Vicenza, Italy. Kavil won the first prize of $100 for the best solution to the March Reader Challenge, "Leveling the Load." Daniele won the second prize of $50. Honorable mention goes to Ronel Fernandez, who was the first-place winner in our February Reader Challenge. Here’s a recap of the problem and the solution to the March Reader Challenge.
Problem
Frank is the data warehouse architect for a company that hosts its databases on SQL Server 2000 and 7.0. Every week, the company imports between 1 million and 5 million orders into the data warehouse’s Orders table. The weekly import contains orders placed only during that week, but the Orders table contains an aggregate of data from many weekly imports. The following statement creates the Orders table and its relevant columns:
CREATE TABLE Orders (
  OrderID int NOT NULL PRIMARY KEY CLUSTERED,
  CustomerID int NOT NULL REFERENCES Customers,
  OrderDate datetime NOT NULL
)
CREATE NONCLUSTERED INDEX Orders_CustomerID ON Orders ( CustomerID )
Frank is particularly interested in reducing the time required to load the orders data, increasing the efficiency of the queries that access the table, and streamlining data archival. The queries that he wants to optimize typically access several weeks of data. Assuming that Frank has an efficient loading mechanism that uses the SQL Server BULK INSERT API, how can he redesign the table to meet his objectives?
Solution
Frank can achieve his goals by using the partitioned views feature in SQL Server 2000 and 7.0 to partition the table into weekly segments. To perform the partitioning, he first needs to create a table every week that looks exactly like the primary Orders table. Then, in each of the duplicate Orders tables (partitions), he adds a CHECK constraint on the OrderDate column that restricts the data to a specific week. The following statements create duplicate Orders tables for the January 6, 2001, and January 13, 2001, data imports:
-- Week of "2001-01-06"
CREATE TABLE Orders_W_20010106 (
  OrderID int NOT NULL PRIMARY KEY CLUSTERED,
  CustomerID int NOT NULL REFERENCES Customers,
  OrderDate datetime NOT NULL
    CHECK( OrderDate >= '2001-1-6' AND OrderDate < DATEADD( day, 7, '2001-1-6' ) )
)
CREATE NONCLUSTERED INDEX Orders_W_20010106_CustomerID ON Orders_W_20010106 ( CustomerID )

-- Week of "2001-01-13"
CREATE TABLE Orders_W_20010113 (
  OrderID int NOT NULL PRIMARY KEY CLUSTERED,
  CustomerID int NOT NULL REFERENCES Customers,
  OrderDate datetime NOT NULL
    CHECK( OrderDate >= '2001-1-13' AND OrderDate < DATEADD( day, 7, '2001-1-13' ) )
)
CREATE NONCLUSTERED INDEX Orders_W_20010113_CustomerID ON Orders_W_20010113 ( CustomerID )
Next, Frank can create the following local partitioned view on these duplicate tables:
CREATE VIEW Orders
AS
SELECT o1.OrderID, o1.CustomerID, o1.OrderDate
FROM Orders_W_20010106 AS o1
UNION ALL
SELECT o2.OrderID, o2.CustomerID, o2.OrderDate
FROM Orders_W_20010113 AS o2
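To see how the constraints pay off at query time, consider a query like the following. (This example is illustrative and isn’t from the original challenge; the date range is hypothetical.) Because the WHERE clause contradicts the CHECK constraint on Orders_W_20010106, the optimizer can read only the Orders_W_20010113 partition:

```sql
-- Hypothetical example: the OrderDate filter lets the optimizer
-- eliminate every partition whose CHECK constraint rules it out,
-- so only Orders_W_20010113 is scanned.
SELECT o.CustomerID, COUNT(*) AS OrderCount
FROM Orders AS o
WHERE o.OrderDate >= '2001-1-13'
  AND o.OrderDate < '2001-1-20'
GROUP BY o.CustomerID
```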
After Frank uses BULK INSERT to load the data through this view, SQL Server automatically routes the rows into the correct partition according to the defined CHECK constraints. Using partitioned views lets Frank pour data directly into an empty table every week, thereby reducing load time and achieving optimal performance. Partitioning also speeds up the queries in two ways. First, because the query optimizer can use the CHECK constraint on the OrderDate column, it can eliminate irrelevant partitions from the execution plan. Second, the queries can read smaller tables instead of one large table. Thus, the partitioned views feature lets Frank effectively distribute data into smaller tables, speed up his queries, and create a convenient framework for archiving data.
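The weekly maintenance that makes archiving convenient can be sketched as follows. (This step isn’t shown in the original article; the table name, file path, and week are hypothetical, and the ALTER VIEW lists only two weeks for brevity.)

```sql
-- Hypothetical rollover for the week of 2001-01-20.
-- 1. Create the new, empty partition with its CHECK constraint.
CREATE TABLE Orders_W_20010120 (
  OrderID int NOT NULL PRIMARY KEY CLUSTERED,
  CustomerID int NOT NULL REFERENCES Customers,
  OrderDate datetime NOT NULL
    CHECK( OrderDate >= '2001-1-20' AND OrderDate < DATEADD( day, 7, '2001-1-20' ) )
)
-- 2. Bulk load the week's file straight into the empty table
--    (loading an empty table is what keeps load time low).
BULK INSERT Orders_W_20010120 FROM 'c:\loads\orders_20010120.dat'
-- 3. Redefine the view to add the new partition and drop any week
--    that has aged out; the dropped week's table can then be
--    archived or removed without touching current data.
ALTER VIEW Orders
AS
SELECT o1.OrderID, o1.CustomerID, o1.OrderDate
FROM Orders_W_20010113 AS o1
UNION ALL
SELECT o2.OrderID, o2.CustomerID, o2.OrderDate
FROM Orders_W_20010120 AS o2
```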
Now, test your SQL Server savvy in the April Reader Challenge, "Returning the Rows" (below). Submit your solution in an email message to [email protected] by March 20. SQL Server MVP Umachandar Jayachandran, a SQL Server Magazine technical editor, will evaluate the responses. We’ll announce the winners in an upcoming SQL Server Magazine UPDATE. The first-place winner will receive $100, and the second-place winner will receive $50.
Here’s the challenge: Alex is the programmer for a company that uses SQL Server 2000, 7.0, and 6.5. To find the customers who bought all the products of a specific group, he needs to write a query that asks whether rows of a particular type are the only rows in a table. The code has to work on all three SQL Server releases that the company is running. The following statement, which includes sample data, creates the table he’s querying:
CREATE TABLE #Test (
  i int,
  c varchar(10),
  PRIMARY KEY ( i, c )
)
INSERT INTO #Test VALUES( 1, 'Value #1' )
INSERT INTO #Test VALUES( 1, 'Value #2' )
INSERT INTO #Test VALUES( 1, 'Value #3' )
INSERT INTO #Test VALUES( 2, 'Value #1' )
INSERT INTO #Test VALUES( 2, 'Value #2' )
INSERT INTO #Test VALUES( 3, 'Value #1' )
Given a set of values for column c, Alex needs to find each value of i whose rows contain exactly those values of c, no more and no fewer. For example, if Value #1 and Value #2 are the specified values for c, the query should return only the rows where i equals 2. Similarly, if Value #1 is the only specified value, the query should return only the row where i equals 3. And if Value #1, Value #2, and Value #3 are the specified values, the query should return only the rows where i equals 1. Help Alex write an efficient query that returns the rows he seeks.