DTS and the Data Warehouse
Learn the secrets behind DTS's powerful Data Driven Query task, which can help you track information as it slowly changes in your data warehouse.
July 23, 2002
Demystify slowly changing dimensions with the Data Driven Query task in SQL Server 2000
About 3 years ago, I participated in a data warehouse course for trainers. Stewart McCloud, one of Microsoft's Data Transformation Services (DTS) architects, gave a session about the Data Driven Query (DDQ) task in SQL Server 7.0. I was fascinated with DTS in general and the powerful DDQ task specifically. SQL Server 2000 enhanced DTS's capabilities with a wealth of new tasks, but from my experience, the DDQ task remains one of the toughest tasks for students to learn.
One of the main uses of the DDQ task is implementing a slowly changing dimensions solution in a data warehouse environment. The DDQ task can examine data retrieved from a source system and decide, at runtime, which action to take at the destination system based on the input. The phrase slowly changing dimensions describes dimension attributes that change only occasionally, in contrast to the fact table's measures, which change frequently. Examples of dimension attributes that might change slowly include a company name, an employee's supervisor, or the volume of a product's package. For more information about the theoretical aspects of slowly changing dimensions, see two books that Ralph Kimball coauthored: The Data Warehouse Toolkit (Wiley, 2002) and The Data Warehouse Lifecycle Toolkit (Wiley, 1998).
In this article, I won't cover the theoretical aspects of slowly changing dimensions in depth. Instead, I look at the technical and practical implementations of slowly changing dimensions and discuss the DDQ task and some other new DTS features in SQL Server 2000. I'm assuming that you're familiar with DTS package creation, the Transform Data task, DTS flow elements, basic ActiveX transformations, the ADO Recordset object, Visual Basic (VB) scripts, and writing T-SQL queries and stored procedures.
To handle slowly changing dimensions, you can use any of the three classic techniques, imaginatively named Type 1, Type 2, and Type 3. Handling changes to different attributes or changes to the same attribute in different scenarios might demand different techniques. As an example, let's examine how you can use different techniques to deal with a change in an attribute: a product's package volume.
Suppose a soft-drink company wants to promote a certain product. The company increases the drink's package volume by 10 percent, thus providing more liquid for the same price. Let's look at each of the three slowly changing dimensions techniques and their implications in this situation.
Type 1. Type 1 handling overwrites an old attribute with a new one. This type is usually used for correcting historical errors or when you don't care about losing historical data. If you overwrite the product's old package volume with the new one, you won't be able to correctly calculate volumes of historical product sales. This type of change is technically the simplest to implement.
Type 2. With Type 2 handling, you create a new dimension row with a new dimension key but also keep the old dimension row, with the old dimension key. Note that you'd probably want to keep two keys in your data warehouse: the application key and the dimension key. The application key is the online transactional processing (OLTP) environment's production key, which is unchanged; the dimension key is a surrogate key that correlates between rows in the fact table and rows in the dimension table. The dimension key gets a new value.
You also need to keep track of the effective date of each dimension key. You might store a range of effective dates as two values—from_date and to_date—for simpler data manipulation. Type 2 handling lets you correctly calculate volumes of liquid in historical sales, for example, but might require you to maintain several keys per product. Let's call the application key in our example productid_app and the data warehouse-specific dimension key productid_key.
Type 3. Type 3 handling keeps track of attribute changes in the same dimension row. The key remains the same as in Type 1, yet you keep track of history as in Type 2—almost. The number of changes you can track is finite. You predetermine the number of changes or versions that you want to track—let's say three—and duplicate the column that many times. You also store an effective date for each attribute. Each time a product attribute changes, you "push" to the right the existing versions of the attribute and effective dates in first in/first out (FIFO) order, storing the new values in the columns that hold the attribute's current version. This type is suitable when you want to keep track of historical changes but you care about accurately tracking only the recent changes. It has the benefit of keeping only one key per instance of an entity (as opposed to keeping several dimension keys as in Type 2) and also of maintaining a history to a certain extent. In the example, we'll track three versions of the product's package volume changes.
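To make the Type 3 "push" mechanics concrete, here's a small, self-contained T-SQL illustration. The table variable and its column names (package_prev1, effective_date_prev1, and so on) are hypothetical simplifications of the dimension table we'll build later; the refresh process in this article implements the same idea through a stored procedure.

-- Hypothetical illustration of a Type 3 "push to the right" update.
DECLARE @Products_type3 TABLE
(
  productid            int,
  productname          varchar(30),
  package              int,       -- current version
  effective_date       datetime,
  package_prev1        int,       -- previous version
  effective_date_prev1 datetime,
  package_prev2        int,       -- oldest tracked version
  effective_date_prev2 datetime
)

INSERT INTO @Products_type3 (productid, productname, package, effective_date)
  VALUES (103, 'prod3', 10, '20020101')

-- The drink's package grows by 10 percent: shift the existing version
-- one slot to the right and store the new value as the current version.
UPDATE @Products_type3
  SET package_prev2        = package_prev1,
      effective_date_prev2 = effective_date_prev1,
      package_prev1        = package,
      effective_date_prev1 = effective_date,
      package              = 11,
      effective_date       = GETDATE()
  WHERE productid = 103

SELECT * FROM @Products_type3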
A Slowly Changing Dimensions Solution
Now let's look at the practical implementation of a slowly changing dimensions solution—that is, a way to keep track of information as it changes slowly in your data warehouse. Say your source OLTP system has a Products table, and you want to implement a Products dimension table in your destination data warehouse. You also want to implement a refresh process—the process that actually makes the changes to the dimension attributes—that uses whichever of the three slowly changing dimensions techniques that best suits your needs. For demonstration purposes, we'll name the destination dimension tables Products_type1, Products_type2, and Products_type3 and implement three processes that use the three slowly changing dimensions techniques. We'll implement the refresh process through a DTS package, using the DDQ task as the package's main tool. A database called Source represents the OLTP system, and a database called Destination is the destination data warehouse. You can create these databases in the same SQL Server installation, on two instances of SQL Server on the same machine, or on separate machines.
First, you need to prepare the source OLTP system's infrastructure. Connect to the source server and run the script in Listing 1 to create the Source database and the Products table. To track the changes, you can create a log table that records all transactions issued against the Products table, including all information that the refresh process requires. You can think of the log table as a kind of custom transaction log. Later, we'll use the log table as the source for the refresh process.
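If you don't have the listing handy, a minimal sketch of the source objects might look like this; the data types and constraints are assumptions, and the actual Listing 1 may differ:

CREATE DATABASE Source
GO
USE Source
GO
CREATE TABLE Products
(
  productid   int         NOT NULL PRIMARY KEY,  -- production (application) key
  productname varchar(30) NOT NULL,
  package     int         NOT NULL               -- package volume
)
GO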
Next, making sure you're still connected to the source server, run Listing 2's script to create the Prod_log table. The lsn column maintains log serial numbers, which are automatically generated consecutive values that represent the chronological order of transactions. The lsn values will determine the order in which the refresh process processes the transactions. It's important to process the transactions in the destination system in the same order as they occurred in the source system. The log_date column stores the effective dates, and the tran_type column stores the transaction type: I for INSERT, D for DELETE, and U for UPDATE. The productid, productname, and package columns store the product's original attributes. Namechg and packagechg are bit columns that store 1 if the column they represent has changed and 0 if it hasn't.
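A sketch of a Prod_log table along these lines follows; the data types are assumptions, and the actual Listing 2 may differ:

USE Source
GO
CREATE TABLE Prod_log
(
  lsn         int         NOT NULL IDENTITY(1,1) PRIMARY KEY,  -- log serial number
  log_date    datetime    NOT NULL DEFAULT(GETDATE()),         -- effective date
  tran_type   char(1)     NOT NULL,                            -- I, U, or D
  productid   int         NOT NULL,
  productname varchar(30) NULL,
  package     int         NULL,
  namechg     bit         NULL,   -- 1 if productname changed, 0 if not
  packagechg  bit         NULL    -- 1 if package changed, 0 if not
)
GO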
Next, you create a set of triggers on the source Products table that record the transactions in the Prod_log table. The INSERT and DELETE triggers are fairly straightforward; they record all the data from the inserted and deleted tables, respectively, plus an I or a D as the tran_type in the Prod_log table. While connected to the source database, run the script that Listing 3 shows to create the INSERT and DELETE triggers. Note that SQL Server uses the IDENTITY column attribute and GETDATE() default value, respectively, to generate the lsn and log_date values. The namechg and packagechg values are irrelevant when you're inserting or deleting values, so they get NULLs.
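As a rough sketch (the trigger names are my assumptions), the INSERT and DELETE triggers could look like the following. The lsn and log_date columns are omitted from the column list so that the IDENTITY property and the GETDATE() default supply their values, and namechg and packagechg are left NULL:

CREATE TRIGGER trg_Products_insert ON Products FOR INSERT
AS
INSERT INTO Prod_log(tran_type, productid, productname, package)
  SELECT 'I', productid, productname, package
  FROM inserted
GO

CREATE TRIGGER trg_Products_delete ON Products FOR DELETE
AS
INSERT INTO Prod_log(tran_type, productid, productname, package)
  SELECT 'D', productid, productname, package
  FROM deleted
GO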
The UPDATE trigger is more complex. First, it checks whether an attempt to change the product ID occurred, and if so, it rolls back the transaction. (A well-designed OLTP system shouldn't allow a change to the product ID once it's set.) The UPDATE trigger also needs to compare the old image of the data (the deleted table) with the new image (the inserted table) to make sure that the product's name or package changed and to determine the new values of the namechg and packagechg columns. Using the UPDATE() function to determine whether those columns changed isn't a good idea for two reasons. First, your trigger should handle multirow updates, but product attributes in some rows might change while other products' attributes remain the same. Second, the UPDATE() function returns TRUE when you specify a column in an UPDATE statement's SET clause, regardless of whether the column value actually changed. You can run the script that Listing 4 shows to create an UPDATE trigger on the Products table.
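A hedged sketch of such an UPDATE trigger appears below. It joins the inserted and deleted tables on productid, so it naturally handles multirow updates, and it logs a row only when the name or package actually changed. The simple count-based check for a changed product ID is an assumption about how Listing 4 enforces the rule:

CREATE TRIGGER trg_Products_update ON Products FOR UPDATE
AS
-- Reject attempts to change the production key.
IF (SELECT COUNT(*)
    FROM inserted AS i
    JOIN deleted AS d ON i.productid = d.productid)
   < (SELECT COUNT(*) FROM inserted)
BEGIN
  RAISERROR('Changes to productid are not allowed.', 16, 1)
  ROLLBACK TRAN
  RETURN
END

-- Compare the old image (deleted) with the new image (inserted) and
-- log only rows whose name or package actually changed.
INSERT INTO Prod_log(tran_type, productid, productname, package, namechg, packagechg)
  SELECT 'U', i.productid, i.productname, i.package,
         CASE WHEN i.productname <> d.productname THEN 1 ELSE 0 END,
         CASE WHEN i.package <> d.package THEN 1 ELSE 0 END
  FROM inserted AS i
  JOIN deleted AS d ON i.productid = d.productid
  WHERE i.productname <> d.productname OR i.package <> d.package
GO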
To test your triggers, run the following INSERT statements against the Products table:
INSERT INTO Products(productid, productname, package) VALUES(101, 'prod1', 10)
INSERT INTO Products(productid, productname, package) VALUES(102, 'prod2', 10)
INSERT INTO Products(productid, productname, package) VALUES(103, 'prod3', 10)
Wait a moment to simulate a period of time passing before changes are made, or you might not be able to see the difference in the effective dates of the attribute changes. Then run the following UPDATE and DELETE statements:
UPDATE Products SET package = 20
WHERE productid = 103
DELETE FROM Products
WHERE productid = 102
You can issue the query
SELECT * FROM Prod_log ORDER BY lsn
to verify that the triggers did their job correctly. Table 1 shows the correct contents of the Prod_log table after these modifications.
At this point, you're done with the source system infrastructure. The recording process runs automatically, so you don't need to worry about it. Everything else, including the creation of the dimension tables and the refresh process, happens in the destination data warehouse and in the DTS Package Designer.
Destination Data Warehouse
Next, you need to connect to the destination data warehouse and create three versions of the Products dimension table so that you can practice the three handling techniques. After connecting to the destination server, run the script in Listing 5 to create the Destination database and three versions of the Products dimension table.
The Products_type1 table is the simplest. It holds the product ID in the productid_app column and a surrogate key in the productid_key column. Note that in a Type 1 implementation, you don't necessarily need a surrogate key, but holding such a key in your table and using it as the correlating key in the fact table is a good practice for a couple of reasons. First, you accommodate future changes that might require a more complex type of handling. Second, the application key, which holds the original key from the source system, might be large, whereas the surrogate dimension key is of an integer data type, which is small. Keep in mind that the dimension key will be used as a foreign key in all rows of the fact table, which usually has a huge number of rows—in some systems, tens or even hundreds of millions. The second argument for using a surrogate key doesn't apply in the soft-drink scenario because the sample application key—the product ID—is also of an integer data type, but the first argument alone justifies adding a surrogate key.
Also note that uniqueness is enforced in both keys because you don't keep several versions of a product in a Type 1 implementation. The other columns in the Products_type1 table are productname, package, and discontinued—a bit column that shows 0 (false) if the product is active in the production system and 1 (true) if it was deleted from the production system. For suggestions about tracking discontinued products, see the sidebar "Discontinued Products."
The Products_type2 table holds the effective_date and to_date columns in addition to the columns that Products_type1 holds. Uniqueness isn't enforced on the productid_app column because, by definition, in a Type 2 implementation you can have several versions of products with one application key and different surrogate keys.
The Products_type3 table holds three versions of a product in the same row. Notice that the package and effective date columns are duplicated three times. You don't need separate effective_date and to_date columns because one version's "effective date" is the previous version's "to date." A Type 3 implementation has one row for each product, so you should enforce the uniqueness of both the production key and the surrogate key.
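Based on these descriptions, hedged sketches of the three dimension tables might look as follows. The exact data types, the names of the duplicated Type 3 columns, and the constraint details are assumptions; the actual Listing 5 may differ:

CREATE DATABASE Destination
GO
USE Destination
GO
CREATE TABLE Products_type1
(
  productid_key int         NOT NULL IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
  productid_app int         NOT NULL UNIQUE,                     -- application (OLTP) key
  productname   varchar(30) NOT NULL,
  package       int         NOT NULL,
  discontinued  bit         NOT NULL DEFAULT(0)
)
GO
CREATE TABLE Products_type2
(
  productid_key  int         NOT NULL IDENTITY(1,1) PRIMARY KEY,
  productid_app  int         NOT NULL,        -- not unique: several versions per product
  productname    varchar(30) NOT NULL,
  package        int         NOT NULL,
  effective_date datetime    NOT NULL,
  to_date        datetime    NULL,
  discontinued   bit         NOT NULL DEFAULT(0)
)
GO
CREATE TABLE Products_type3
(
  productid_key        int         NOT NULL IDENTITY(1,1) PRIMARY KEY,
  productid_app        int         NOT NULL UNIQUE,
  productname          varchar(30) NOT NULL,
  package              int         NOT NULL,   -- current version
  effective_date       datetime    NOT NULL,
  package_prev1        int         NULL,       -- previous version
  effective_date_prev1 datetime    NULL,
  package_prev2        int         NULL,       -- oldest tracked version
  effective_date_prev2 datetime    NULL,
  discontinued         bit         NOT NULL DEFAULT(0)
)
GO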
At this point, you're done implementing the infrastructure in the destination data warehouse. You can now move to the main task: creating the DTS package that implements the refresh process.
The Refresh Process
In the refresh package, you'll use several important DTS features, some of which are new to SQL Server 2000. You create a package that looks like Figure 1 by following the process I describe next.
Set the source and destination connection properties. Open the Properties dialog box of the connection called Source OLTP, then set the Server property to your source server and the Database property to the Source database. Open the Properties window of the connection called Destination DW, and set the Server property to your destination server and the Database property to the Destination database.
Make all the data-modification tasks part of one transaction. Making all tasks that modify data part of the same transaction ensures the atomicity of the refresh process. If a step fails, the package will roll back all activity. Only if the last step—the Execute SQL task called Clear Handled Trans—completes successfully will the transaction commit. Right-click each task (except the Connections and the Dynamic Properties tasks) and choose Workflow, Workflow Properties. Click the Options tab. Make sure that Join transaction if present and Rollback transaction on failure are selected for all tasks and that Commit transaction on successful completion of this step is selected only in the Execute SQL task. Also, make sure that the Microsoft Distributed Transaction Coordinator (MS DTC) service is active. This service manages the DTS transactions.
Create a global variable called max_lsn. Next, you create a global variable to store the current maximum lsn in the Prod_log table. The refresh processes that the DDQ tasks implement will go through the Prod_log table, reading all the transactions and performing modifications at the destination dimension tables. After the refresh processes are finished, the Execute SQL task will delete the transactions from the Prod_log table. Between the time the refresh processes finish and the time the Execute SQL task performs its cleanup, new transactions can enter the Prod_log table. Storing the current maximum lsn in a global variable before the refresh processes begin lets you limit the transactions that the refresh processes handle and that the Execute SQL task deletes, thus avoiding deletion of unprocessed transactions. To create the variable, right-click an empty area in the package, choose Package Properties, then click the Global Variables tab. Create a global variable called max_lsn that has an integer data type and a zero default value.
Use the Dynamic Properties task to store the maximum lsn in the global variable. The Dynamic Properties task is a cool new DTS feature in SQL Server 2000 that lets you dynamically change properties in the package during its run. You can use an .ini file, a query, a global variable, an environment variable, a constant, or a data file as the source for the changed property value. In this case, we'll use a query to retrieve the maximum lsn from Prod_log and store it in the max_lsn global variable. Open the Properties dialog box of the Dynamic Properties task called Store Max LSN in Var and click Add. Choose the max_lsn global variable from the left pane and click Set. Choose Query in the Source property, choose Source OLTP in the Connection property, and type the following query in the Query box:
SELECT MAX(lsn) FROM Prod_log
Click OK twice.
Delete the processed transactions from Prod_log. The next step is to write, in the Execute SQL task called Clear Handled Trans, a DELETE query to delete all the transactions that the refresh processes handled. Open the Properties dialog box of the Execute SQL task and make sure that Source OLTP is selected in the Existing Connection property. Enter the following query in the SQL statement box:
DELETE FROM Prod_log
WHERE lsn <= ?
Another valuable new DTS feature in SQL Server 2000 is the ability to embed parameters in your queries. You specify a question mark (?), representing a parameter, in the query and map the question mark to a global variable. DTS replaces the question mark with the global variable's value at runtime. To map the question mark to the max_lsn global variable, select Parameters and make sure that max_lsn is mapped to Parameter 1. Click OK twice.
Set the DDQ tasks' properties to implement the refresh processes. Before you start implementing the refresh processes, open the Properties dialog box of one of the DDQ tasks to familiarize yourself with the task. You'll find it's similar in several ways to the Transform Data task (aka the Data Pump). DDQ lets you use a source and a destination like the Transform Data task does. It also lets you use an ActiveX transformation that manipulates the source data in the DTSSource recordset before setting the data to the columns in the DTSDestination recordset.
However, the DDQ task adds another important feature. It lets you dynamically determine the type of activity to be performed at the destination based on an evaluation that the ActiveX script performs on the source row, such as checking the contents of the tran_type column in the Prod_log source table. You determine the type of activity by setting the return value of the Main() function to one of the following values: DTSTransformstat_InsertQuery, DTSTransformstat_DeleteQuery, DTSTransformstat_UpdateQuery, or DTSTransformstat_UserQuery. You write parameterized queries in the Queries tab for Insert, Delete, Update, and Select, and DTS invokes the appropriate query based on the return value of the Main() function. You map the values that you set in the DTSDestination recordset to the queries' parameters.
Try It
If the preceding explanation isn't clear yet, don't worry. Things will become clearer as you set the properties of the DDQ tasks a step at a time. Start by selecting the Source tab in the Properties dialog box of the Type1 Refresh DDQ task. Make sure that Source OLTP is selected in the Connection property, then type the following query in the SQL query box:
SELECT *
FROM Prod_log
WHERE lsn <= ?
ORDER BY lsn
As I mentioned, we'll handle all the transactions in the Prod_log table that have an lsn that's less than or equal to the maximum lsn (i.e., all transactions that existed in the Prod_log table when the Dynamic Properties task retrieved the maximum lsn value). Follow the same procedure you used in the Execute SQL task to map the query's parameter to the max_lsn global variable. Moving to the Bindings tab, make sure that Destination DW is selected in the Connection property and that the Products_type1 dimension table is selected as the destination table. Later, you'll use the same Source and Destination settings in the other two DDQ tasks, except that each will use the appropriate Products_typeN dimension table as the destination table. On the Transformations tab, create an ActiveX transformation according to the mappings that appear in Figure 2.
Now, to create the Main() function that determines which action to take based on the input, type the code that Listing 6 shows in the ActiveX Script Transformation Properties dialog box. The columns package, productname, and productid_app in the destination recordset are set to the values of the package, productname, and productid columns from the source recordset, respectively. For a Type 1 transformation, we're not interested in the rest of the columns. Also, remember that SQL Server automatically generates the surrogate key in the destination dimension table. The values in the destination recordset will be mapped to parameters in the queries you write in the Queries tab.
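If you need a starting point, a minimal VBScript sketch of such a Main() function, in the spirit of Listing 6 but not necessarily identical to it, might look like this:

' Hedged sketch of a Type 1 Main() function; the actual Listing 6 may differ.
Function Main()
    DTSDestination("productid_app") = DTSSource("productid")
    DTSDestination("productname")   = DTSSource("productname")
    DTSDestination("package")       = DTSSource("package")

    ' Choose the query to invoke based on the logged transaction type.
    Select Case DTSSource("tran_type")
        Case "I"
            Main = DTSTransformstat_InsertQuery
        Case "U"
            Main = DTSTransformstat_UpdateQuery
        Case "D"
            Main = DTSTransformstat_DeleteQuery
    End Select
End Function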
The code in the Main() function uses a Select Case statement to evaluate the value of the tran_type column from the source row and determine which query type to invoke by setting the function's return value. Click OK twice, then click the Queries tab. Use the information in Table 2 to fill in the parameterized queries and map the parameters to the destination recordset's columns.
The Insert query inserts a new row into the Products_type1 table. The Update query overwrites the productname and package values in the row whose productid_app is being handled. The Delete query sets the discontinued value of the product deleted from the OLTP system to 1.
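For reference, hedged sketches of the three Type 1 queries (along the lines of Table 2) follow; each question mark maps to a destination-recordset column as indicated in the comments, and the exact queries in Table 2 may differ:

-- Insert query: ? = productid_app, ? = productname, ? = package
INSERT INTO Products_type1(productid_app, productname, package)
  VALUES(?, ?, ?)

-- Update query (Type 1 overwrite): ? = productname, ? = package, ? = productid_app
UPDATE Products_type1
  SET productname = ?, package = ?
  WHERE productid_app = ?

-- Delete query: mark the product as discontinued rather than removing the row
-- ? = productid_app
UPDATE Products_type1
  SET discontinued = 1
  WHERE productid_app = ?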
That's it—you're done implementing the refresh process for the Type 1 transformation. Now click OK and open the Type2 Refresh DDQ task's Properties dialog box. Fill in the properties of the Source and Bindings tabs with the same settings you used in the Type1 Refresh DDQ task, except use Products_type2 as the destination table. On the Transformations tab, create an ActiveX transformation according to the mappings that appear in Figure 3.
Note that with the Type 2 transformation, you're also retrieving the log_date, which will be the effective date, and the packagechg value, which determines whether to create a new dimension row. Figure 3 also shows that the namechg column is retrieved, but in this case, let's use a Type 1 handling for the product name. That way, we can always overwrite the previous product name whether it changes or not. You can omit namechg from the source columns selection. Type the code that Listing 7 shows in the ActiveX Script Transformation Properties dialog box to create the Main() function.
Notice that a value of -1 (minus one) is stored in the destination package value if the source product's package didn't change. The reason for this assignment is that the queries on the Queries tab have access only to the destination recordset (which contains the columns that appear in the Products_type2 table), not to the source recordset. This trick lets the query parameters, which must map to columns in the destination recordset, signal that the package wasn't changed. For an alternative solution, see the sidebar "Dummy Bindings Table."
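A VBScript sketch of a Type 2 Main() function in the spirit of Listing 7 follows; the way it treats the NULL packagechg values that accompany inserts and deletes is an assumption:

' Hedged sketch of a Type 2 Main() function; the actual Listing 7 may differ.
Function Main()
    DTSDestination("productid_app")  = DTSSource("productid")
    DTSDestination("productname")    = DTSSource("productname")
    DTSDestination("effective_date") = DTSSource("log_date")

    ' The Queries tab sees only the destination recordset, so signal
    ' "package unchanged" with -1 instead of the real package value.
    If DTSSource("packagechg") = 0 Then
        DTSDestination("package") = -1
    Else
        DTSDestination("package") = DTSSource("package")
    End If

    Select Case DTSSource("tran_type")
        Case "I"
            Main = DTSTransformstat_InsertQuery
        Case "U"
            Main = DTSTransformstat_UpdateQuery
        Case "D"
            Main = DTSTransformstat_DeleteQuery
    End Select
End Function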
Next, run the script that Listing 8 shows in a Query Analyzer connection to the destination database to create the usp_Update_Products_type2 stored procedure, which I'll describe shortly. Move to the Queries tab and use the information that Table 3 shows to fill in the parameterized queries and map the parameters to the destination recordset's columns. The Insert query inserts into the Products_type2 table a new row that includes an effective_date column. The Delete query sets to 1 the discontinued value of the product that was deleted from the OLTP system.
The Update query has a more complex task. You implement it through the stored procedure usp_Update_Products_type2, which the code in Listing 8 created. The stored procedure first checks whether the package changed (@package <> -1). If so, the stored procedure uses an UPDATE statement to set the to_date of the most recent product version to the current effective date, then uses an INSERT statement to generate a new version of the product. If the package didn't change, the product name must have changed, so the stored procedure uses an UPDATE query to overwrite the existing product name in all product versions. Remember that we're using Type 1 handling for all product names in this scenario.
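As a reference point, a sketch of such a stored procedure might look like the following; the parameter names and the use of a NULL to_date to identify the current version are assumptions, and the actual Listing 8 may differ:

CREATE PROC usp_Update_Products_type2
  @productid_app  int,
  @productname    varchar(30),
  @package        int,
  @effective_date datetime
AS
IF @package <> -1
BEGIN
  -- The package changed: close the current version...
  UPDATE Products_type2
    SET to_date = @effective_date
    WHERE productid_app = @productid_app
      AND to_date IS NULL

  -- ...and open a new one.
  INSERT INTO Products_type2(productid_app, productname, package, effective_date)
    VALUES(@productid_app, @productname, @package, @effective_date)
END
ELSE
BEGIN
  -- Only the name changed: overwrite it in all versions (Type 1 handling).
  UPDATE Products_type2
    SET productname = @productname
    WHERE productid_app = @productid_app
END
GO

The Update query on the Queries tab would then presumably be just a call such as EXEC usp_Update_Products_type2 ?, ?, ?, ?, with the parameters mapped to the productid_app, productname, package, and effective_date columns of the destination recordset.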
Now you're done implementing the refresh process for the Type 2 transformation. Click OK, then open the Type3 Refresh DDQ task's Properties dialog box. Again, fill in the properties on the Source and Bindings tabs with the same settings you used in the Type1 and Type2 Refresh DDQ tasks, except that the destination table is Products_type3. On the Transformations tab, create an ActiveX transformation according to the mappings that appear in Figure 4.
In the ActiveX Script Transformation Properties dialog box, enter the same code you used in Listing 7's Type 2 transformation. Run the script from Listing 9 in a Query Analyzer connection to the destination database to create the usp_Update_Products_type3 stored procedure, which I'll describe shortly. On the Queries tab, use the information from Table 4 to fill in the parameterized queries and map the parameters to the destination recordset's columns.
Note that the Update query in the Type 3 transformation also executes a stored procedure—in this case, usp_Update_Products_type3. This stored procedure first checks whether the package changed. If so, the stored procedure uses an UPDATE statement to overwrite the productname value and, in FIFO order, "pushes" the package and effective date values and updates the new ones. If the package didn't change, an UPDATE statement overwrites only the existing product name.
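A hedged sketch of such a stored procedure follows; the column and parameter names are assumptions that match the earlier table sketches, and the actual Listing 9 may differ:

CREATE PROC usp_Update_Products_type3
  @productid_app  int,
  @productname    varchar(30),
  @package        int,
  @effective_date datetime
AS
IF @package <> -1
BEGIN
  -- The package changed: push the existing versions one slot to the right
  -- and store the new values as the current version.
  UPDATE Products_type3
    SET productname          = @productname,
        package_prev2        = package_prev1,
        effective_date_prev2 = effective_date_prev1,
        package_prev1        = package,
        effective_date_prev1 = effective_date,
        package              = @package,
        effective_date       = @effective_date
    WHERE productid_app = @productid_app
END
ELSE
BEGIN
  -- Only the name changed: just overwrite it.
  UPDATE Products_type3
    SET productname = @productname
    WHERE productid_app = @productid_app
END
GO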
Click OK, then save and run the package; make sure that all the tasks complete successfully. Each of the TypeN Refresh tasks should have processed five transactions. Execute the package again to make sure that no transactions are processed if no transactions took place after the previous run. Issue a SELECT * query against the Prod_log table in a Query Analyzer window to verify that the Prod_log table is empty. Now issue SELECT * queries against the three dimension tables and examine their contents. Tables 5, 6, and 7 show the contents of the three dimension tables at this point. You can see that Type 1 changes overwrote original values, Type 2 changes added dimension rows, and Type 3 changes pushed attributes to the right.
Next, issue a couple more modifications against the Products table:
INSERT INTO Products(productid, productname, package) VALUES(104, 'prod4', 50)
UPDATE Products SET package = 40
WHERE productid = 103
Execute the DTS package again and reexamine the dimension tables to verify that all the refresh types ran correctly.
Final Touches
Generally, you'd create a SQL Server Agent job to run the DTS package that implements the refresh process on a nightly schedule, applying the changes that occurred in the source system to the data warehouse. Row-at-a-time processing doesn't perform well when it has to handle a very large number of rows; for those volumes, you'd need a set-based solution. Here, however, even though the refresh processes work row by row, the package should complete its run quickly because each nightly run handles a relatively small number of rows: one day's worth of transactions.
DTS is a powerful and fascinating transformation tool, but as with any other programming tool, you—the designer and programmer—are the key to the project's success. Learning more about how DTS works can give you an edge in designing and implementing new projects.