Creating DTS Custom Transformations
Custom transformations in DTS offer significant gains in maintainability, functionality, and performance. Here's an example showing how you can use a Visual C++ extension to build a custom transformation that generates XML documents from database tables.
Getting fast, functional performance is as easy as C++
Whether you use wizards to develop a simple package or you need the flexibility of a custom package, Data Transformation Services (DTS) lets you tailor solutions to the situation. In "Constructing DTS Custom Tasks" (September 1999), we examined how custom tasks increase a package's execution speed and help reduce its complexity by providing a handy container for utility tasks. This month, we expand on the extensibility theme by examining the role and development of custom transformations.
Like custom tasks, DTS custom transformations offer a significant increase in execution speed and give you complete control over the transformation logic. Using an extension to Visual C++ that we created (and that you can reuse), we build a custom transformation.
In this example, we use the Active Template Library to build a transformation that generates Extensible Markup Language (XML) documents from database tables. (XML is an important technology for data interoperability, and its presence is growing in all aspects of distributed computing.)
What Is a Transformation?
Before you build a custom transformation, you need to understand transformations. A transformation is the process of modifying data after its extraction and before it reaches its target storage destination. Note that this definition doesn't address where the data originates, where it ends up, or how the data travels. Instead, this definition focuses on how the data is modified or transformed during its travel between the data stores. This distinction is important, because it directly determines what you can and can't do with your custom transformation.
DTS transformations take place in the Data Pump, which Figure 1 illustrates. The Data Pump is a multithreaded, high-speed, in-process COM server that programmatically exposes the data stream for direct manipulation. The Data Pump hides the complexity of data movement, letting developers focus on manipulating row sets. We talk more about the Data Pump and its role in the transformation process later in this article.
By default, the Data Pump supports only two types of transformations: a Copy Column transformation and an ActiveX Script transformation. In April and May, we developed packages with the DTS Package Designer and programmatically with Visual Basic (VB) to demonstrate ActiveX scripting and the Copy Column transformations. In some situations, though, these two types aren't flexible or robust enough to handle the complexity of your transformation. To address this limitation, custom transformations give you a mechanism to extend these basic types by letting you create new transformations in C++.
Why Custom Transformations?
To discover where and when custom transformations might fit in your solution, let's examine their maintainability, functionality, and performance. All three attributes are interrelated, but each has a distinct impact on the decision process.
Maintainability. You implement custom transformations only in C or C++. Microsoft's reason for specifying these languages is speed: because transformations are processed one row at a time, the overhead of calling an automation interface once for each row would be prohibitive in any other language. Therefore, Microsoft restricted custom transformation development to the Data Pump's native OLE DB interfaces as defined in dtspump.h. Because many companies don't have C or C++ developers, this restriction often eliminates custom transformations from any proposed solution.
Although you need the right staff to maintain custom transformation code, you gain significant maintenance advantages when you use C++. For example, complicated algorithms are often easier to express in C++ than in script. Additionally, you have a much greater opportunity to reuse C++ code than VBScript.
Functionality. Of the three categories, functionality is the strongest reason for using custom transformations. Although scripting languages have evolved into handy tools, they don't begin to match the functionality that C and C++ offer. Compare VBScript to C++ as an example. VBScript has a limited number of intrinsic functions, can access COM objects only through automation interfaces, and supports only the variant data type. Most notably, VBScript can't call external DLLs (including the Windows APIs).
Performance. Your choice of development tool and language directly affects the performance of your transformation. However, the component interaction within a task, and consequently within a transformation, most directly influences performance. The best way to illustrate this point is to examine the layers of abstraction, which Figure 2 shows, that are involved in implementing a package. All packages perform their work through tasks. Transformation tasks (Data Pump tasks or Data Driven Query tasks) call the Data Pump for processing. The Data Pump calls a transformation to perform the data manipulation. If an ActiveX script performs the transformation, the transformation must pass control to the scripting language.
At the Data Pump level and above (tasks, connections, etc.), each component generally is called only once within a package. However, the Data Pump calls the transformation row by row. That is, all objects or interfaces in the transformation code need to be instantiated each time a row is processed. As you might guess, the performance impact of using an ActiveX Script to process a large number of rows can be substantial. In addition, certain scripting functions, such as string manipulation, are inherently slow. Any transformation that performs many string operations (as when data is copied from source to destination columns) is very slow if you write it in script. This slowdown is further magnified if the transformation moves a large amount of data; in data-warehousing scenarios, transformations commonly move several hundred thousand rows.
Building a Custom Transformation
Consider the following example, which demonstrates how to use a custom transformation. On a recurring basis, you need to pull data from an external data source and use XML to display that data in Internet Explorer (IE) 5.0. For your solution, you decide to use DTS to extract the data and a custom transformation to transform each row into an XML-formatted string that you output to a text file.
For this demonstration, we used the Active Template Library (ATL) to create a transformation. To maximize reusability and save development time, we created an ATL Add-In object module for a Custom Transformation object. (WebSQL subscribers can download the files for this custom transformation from the article index on the SQL Server Magazine Web site, http://www.sqlmag.com.) When you execute this add-in, it creates a default DTS Transformation object and its interfaces. (Later in this article, we'll describe the object and each of the interfaces in detail.)
First, install and register the Custom Transformation ATL Add-In by unzipping the contents of XForm Object.zip into Program Files\Microsoft Visual Studio\Common\MSDev98\Template\ATL and running the Registry file, CustomXForm.reg. In addition, you need to define the SQL Server include directory MSSQL7\DevTools\Include in your development environment. (You might have created this directory when you installed the development tools from the SQL Server CD.) Note that all the code that this add-in creates (beyond what most ATL objects need) has a comment denoting that DA/BL (the authors) added it. This marker gives you a searchable string to identify what we added.
Establish the Project Environment. To begin, create a new C++ project. Choose File, New... from the menu. Then, switch to the Projects tab, select the ATL COM AppWizard, add a Project Name (we used Xform), and click OK. The AppWizard then prompts you for the type of server you want to build. Because custom transformations run as in-process COM servers in the Data Pump, choose DLL (the default) as the server type. Click Finish, and then OK to confirm the object creation. The last step in establishing the project environment is adding ,_ATL_NO_UUIDOF (don't forget the comma!) to the Preprocessor definitions in the C/C++ tab of the Project, Settings dialog box. You need to add this definition because of a minor version incompatibility between the SQL Server DTS files we use and Visual C++ 6.0. (For a more complete explanation, refer to the Microsoft Knowledge Base article Q192561 at http://support.microsoft.com.) Now that you've established the project environment, you're ready to add the Custom Transformation objects and interfaces.
Add Objects and Interfaces. To use the Custom Transformation ATL Add-In Object, choose Insert, New ATL Object... from the menu. Then, scroll to the bottom, select the Custom Transformation Object, as Screen 1 shows, and click Next. On the Names tab, fill in the C++ Short Name (we used XMLDoc; everything else defaults). To support error handling, switch to the Attributes tab, select Support ISupportErrorInfo, and click OK. (Building verbose error-handling code that COM can use to pass error information to the Data Pump is beyond the scope of this article. Send us email if you want to see an article on it.) The AppWizard creates a default Custom Transformation object, CXMLDoc, which inherits from IDTSDataPumpTransform, and declares all the interfaces necessary to implement the transformation. WebSQL subscribers can download WebSQL Listing 1 at http://www.sqlmag.com, which contains the complete header file for the transformation.
Customizing the Transformation
All transformations need to support the IDTSDataPumpTransform interface. Figure 3 shows the properties and methods of this object. Because the transformation inherits this interface, the Data Pump can control it. The Custom Transformation ATL Add-In Object handles all these declarations, letting you concentrate on the specific functionality of the code. In the transformation, this specialization begins by establishing three properties (m_TableName, m_RowName, and m_XSLPath) and the member variable m_WrittenHeaderInfo. The properties are implemented as standard get and put methods. When you implement a package at design time, transformation properties that implement a get method are exposed for input. In this example, m_WrittenHeaderInfo is private to the transformation; the Package Designer exposes the others.
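To make the structure concrete, here is a rough, simplified sketch of the shape of the class the add-in generates. It is an illustration only: the stand-in types and method signatures are ours, not the exact declarations from dtspump.h (the real class also carries the usual ATL plumbing), but the members and the calling order match what we describe in the rest of this section.

// Simplified stand-in for the generated CXMLDoc class (illustration only).
// The real class derives from IDTSDataPumpTransform and uses the signatures
// defined in dtspump.h; here we strip the COM details so the structure is easy to see.
#include <string>

struct RowData;  // stand-in for the DTS source/destination column structures

class CXMLDocSketch
{
public:
    // Properties exposed to the Package Designer (get/put pairs in the real code)
    std::wstring m_TableName;   // outermost XML element name
    std::wstring m_RowName;     // element name wrapped around each data row
    std::wstring m_XSLPath;     // location of the .xsl style sheet

    // Private state: has the XML header been written yet?
    bool m_WrittenHeaderInfo;

    // IDTSDataPumpTransform methods, in the order the Data Pump calls them
    long Initialize();                       // fires once, when the object is created
    long AddVariable();                      // receives package variables (unused here)
    long ValidateSchema(const RowData& src, const RowData& dest);
    long Execute(const RowData& src, RowData& dest);   // fires once per source row
    long OnRowComplete();                    // after each successful fetch
    long OnTransformComplete();              // after all rows are processed
};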
Now, you're ready to begin adding custom code. The Initialize method fires only once, when the object is first instantiated. Here, you can establish the object's state. This example creates an .XML file for output. Before you write out the individual data rows, you need to create a header record to define the location of the .XSL file. (An XSL file is the type of style sheet XML uses to define the data layout and data formatting.) Listing 1 contains the code in the .XSL file. In the Initialize method, set the m_WrittenHeaderInfo flag to FALSE, signifying that the header needs to be written.
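In simplified form (again using the stand-in class above rather than the real COM signatures), the Initialize logic amounts to little more than resetting the flag:

// Sketch of Initialize: the object has just been created, so record that the
// XML header (the declaration, the style-sheet reference, and the opening table
// tag) still needs to be written. Execute emits the header on the first row.
long CXMLDocSketch::Initialize()
{
    m_WrittenHeaderInfo = false;   // header not yet written
    return 0;                      // NOERROR in the real COM method
}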
The AddVariable method fires next in the transformation's execution sequence. This method's purpose is to let application (package) variables pass through to the transformation process. You don't have any variables to pass here, so leave the method empty.
The next method to fire is ValidateSchema, which validates the source and destination columns for consistency with the Execute method's expectations. Based on the output from this method, the Data Pump determines whether to call the Execute method. Because the destination data store is an XML file, the first validation rule checks that the destination data store defines only one column. The second rule validates that the destination column is a string data type. The final rule confirms that each source column is a string data type. If any rule fails, the return code becomes DB_E_SCHEMAVIOLATION, and the Data Pump terminates the transformation. If the schema is valid, the return value is NOERROR and processing continues.
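The following standalone sketch expresses those three rules over simplified column descriptions instead of the real DTS column structures; in the actual method, a failure maps to DB_E_SCHEMAVIOLATION and success to NOERROR.

// Sketch of the three validation rules, using simplified column descriptions
// (an illustration only, not the real DTS schema structures).
#include <vector>

enum class ColType { String, Other };

bool ValidateSchemaSketch(const std::vector<ColType>& sourceCols,
                          const std::vector<ColType>& destCols)
{
    // Rule 1: the destination (the XML text file) defines exactly one column.
    if (destCols.size() != 1)
        return false;

    // Rule 2: that destination column is a string type.
    if (destCols[0] != ColType::String)
        return false;

    // Rule 3: every source column is a string type.
    for (ColType c : sourceCols)
        if (c != ColType::String)
            return false;

    return true;   // schema is valid; the Data Pump will call Execute
}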
After the ValidateSchema method completes, the Execute method runs. This method performs the data transformation, firing once for each data row in the data source. As you might guess, this method does the bulk of the processing work. Data validations and data manipulations occur here. In the example, you create the XML strings for the file here. After establishing local variables to store the data, determine whether the header has been written. If not, the custom transformation creates the header and adds it to the destination string. One limitation of the Data Pump is that it handles only one data source and one data destination for each conversion. So, for example, you can't create or modify a corresponding XSL style sheet in the same transformation to manage the way the XML document displays. Also, this limitation implies that the transformation has only one data row as input and one row as output, which directly affects your custom transformation. As a workaround, the transformation builds the destination value as a single string: It concatenates the tags and data, inserting carriage returns and line feeds to separate the multiple lines of XML, until the entire source row is processed. Then, the transformation sends the string to the destination data store (in this case, the XML file). This processing continues until an error occurs or until all the source data is exhausted.
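To illustrate the string-building work without reproducing the DTS column buffers, here is a self-contained sketch of what the transformation assembles for one source row. The helper name and parameters are ours, not the real Execute signature; the real method copies the finished string into the single destination column and fails if the string overruns the column's defined length.

// Sketch of the per-row string-building logic (illustration only).
#include <string>
#include <vector>

std::wstring BuildXmlRow(const std::vector<std::wstring>& colNames,
                         const std::vector<std::wstring>& colValues,
                         const std::wstring& tableName,   // m_TableName
                         const std::wstring& rowName,     // m_RowName
                         const std::wstring& xslPath,     // m_XSLPath
                         bool& writtenHeaderInfo)         // m_WrittenHeaderInfo
{
    const std::wstring crlf = L"\r\n";
    std::wstring out;

    // First row only: emit the XML declaration, the style-sheet reference,
    // and the opening table tag.
    if (!writtenHeaderInfo)
    {
        out += L"<?xml version=\"1.0\"?>" + crlf;
        out += L"<?xml-stylesheet type=\"text/xsl\" href=\"" + xslPath + L"\"?>" + crlf;
        out += L"<" + tableName + L">" + crlf;
        writtenHeaderInfo = true;
    }

    // Wrap each source column value in an element named after the column.
    out += L"<" + rowName + L">" + crlf;
    for (size_t i = 0; i < colNames.size() && i < colValues.size(); ++i)
        out += L"  <" + colNames[i] + L">" + colValues[i] + L"</" + colNames[i] + L">" + crlf;
    out += L"</" + rowName + L">";

    return out;   // Execute copies this single string into the destination column
}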
The remaining two methods act as events. The OnRowComplete method fires after each successful fetch operation. The OnTransformComplete method fires after all rows have been transformed and the Data Pump terminates. You might use these methods to free memory allocations made during the transformation, but we didn't use either of them in this project.
Another item to note about this XML custom transformation is that at package design time, you need to provide code that adds the final XML tag to the output file. So far, the Execute method has accounted for all the formatting of the XML file. However, because of a design weakness in the combination of the Data Pump and custom transformations, a custom transformation can't tell when to write the final XML tag to the destination output file. No event fires after all rows have been transformed but before the transformation loses access to the destination data stream. The Data Pump hides all the complexity of data management (pulling data from the source and sending it to the destination), so the transformation can't know when it is processing the last data row from the source connection; it knows only when an individual transformation is complete (via the OnTransformComplete method). After the Data Pump processes all the source data, it closes the destination data stream before calling OnTransformComplete. When we discuss implementing the package later in the article, we describe one way to work around this limitation. With that, the transformation is complete and ready for compilation.
Using the Transformation
Now that you've built the transformation, put it to work. First, create a new package. Select the Data Transformation Services folder for your server in Enterprise Manager, then right-click and select New Package. Unlike custom tasks, which you register from within the Package Designer, you simply register the transformation component with regsvr32 to make it available to any Data Pump or Data Driven Query task you might use in your package. To handle registration, we enabled the ATL Add-In to generate the COM component and SQL Server Registry information for you. For details, see the DllRegisterServer() and DllUnregisterServer() functions in Xform.cpp. (This file is part of XForm.zip, which we mentioned earlier.)
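For reference, the registration entry points that the ATL COM AppWizard generates in Xform.cpp look roughly like the following; the add-in's version also writes the DTS-specific Registry information, which we omit here.

// Typical ATL wizard-generated registration entry points for an in-process server.
STDAPI DllRegisterServer(void)
{
    // Registers the object, the type library, and all interfaces in the type library.
    return _Module.RegisterServer(TRUE);
}

STDAPI DllUnregisterServer(void)
{
    return _Module.UnregisterServer(TRUE);
}

// From a command prompt, register the compiled DLL with:
//   regsvr32 Xform.dll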
Returning to the example, add a data connection to the Pubs database. Screen 2 shows the Pubs connection. Next, define your destination connection. Because the XML output file is nothing more than a specially formatted text file, use the Text File (Destination) data source, which the Package Designer offers. Direct the output data to a file called ArticleAuthors.xml, as in Screen 3. Formatting of the output file takes place in the custom transformation. Therefore, in the Text File Properties dialog box, you need to ensure that the fixed field option is enabled and that the other options remain at their default values, as Screen 4 shows.
Now use a Data Pump task to add the transformation. First, select the Pubs connection you added. Next, select the XML File connection, and choose Workflow, Add Transformation.
After you add the Transform Data task, you need to define the data sets. Right-click the task, and select Properties. This action opens the Data Transformation Properties dialog box. This example pulls data directly from the Authors table and places it into the XML file. On the Source tab, choose the Table Name option and select the Authors table. You can then define the destination file. The custom transformation handles all the string concatenation and special XML formatting issues, so the destination file will have only one output column. After you switch to the Destination tab, DTS prompts you to define the columns of the output file. Click Define Columns, then set the fields with the pull-down menus in the resulting dialog box. By default, the list includes all columns from the source connection. For this example, override the default, setting all the fields except one to ignore, as you see in Screen 5. Rename the remaining column to XMLRow, change its type to not quoted, and ensure that it is large enough to include all the data from the data row, plus the XML tags. (We chose a size of 300 characters for the example.) You can see the resulting output file definition in Screen 6. To avoid cluttering the Execute routine with error-processing code, the example detects only buffer overruns on length.
Now that you've defined the source and destination data sets, define the transformation. You map the columns on the Transformations tab of the Data Transformation Properties dialog box, which Screen 7, page 45, shows. By default, DTS maps each source column to a destination column in column order. Before you add the transformation, you must delete the default column mappings: Select the transformation line between the source and destination tables, then delete it. Next, redefine the mapping by selecting au_lname, au_fname, address, city, and state in the source table and their corresponding columns in the destination table (in our example, the destination column is equivalent to one XML row). After selecting the columns, expand the New Transformation drop-down list box, select the transformation type XMLDoc Class, and click New to create the custom transformation. At this point, the custom transformation prompts you to define its properties, as in Screen 8. These properties, which you defined earlier in the custom transformation, are case sensitive, so be careful.
The final piece of the package accounts for the custom transformation's inability to identify the end of the data stream and the resulting lack of a final XML tag. To fix the problem and write the tag, simply add the ActiveX Script code from Listing 2 to the package. And to ensure that the task executes only after the transformation successfully completes its job, add an On Success precedence constraint to the workflow.
Tying It All Together
You can see the completed package in Screen 9. Running the package generates the XML file ArticleAuthors.xml. When you combine this file with the Authors.xsl style sheet and use IE 5.0 to open the result, you will see the nicely formatted Web page that Screen 10 shows.
Custom transformations provide another powerful way to extend the capabilities of DTS. Even with their limitations, custom transformations offer developers a significant advantage over the standard ActiveX Script transformations in maintainability, functionality, and performance.