A Look at the SSIS Object Model
Build a custom data flow component
January 2, 2014
SQL Server Integration Services (SSIS) is the extraction, transformation, and loading (ETL) tool of choice for many administrators and developers. It connects to a wide range of data sources and destinations, and it supports an ever-growing list of data transformations. The reason for this versatility is its robust object model. The SSIS object model provides a consistent framework to build disparate components for disparate data sources.
Data flow components are an important part of the SSIS framework because they let you connect to a variety of data sources, then transform and route the data at high speed. SSIS includes out-of-the-box data flow components, but you aren't limited to those. The extensible SSIS object model lets you create your own custom components to satisfy your ETL needs. For example, you can create a source component that reads from Apache Hadoop, MongoDB, Couchbase, or any other NoSQL data store. You can also create SSIS components that read from or write to third-party cloud applications such as Salesforce.com and Workday. The possibilities are limited only by your imagination.
I'll introduce you to the types of data flow components and their basic structure and modes. I'll also show you how custom data flow components function at the object level and describe what you need to know to build them.
Types of Data Flow Components
All data flow components appear under the Data Flow Task heading in the SSIS Toolbox. Unlike other SSIS components, they fall into three categories (the sketch after this list shows how each category is declared in code):
Source components. Source components let you connect to external data sources and read data. An example is the OLE DB Source component.
Transformation components. With transformation components, you can manipulate or transform data. An example is the Sort component.
Destination components. Destination components let you write data to external data sources. An example is the Excel Destination component.
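As a minimal sketch of how these categories surface in code: a custom component declares its category through the ComponentType value of its DtsPipelineComponent attribute, which you'll see again in Listing 1 later in this article. The class names below are hypothetical placeholders.

using Microsoft.SqlServer.Dts.Pipeline;

// A source component reads from an external data source.
[DtsPipelineComponent(DisplayName = "MySource",
    ComponentType = ComponentType.SourceAdapter)]
public class MySource : PipelineComponent { }

// A transformation component manipulates data in the flow.
[DtsPipelineComponent(DisplayName = "MyTransform",
    ComponentType = ComponentType.Transform)]
public class MyTransform : PipelineComponent { }

// A destination component writes to an external data destination.
[DtsPipelineComponent(DisplayName = "MyDestination",
    ComponentType = ComponentType.DestinationAdapter)]
public class MyDestination : PipelineComponent { }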
Besides the functionality differences, data flow components differ from other components in the inputs and outputs allowed. For example, consider the OLE DB Source component in Figure 1.
Figure 1: OLE DB Source Component with Two Outputs
It has one output and one error output. If you try connecting the output of another component to this OLE DB Source component, you'll get the error message shown in Figure 2. This error occurs because source components can't have any inputs. Their role is only to read data from an external data source.
Figure 2: Error Message Received When Trying to Connect a Component's Output to an OLE DB Source Component
Similarly, destination components can't have any outputs because their role is to write data to an external data source, not pass the data further downstream. Transformation components can have both inputs and outputs because they receive data, manipulate or transform it, then pass it on to other components.
Synchronous and Asynchronous Components
Each data flow component uses data buffers for managing the flow of data. You can think of a data buffer as an in-memory tabular data structure that's used by SSIS to move data between components. Source components read data from external data sources and write data to data buffers, which are passed on to downstream components. Destination components read data from data buffers for their input and write data to external data sources. Because each data buffer has a maximum size limit, source and destination components are designed to process data buffers sequentially.
Transformation components are different from source and destination components in that they read data from data buffers for their input as well as write the transformed data to data buffers as output. Based on how they process data buffers, transformation components can be classified as synchronous or asynchronous transformation components.
Synchronous transformation components read data from data buffers, transform it, and immediately write the modified data back to the same data buffers as output. The component is called synchronous because the modified data gets written back to the data buffer as soon as possible. The component doesn't wait to get all the data from multiple sequential buffers before writing it out.
Asynchronous transformation components don't immediately write data to data buffers as output. An example is the Sort component. The Sort component waits until it has received all the rows from the data buffer containing the input before it sorts the rows, after which it writes the sorted rows to a new data buffer as output.
Design and Runtime Modes
All data flow components work in two modes: design mode and runtime mode. In design mode, all the actions (e.g., editing component properties, connecting component inputs and outputs) take place in the SSIS package designer. In runtime mode, all the actions (e.g., fetching data, accessing it in the data buffer) take place during package execution.
It's important to note that all custom SSIS components need to provide functionality to support both modes. As you'll see shortly, this translates into implementing the respective methods of the design-time and runtime interfaces.
Basic Structure of a Custom Data Flow Component
Before you can work with the design-time and runtime interfaces, you need to know about the basic structure of a custom data flow component. Here's the class definition for a barebones data flow component that currently doesn't do anything:
public class MySSISComponent : Microsoft.SqlServer.Dts.Pipeline.PipelineComponent
{
}
As you can see, a data flow component is basically a class that's derived from the Microsoft.SqlServer.Dts.Pipeline.PipelineComponent base class. The PipelineComponent base class provides a lot of the plumbing required for the design-time and runtime interactions of the component. It implements the IDTSDesignTimeComponent100 and IDTSRuntimeComponent100 interfaces. As the names suggest, the IDTSDesignTimeComponent100 interface provides methods that are invoked when working with the component in design mode, whereas the IDTSRuntimeComponent100 interface provides methods that are invoked during package execution. When building a data flow component, you override these methods and provide custom implementations wherever applicable.
Figure 3 provides a high-level look at the various classes and interfaces that are part of PipelineComponent. (Note that it doesn't include the objects and methods that are outside the scope of this article.) As you can see, PipelineComponent provides base class implementations of various methods, including AcquireConnections, Validate, and PrimeOutput. All these methods are invoked by the SSIS designer during design time or by the SSIS runtime engine during package execution.
Figure 3: High-Level Look at the PipelineComponent Class
Each design-time method in PipelineComponent gets called at different points when you edit the component. For example, SetComponentProperty is called when you set the value of a custom property of the component, whereas ProvideComponentProperties is called when you first add the component to the package. It's important to note that the SSIS designer is built with knowledge of these interfaces, so it knows what method to call for each type of event.
In Figure 3, notice the ComponentMetaData property, which implements the IDTSComponentMetaData100 interface. As the name suggests, ComponentMetaData contains the component's metadata, including the component's name and custom properties. The most essential properties are two collections, InputCollection and OutputCollection, which store the component's inputs and outputs. When you save an SSIS package to a file, the ComponentMetaData object is serialized to XML and stored in the .dtsx file.
Component Inputs and Outputs
Each data flow component can have any number of inputs or outputs. In the SSIS designer, an input or output is represented as an arrow going into or out of the data flow component. In the underlying code that implements the data flow component, an input is an object that implements the IDTSInput100 interface, and an output is an object that implements the IDTSOutput100 interface. In fact, all the items that you work with in the SSIS designer—inputs, outputs, properties, error outputs, and so on—are objects that implement specific interfaces from the SSIS framework.
As I mentioned previously, the SSIS designer is built to understand these interfaces. For example, when the SSIS designer finds an object that implements IDTSOutput100, it knows to look at the OutputColumnCollection property to get the columns in the output. After it gets the columns, it knows that each column will be of type IDTSOutputColumn100. It also knows to look for each column's properties (e.g., Length, Precision).
IDTSOutput100 and IDTSInput100 are complex objects. I'll provide more information about IDTSOutput100, but the concepts are similar for IDTSInput100.
Figure 4 shows the object model for IDTSOutput100. As you can see, IDTSOutput100 is at the top of the hierarchy. It contains an output column collection (IDTSOutputColumnCollection100), which can contain any number of output columns (IDTSOutputColumn100), including none (i.e., an empty collection). Each output column, in turn, contains column properties (e.g., DataType, Length, Precision), which are highlighted at the bottom of Figure 4.
Figure 4: Object Model for IDTSOutput100
When you create your data flow component, you need to set the column properties so that they match the property values in the external data source or destination. Failure to do so can result in errors due to column type incompatibility. For example, if you set an output column's DataType property to String but the data flow component tries to write an Int value to it, you'll get an error.
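For example, here's a small sketch, placed inside a method such as ProvideComponentProperties, of adding an output column and setting its type to match a 50-character Unicode string column in the external source. The column name and length are assumptions for illustration; SetDataTypeProperties sets DataType, Length, Precision, Scale, and CodePage in one call.

IDTSOutput100 output = ComponentMetaData.OutputCollection[0];

// Add an output column whose type matches the external source column.
IDTSOutputColumn100 col = output.OutputColumnCollection.New();
col.Name = "OLEDB_Col"; // assumed column name, for illustration
col.SetDataTypeProperties(DataType.DT_WSTR, 50, 0, 0, 0);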
In addition to the output column collection, IDTSOutput100 has other properties. The most noteworthy properties are:
Buffer. Each IDTSOutput100 object is allocated a data buffer object, which is used for data transfer. This property stores the integer ID of the buffer.
CustomPropertyCollection. This property contains the collection of custom properties for the output.
ExternalMetadataColumnCollection. This property contains the collection of external metadata (more on this later).
SynchronousInputID. For a synchronous output, this property contains the ID of the input that feeds data into the output; for an asynchronous output, it's 0 (see the sketch after this list).
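To tie this back to the earlier discussion of synchronous and asynchronous components, here's a hedged sketch of how a hypothetical transformation might declare a synchronous output inside ProvideComponentProperties. Setting SynchronousInputID to an input's ID makes the output synchronous; leaving it at 0 makes it asynchronous.

public override void ProvideComponentProperties()
{
    IDTSInput100 input = ComponentMetaData.InputCollection.New();
    input.Name = "Input";

    IDTSOutput100 output = ComponentMetaData.OutputCollection.New();
    output.Name = "Output";

    // Synchronous: the output reuses the input's data buffer,
    // so rows pass through without being copied.
    output.SynchronousInputID = input.ID;

    // An asynchronous output (e.g., for a Sort-style component)
    // would instead leave SynchronousInputID at 0.
}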
Data Buffers
Each input and output for an SSIS component has a data buffer associated with it, which is simply an object of type PipelineBuffer. As I just mentioned, the ID of the data buffer is stored in the Buffer property of the input or output object. The PipelineBuffer object is an in-memory, two-dimensional object with rows and columns. Data buffers are shared by all the components in an execution tree in an SSIS package. An execution tree is basically used to group together components in a data flow. It starts with a source or asynchronous transformation component and ends with a destination or another asynchronous transformation component. Different types of components use the PipelineBuffer for different purposes. For example, destination components read data from the PipelineBuffer and write it to an external data source. Source components do the opposite—they read data from an external data source, write the data to the PipelineBuffer object, then pass the object to a downstream component.
It's important to note that the columns in PipelineBuffer aren't the same as the input or output columns in the component to which PipelineBuffer is linked. To understand why, consider the SSIS package in Figure 5. In this package, the OLE DB Source component reads a single column, OLEDB_Col, from an external data source. The Character Map transformation adds one more column, CMap_Col, that converts the text in OLEDB_Col to lowercase. The execution tree here begins at the OLE DB Source component and ends at the Flat File Destination component. Shown alongside the execution tree is the state of the data buffer. Note that the column names in the data buffer are shown only for reference. In practice, data buffer columns don't have names.
Figure 5: Comparison of the PipelineBuffer Object's Columns and the Components' Input and Output Columns
Character Map is a synchronous transformation, which means it modifies the buffer that it receives as input and immediately passes the modified buffer to the output. Interestingly, anytime the SSIS runtime engine creates a data buffer for the OLE DB Source component, it automatically adds an extra column to the buffer for Character Map. This saves the overhead of adding an extra column to the data buffer during runtime. So, when the OLE DB Source component starts writing data to the buffer, the extra column is already available but is left blank. The Character Map transformation simply converts the value in OLEDB_Col to lowercase and populates the extra column with it.
As a result, the PipelineBuffer object for an input or output can have more columns than the input or output itself. For example, the output for the OLE DB Source component has only one column (OLEDB_Col), but the corresponding data buffer has two columns. Given that the columns in the data buffer don't have names, there's no apparent way to map the OLEDB_Col output column to a data buffer column. Fortunately, the component's BufferManager property (an IDTSBufferManager100 object) provides the FindColumnByLineageID method, which finds the column in the data buffer based on the output column's LineageID property.
Because of the overhead in mapping input and output columns to data buffer columns, your component should maintain a mapping between the columns in the PipelineBuffer object and those in the component's inputs and outputs. I'll describe how to build this mapping later.
Sample Code for a Custom Data Flow Component
Now that you have an understanding of the SSIS object model, let's look at some sample code. Listing 1 shows the code for a basic custom SSIS data flow component. The first thing to note is that the component class needs to have the DtsPipelineComponent attribute. This attribute mainly provides information for the SSIS designer, such as the component's name, type (source, destination, or transformation), and description. The DtsPipelineComponent attribute is required for any custom SSIS component.
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO;

namespace Microsoft.Samples.SqlServer.Dts
{
  [DtsPipelineComponent(DisplayName = "SampleComponent",
    ComponentType = ComponentType.SourceAdapter)]
  public class MySSISComponent : PipelineComponent
  {
    public override void ProvideComponentProperties() { }

    public override void AcquireConnections(object transaction) { }

    public override IDTSCustomProperty100 SetComponentProperty(
      string propertyName, object propertyValue)
    {
      // Placeholder: defer to the base implementation.
      return base.SetComponentProperty(propertyName, propertyValue);
    }

    public override DTSValidationStatus Validate()
    {
      // Placeholder: defer to the base implementation.
      return base.Validate();
    }

    public override void ReinitializeMetaData() { }

    public override void PreExecute() { }

    public override void PrimeOutput(int outputs, int[] outputIDs,
      PipelineBuffer[] buffers) { }
  }
}
Another thing to note is the namespaces imported with the using directive. These namespaces provide the various classes and interfaces that make up the SSIS object model. Table 1 shows the assemblies that you need to reference in your project so that you can use these namespaces.
Table 1: Assemblies to Reference in the SSIS Project

| Assembly to Reference |
| --- |
| Microsoft.SqlServer.PipelineHost |
| Microsoft.SqlServer.DTSPipelineWrap |
| Microsoft.SqlServer.ManagedDTS |
| Microsoft.SqlServer.DTSRuntimeWrap |
Also notice the public override statements. A custom component is expected to override the methods inherited from PipelineComponent. These methods can be invoked multiple times, both when editing the component in design mode in Visual Studio and during package execution. It's important to get a clear understanding of the purpose of each method.
In Microsoft's SSIS documentation, the methods are classified as design-time methods and runtime methods. This classification is somewhat misleading, because many of the design-time methods are actually invoked both during design time and runtime. So, I like to classify them as "design-time and runtime methods" and "runtime-only methods."
Although there are many methods in PipelineComponent, some are more important to know about than others. The most important design-time and runtime methods are AcquireConnections, ProvideComponentProperties, ReinitializeMetaData, and Validate. The most important runtime-only methods are PreExecute, PrimeOutput, and ProcessInput. Let's take a look at each of these methods.
The AcquireConnections Method
Most source components are designed to read data from a data source, such as a SQL Server database or an Excel file. The purpose of the AcquireConnections method is to establish a connection to the data source and store the connection in a member variable. For example, if your component is connecting to a SQL Server database, you can create a SqlConnection object and use the AcquireConnections method to establish the connection, as shown in Listing 2.
using System.Data.SqlClient;

namespace Microsoft.Samples.SqlServer.Dts
{
  [DtsPipelineComponent(DisplayName = "SampleComponent",
    ComponentType = ComponentType.SourceAdapter)]
  public class MySSISComponent : PipelineComponent
  {
    private SqlConnection sqlConn;

    public override void AcquireConnections(object transaction)
    {
      if (sqlConn == null)
      {
        // connectionString is assumed to be defined elsewhere in
        // the component (e.g., read from a custom property).
        sqlConn = new SqlConnection(connectionString);
        // ...
      }
    }
  }
}
The main purpose of this method is to "cache" the connection object to the external data source. Also, if the connection object is null for some reason, the AcquireConnections method should reconnect to the external data source.
The ProvideComponentProperties Method
The ProvideComponentProperties method is called only once—when the component is first added to the package in the SSIS designer. You use this method to add inputs, outputs, and custom properties to the component.
Note that if you're designing a source component, you don't need to add any inputs. Similarly, you don't need to add any outputs for destination components. As a best practice, the first thing I do is remove any inputs, outputs, and custom properties already attached to the component (i.e., start with a clean slate). Afterward, I add new inputs, outputs, and custom properties as required. For example, the code in Listing 3 adds a single output and one custom property to the component.
public override void ProvideComponentProperties()
{
  // Remove all existing inputs, outputs, and custom properties.
  base.RemoveAllInputsOutputsAndCustomProperties();

  // Add a custom property.
  IDTSCustomProperty100 logFilePath =
    ComponentMetaData.CustomPropertyCollection.New();
  logFilePath.Description = "Log File Path";
  logFilePath.Name = "Log File Path";
  logFilePath.Value = String.Empty;

  // Add a single output named "Output".
  IDTSOutput100 output = ComponentMetaData.OutputCollection.New();
  output.Name = "Output";
}
Figure 6 shows what this component will look like in the SSIS designer.
Figure 6: Component Created by the Code in Listing 3
The Validate and ReinitializeMetaData Methods
When you're working with a component in the SSIS designer, the component is constantly being validated. What this means is that the SSIS designer periodically calls the Validate method in the background.
As the name suggests, the purpose of the Validate method is to validate the package and detect any abnormal situations. It's important to understand that you're responsible for providing the validation logic. It could be as simple as checking the inputs and outputs. For example, the code in Listing 4 validates a data source component by checking to see if the component has exactly one output and zero inputs.
public override DTSValidationStatus Validate()
{
  // Validation #1: Make sure there's only one output
  // and no inputs.
  bool pbCancel = false;
  if (ComponentMetaData.OutputCollection.Count != 1
    || ComponentMetaData.InputCollection.Count != 0)
  {
    ComponentMetaData.FireError(0, ComponentMetaData.Name,
      "The component should have only one output and no inputs.",
      "", 0, out pbCancel);
    return DTSValidationStatus.VS_ISCORRUPT;
  }

  // If the check passes, fall back to the base class validation.
  return base.Validate();
}
The Validate method returns a DTSValidationStatus enumeration. There are four possible return values:
VS_ISVALID. The component is correctly configured and ready for execution.
VS_ISBROKEN. The component is incorrectly configured. Typically, this means that a property's value is set incorrectly.
VS_ISCORRUPT. The component is irreparably damaged and must be completely reset. The designer calls the component's ProvideComponentProperties method in response.
VS_NEEDSNEWMETADATA. The component's metadata is outdated or corrupt. The designer calls the component's ReinitializeMetaData method to repair the component.
In Listing 4, if the check fails, the Validate method fires an appropriate error message and returns the VS_ISCORRUPT value.
Let's take a closer look at when you might use the VS_NEEDSNEWMETADATA value when writing the validation code. Suppose you have a source component that's reading from a database table and building its output column collection based on the source table columns, as Figure 7 shows.
Figure 7: Source Component That's Building its Output Column Collection Based on the Source Table's Columns
If one of the columns in the source table is dropped, the component should also drop the corresponding output column from its output column collection. In this scenario, the Validate method should have logic to check for this anomaly. If there's a discrepancy between the columns in the source table and the component's output columns, the Validate method should return the VS_NEEDSNEWMETADATA value. However, the downside of this check is that the Validate method must now repeatedly establish a connection to the external data source or destination every time it's called. This can make your component's design experience extremely slow, especially if the external data source or destination is hosted in the cloud.
The SSIS object model lets you reduce these round trips through the output's external metadata collection, an IDTSExternalMetadataColumnCollection100 object exposed by the ExternalMetadataColumnCollection property. This collection of columns should mirror the columns in your external data source, in the sense that their attributes (e.g., DataType, Length, Name) should match those in the external data source. After this collection is populated, your component can validate its input and output columns against the external metadata collection without actually connecting to the external data source. It needs to connect to the external data source only periodically, to ensure that the columns in its external metadata collection are still accurate. The code for building an external metadata collection is outside the scope of this article, but the high-level logic that should be used is shown in Figure 8.
Figure 8: Validation of Input and Output Columns Against an External Metadata Collection
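As a rough illustration of that logic, the fragment below sketches the kind of check a Validate method might perform against the external metadata collection. It's a simplified sketch under assumptions: a real implementation would compare individual column names and data types, not just the counts.

// Inside Validate(): compare the output columns against the cached
// external metadata collection instead of reconnecting to the
// external data source on every call.
IDTSOutput100 output = ComponentMetaData.OutputCollection[0];
if (output.OutputColumnCollection.Count !=
    output.ExternalMetadataColumnCollection.Count)
{
    // The cached metadata no longer matches the output columns;
    // the designer responds by calling ReinitializeMetaData.
    return DTSValidationStatus.VS_NEEDSNEWMETADATA;
}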
The PreExecute Method
As I mentioned in the "Data Buffers" section, a custom SSIS component should maintain a mapping between the columns in the PipelineBuffer object and those in the component's inputs and outputs. The best practice is to use the PreExecute method to build this mapping.
PreExecute is a runtime-only method that's called only once per execution, right before the actual execution methods. That makes it a great place to do housekeeping, including building a mapping between the input and output columns and the data buffer columns. For example, the code in Listing 5 builds a ColumnInfo structure that maps each output column in a component to the corresponding column index in the PipelineBuffer object. In this code, the FindColumnByLineageID method is key: given an output column's LineageID, it finds the corresponding column in the PipelineBuffer object.
public struct ColumnInfo
{
  public int BufferColumnIndex;
  public String columnName;
}

// Component-level list that holds the column mappings (implied by
// the original listing; declared here so the code compiles).
private List<ColumnInfo> colInfo = new List<ColumnInfo>();

public override void PreExecute()
{
  IDTSOutput100 output = ComponentMetaData.OutputCollection[0];

  foreach (IDTSOutputColumn100 outputCol in output.OutputColumnCollection)
  {
    ColumnInfo ci = new ColumnInfo();
    ci.columnName = outputCol.Name;
    ci.BufferColumnIndex = BufferManager.FindColumnByLineageID(
      output.Buffer, outputCol.LineageID);
    colInfo.Add(ci);
  }
}
The PrimeOutput Method
The only purpose of the PrimeOutput method is to add data rows to the output buffers. This runtime-only method is called for source components and asynchronous transformation components. It's not used for destination components because they don't have any outputs. It's not used for synchronous transformation components because they don't add any new rows to the output buffers.
The PrimeOutput method requires three parameters:
outputs (int). This parameter specifies the number of outputs in the component.
outputIDs (int[]). This parameter is an array containing the IDs of the component's outputs.
buffers (PipelineBuffer[]). This parameter is an array of the output buffers. There's one buffer per output.
Listing 6 contains a sample implementation of the PrimeOutput method. First, the data is read and loaded into a regular .NET DataTable. Then, the data is added to the PipelineBuffer object row by row.
public override void PrimeOutput(int outputs, int[] outputIDs,
  PipelineBuffer[] buffers)
{
  PipelineBuffer buffer = buffers[0];
  IDTSOutput100 output = ComponentMetaData.OutputCollection[0];

  // Call your data retrieval method.
  DataTable dt = GetDataTable();

  foreach (DataRow row in dt.Rows)
  {
    buffer.AddRow();

    // Locate the column index in the data buffer.
    foreach (ColumnInfo ci in colInfo)
    {
      buffer[ci.BufferColumnIndex] = row[ci.columnName];
    }
  }

  buffer.SetEndOfRowset();
}
Note the use of the ColumnInfo structure that was created in the PreExecute sample code in Listing 5. The PrimeOutput method code uses it to map a column in the buffer with the correct column in the DataTable. At the end of the code, the SetEndOfRowset method is called to signal that all the data rows have been successfully added.
The ProcessInput Method
The SSIS runtime engine calls the ProcessInput method to hand a component a PipelineBuffer object. Inside this method, the component iterates over the rows in the buffer and consumes them. For destination components, the rows in the buffer get written to an external data destination. For asynchronous transformation components, the rows are processed per the logic in the transformation and written to a new data buffer that's passed on to downstream components.
An individual column in the buffer can be read or written in one of two ways. One approach is to use the array indexer (the [] operator). Alternatively, you can use one of the typed Get or Set methods. The Get and Set methods are more efficient and should be used when the data type of the buffer column is known. In Listing 7, the ProcessInput method processes incoming rows using both approaches.
// inputColumns is assumed to be an int[] of buffer column indexes
// built in PreExecute (analogous to colInfo in Listing 5).
private int[] inputColumns;

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
  List<String> lst = new List<String>();

  while (buffer.NextRow())
  {
    for (int x = 0; x < inputColumns.Length; x++)
    {
      if (!buffer.IsNull(inputColumns[x]))
      {
        // Approach 1: the array indexer.
        String str1 = (String)buffer[inputColumns[x]];

        // Approach 2: the typed Get method.
        String str2 = buffer.GetString(inputColumns[x]);

        lst.Add(str1);
      }
    }
  }

  // Write lst to an external data destination.
  // ...
}
The Heart of SSIS
The object model is the heart of SSIS. As such, it provides a solid base to build different types of ETL components. In this article, I showed you the concepts you need to know to build a custom data flow component. In an upcoming article, I'll discuss how to apply these concepts to build a custom data flow component that connects to a data source and reads its data.