Use SSIS for ETL from Hadoop

In this blog post, Mark Kromer walks you through using SSIS to apply ETL techniques to Microsoft's Hadoop on Windows (HDInsight), using the Hive ODBC connector as the source.

Mark Kromer

May 8, 2013


HDInsight is Microsoft's version of Hadoop on Windows. It provides most of the commonly used pieces of the Apache Hadoop Big Data platform, including the HDFS file system, Sqoop for data import/export, Hive for SQL queries, the MapReduce distributed programming infrastructure, and ODBC drivers that let you reach your data in HDFS from tools like Excel and SQL Server. In this demo, I am going to show you how to use those ODBC drivers to extract data from Hadoop with HDInsight as the data store, transform it in SSIS, and load it into SQL Server (extraction, transformation and loading, or ETL).

First, you need to create a simple schema on top of your data in Hadoop. I have a simple text file that I've loaded into HDInsight (Microsoft's Hadoop on Windows) by generating a schema in Hive (using HQL) and loading that flat file into the schema. Note that I am loading only a few simple rows into the Hive table from the flat file for this example:

create external table ext_sales (
  lastname string,
  productid int,
  quantity int,
  sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
The contents of sales.csv (lastname, productid, quantity, sales_amount):

Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,1,21
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11


We can view the data in that Hive table in Hadoop using the HDInsight web user tool:
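If you prefer the Hive command line to the web tool, a couple of quick HiveQL statements will confirm the table and its contents (a sketch using only standard HiveQL):

SHOW TABLES;
DESCRIBE ext_sales;
SELECT * FROM ext_sales LIMIT 10;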

Now, in Visual Studio (SSDT, to be precise), I've created a new Integration Services project that uses the ODBC driver for Hive to treat that Hadoop-based data store as the source. Note that I first created a user DSN in Windows with that ODBC connector, which I can then reference in any tool that uses ODBC, such as SSIS, Excel, etc.:
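The SSIS ODBC source can point at the Hive table by name or take a SQL command; if you go the query route, it is just plain HiveQL against the external table created earlier (a minimal sketch that simply lists all four columns):

SELECT lastname, productid, quantity, sales_amount
FROM ext_sales;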

I used that ODBC source to connect to HDInsight, then used a Derived Column transformation to add a 7% sales tax to each sales row and add that tax to the total sales amount, creating 2 new fields that are added to the data flow and inserted into the target SQL Server 2012 database via the normal SQL Server destination in SSIS:
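To make the math explicit, here is the same calculation expressed as a SQL query instead of Derived Column expressions (sales_tax and total_with_tax are hypothetical names for the two new fields; in the actual package they are computed in the data flow, not in the source query):

SELECT
    lastname,
    productid,
    quantity,
    sales_amount,
    sales_amount * 0.07                  AS sales_tax,
    sales_amount + (sales_amount * 0.07) AS total_with_tax
FROM ext_sales;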

Mapping the fields and converting data types between Unicode and DTS data types are the same actions you would take in SSIS if you were using a file or another RDBMS as a source. It's no different here: SSIS just sees Hive as an ODBC source:
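For reference, here is a minimal sketch of what the destination table could look like (the table and column names are assumptions for this example; the Hive string column maps to nvarchar, which is where the Unicode conversion in the mappings comes from):

CREATE TABLE dbo.FactSales (
    lastname        nvarchar(50),
    productid       int,
    quantity        int,
    sales_amount    float,
    sales_tax       float,
    total_with_tax  float
);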

Now you can run the package, and the rows are transferred into SQL Server, including the new columns from my Derived Column transformation:
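A quick query against the destination confirms that the rows landed along with the two derived columns (again assuming the hypothetical dbo.FactSales table sketched above):

SELECT lastname, sales_amount, sales_tax, total_with_tax
FROM dbo.FactSales;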

That's pretty much it. Pretty simple and straightforward. With this technique for ETL, you can use the power of Hadoop's distributed nodes and MapReduce to crunch very large, complex, and unstructured data, and then ETL portions of it into SQL Server for other systems, analytics, and so on. It is also a more natural way to import data into SQL Server from Hadoop without needing to learn or use Sqoop, which should benefit data warehouse and business intelligence developers and architects who are already well versed in tools like SSIS.
