Dramatically Decrease Data Warehouse Load Times


Tyler Chessman

May 12, 2015


Introduction

Recently, I needed to optimize a nightly batch process for a SQL Server data warehouse that involved loading over 250,000 flat files.  With a bit of experimentation and research, I was able to decrease the overall load time from 14 hours to under 15 minutes.  To do so, I used:

  • something old - the T-SQL BULK INSERT command,

  • something new - Memory-Optimized tables in SQL Server 2014, and

  • something borrowed - a trick to execute T-SQL commands asynchronously (and in parallel).

 

Let’s walk through this implementation together.


Background

If you’ve read any of my past articles, you know I’m a bit of an economics junkie.  The Federal Reserve Bank of St. Louis (http://research.stlouisfed.org/) makes available a vast economic data set – over 250,000 United States and international time series from 79 sources, consisting of over 38 million rows.  Some of these series are updated daily, and older values can be changed.  Storing this data in a relational database presents a bit of a maintenance challenge in that each series must be reloaded/merged in its entirety.  My first attempt at reloading the data, which leveraged a basic SQL Server Integration Services (SSIS) package (figure 1), completed in 14 hours – not exactly a great solution.

 

 

Storing a bunch of time series data (which consists of just two columns – date and value) in a database may seem unnecessary.  I did consider keeping the series in their original format (i.e., as individual files) or using a NoSQL store (e.g., Hadoop).  But one of my key requirements was to calculate the correlation of a time series against this entire dataset.  With the power of a SQL database (and set-based queries), I can run this kind of query in about a minute on a moderately powered server.
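
To give a flavor of that set-based approach, here is a hedged sketch of such a correlation query.  It assumes a destination table named dbo.SeriesValue with SeriesId, [Date], and Value columns (the column names used elsewhere in this article); it is an illustration, not the exact production query.

-- Pearson correlation of one target series against every other series,
-- assuming a destination table dbo.SeriesValue (SeriesId, [Date], Value).
DECLARE @TargetId int = 1;   -- hypothetical series to correlate against

SELECT   b.SeriesId,
         ( AVG(CONVERT(float, a.Value) * CONVERT(float, b.Value))
           - AVG(CONVERT(float, a.Value)) * AVG(CONVERT(float, b.Value)) )
         / NULLIF(STDEVP(CONVERT(float, a.Value)) * STDEVP(CONVERT(float, b.Value)), 0) AS PearsonR
FROM     dbo.SeriesValue a
JOIN     dbo.SeriesValue b
         ON b.[Date] = a.[Date]
        AND b.SeriesId <> a.SeriesId
WHERE    a.SeriesId = @TargetId
  AND    a.Value IS NOT NULL
  AND    b.Value IS NOT NULL
GROUP BY b.SeriesId
HAVING   COUNT(*) > 30;      -- require a reasonable overlap of observations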

 

Improving Performance – Modifying the SSIS package

To improve performance, I first tried optimizing the SSIS package.  Testing with a smaller subset of data (1000 data series), the package in its initial state executed in 3.5 minutes.  By changing the DataFlow task to use a SQL Destination (rather than an OLE DB Destination), I was able to get the package to complete in 1.5 minutes (figure 2).

 

Next, I looked at various techniques to run the SSIS package in parallel.  I modified the package to concurrently process different file batches (via multiple Data Flow tasks), created a master package to run the base packages in parallel (figure 3), and tried various permutations thereof.  Unfortunately, I was not able to achieve any improvement in performance.  Even though my machine didn’t indicate stress in terms of CPU, memory, or disk, SSIS throughput decreased proportionately to the number of files being processed simultaneously.

 

Though I was (and still am) certain there was something I could do to improve SSIS parallel processing, I decided to look at alternative means of loading the files.

 

Something Old - the BULK INSERT command

Available since SQL Server 7.0, the BULK INSERT command is used to load data from a file into a table/view.  Though not as flexible as an SSIS package, it is fairly customizable – with the ability to load from both local and remote files in a number of formats.  BULK INSERT arguments are available to control transaction size, redirect errors (and specify the maximum number of errors allowed), and tweak things like locking behavior and table trigger firing.
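
To make those options concrete, a minimal example might look like the following; the path, table name, and option values are illustrative placeholders rather than tuned recommendations.

-- Illustrative only: placeholder path, table, and option values.
BULK INSERT dbo.stgExample
FROM '\\SomeServer\SomeShare\somefile.csv'
WITH (
    FIRSTROW = 2,                        -- skip a header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 50000,                   -- rows per transaction (batch)
    MAXERRORS = 500,                     -- maximum bad rows before the load fails
    ERRORFILE = 'C:\load\somefile.err',  -- redirect rejected rows to a file
    TABLOCK,                             -- take a table-level lock for faster loading
    FIRE_TRIGGERS                        -- fire insert triggers on the target table
);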

As shown in listing 1, I created a T-SQL script with a cursor to grab the file path for each series (note that in the SSIS package, I iterated through the file paths via a ForEach loop task).  Within the cursor, I called the BULK INSERT command to load each series into staging tables; after the cursor completed, I ran a stored procedure ([dbo].[spI_SeriesValue]) to merge the staged results with the destination table.

The first time I ran this script in Management Studio, nothing seemed to happen – I sat nonplussed as the query result window displayed an empty grid for several seconds.  Then, as if roused from a deep slumber, SQL Server seemed to roar in response.  I didn’t believe what I was seeing, so I re-ran the script.  There it was again (even a bit faster the second time) – the script completed in 5 seconds!  Here was a solution that could potentially reload the entire set of series in about 25 minutes.

After testing with a larger number of series, however, I discovered my script did not quite scale linearly (e.g., 5,000 series took about 32 seconds to process – roughly a 20% drop in throughput).  At the same time, I also wanted to see if I could reduce the data/log file writes related to the use of staging tables.

 

Something New - Memory-Optimized Tables

Most of the new capabilities in SQL Server 2014 are on the transactional side of the house; one such feature, originally codenamed Hekaton, provides a means of storing a table in memory.  These Memory-Optimized tables are targeted at OLTP applications, where a heavily accessed table can benefit from the inherent performance advantages of memory over disk.  These tables can be defined as either durable (data inserts and modifications are, in addition to being held in memory, written to disk and available after a system restart) or non-durable (data inserts/modifications are not written to disk; after a system restart, the table is empty, though the schema is still available).  Listing 2 shows an example of creating a non-durable table.
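
For comparison, the durable flavor of the staging table in listing 2 would differ only in its DURABILITY setting; the sketch below uses the same columns, with a made-up table name.

-- Sketch: same columns as the non-durable table in listing 2, but rows are also
-- persisted to disk and survive a restart (DURABILITY = SCHEMA_AND_DATA).
CREATE TABLE [dbo].[stgSeriesValueDurable]
(
      [SeriesId] [int] NOT NULL,
      [Date] [date] NOT NULL,
      [Value] [decimal](28, 10) NULL,
      CONSTRAINT [stgSeriesValueDurable_primaryKey]
          PRIMARY KEY NONCLUSTERED ([SeriesId] ASC, [Date] ASC)
) WITH ( MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA );
GO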

I modified the code shown in listing 1 to use memory-optimized, non-durable, staging tables.  With the 1000 series test, the script performance was about the same as before (~5 seconds).  When I re-ran with a larger set (5000 series), the script completed in 22 seconds.

Now I had a solution that scaled linearly – and minimized writing to the data/log files (no disk writes, other than the final merge command).  As a final optimization, I wanted to see if I could run multiple versions of this script in parallel – and still maintain linear scale.

 

Something Borrowed - a trick to execute T-SQL commands asynchronously

I modified the script to accept a start/end range of data series, along with a batch value that controls how often rows are committed from the staging table to the destination table, and then saved it as a stored procedure.  While I could have gone back to an SSIS package (with multiple Execute SQL tasks calling the stored procedure) to achieve parallel processing, I wanted to find a T-SQL approach that could be called/executed from the database server.  By default, a T-SQL stored procedure/script executes each command synchronously.  An internet search for “asynchronous T-SQL” yields several possibilities - including SQL Service Broker, CLR-based stored procedures, and dynamic creation/execution of SQL Agent jobs.
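
As a sketch of what that parameterization might look like (the procedure and parameter names here are hypothetical placeholders, not the actual production code):

-- Hypothetical signature: load series @Start through @Finish, merging from the
-- staging table to the destination table after every @BatchSize series.
CREATE PROCEDURE dbo.spLoadSeriesRange
    @Start     int,
    @Finish    int,
    @BatchSize int = 1000
AS
BEGIN
    SET NOCOUNT ON;
    -- body: the cursor/BULK INSERT loop from listing 1, restricted to
    -- SeriesId BETWEEN @Start AND @Finish, with a call to the merge
    -- procedure after every @BatchSize series processed
END;
GO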

In the end, I used SQL Agent, setting up four jobs – three of which are started asynchronously from the fourth job.  The T-SQL for the fourth job is shown in listing 3 (note that after this script is run, you can check on progress by using typical SQL Agent tools, like the Job Activity Monitor).  I discovered/borrowed this technique from an online post written by Antonin Foller.  Note: Antonin’s code snippet actually creates a SQL Agent job on the fly, executes the job, and then deletes it!  For the time being, I’ve taken a slightly more “hard-coded” approach.
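
For reference, a rough sketch of that create/run/delete variant is shown below.  The job name, database name, and stored procedure call are placeholders (the procedure matches the hypothetical sketch earlier in this article), and @delete_level = 3 tells SQL Agent to remove the job automatically once it completes.

-- Sketch: create a throwaway SQL Agent job, start it, and let Agent delete it on completion.
DECLARE @jobName sysname = N'AsyncLoad_' + CONVERT(nvarchar(36), NEWID());

EXEC msdb.dbo.sp_add_job
     @job_name = @jobName,
     @delete_level = 3;                 -- 3 = always delete the job after it finishes

EXEC msdb.dbo.sp_add_jobstep
     @job_name = @jobName,
     @step_name = N'Load series range',
     @subsystem = N'TSQL',
     @database_name = N'FRED',          -- placeholder database name
     @command = N'EXEC dbo.spLoadSeriesRange @Start = 0, @Finish = 5000;';  -- hypothetical proc

EXEC msdb.dbo.sp_add_jobserver
     @job_name = @jobName;              -- target the local server

EXEC msdb.dbo.sp_start_job
     @job_name = @jobName;              -- returns immediately; the job runs asynchronously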

With my final solution, I was able to re-process all data series in under 15 minutes.  Interestingly, my “server” (actually, a VM running on a laptop) still had plenty of available CPU/Memory, so it’s likely I could run additional jobs in parallel to further reduce processing time.  But now, the bottleneck is no longer SQL Server – it is the (lack of) network bandwidth available to re-download the data series!

 

Conclusion

A nightly re-load of 250,000+ delimited files is likely not a common use-case.  But, with the three techniques I’ve described (BULK INSERT, memory-optimized tables, and asynchronous T-SQL execution), you may be able to leverage one (or more) of them in your environment.

 

 

Resources

 

Listing 1 – T-SQL Script with BULK INSERT command

-- Clean up prior run

TRUNCATE TABLE dbo.stgSeriesValueNonMemoryOpt;

 

-- Track Performance of this script

DECLARE @tStart DATETIME2, @tEnd DATETIME2;

SET @tStart = GETDATE();

 

 

-- Declare a Cursor to grab the Series file path, define a starting and ending series

DECLARE @Start int = 0, @Finish int = 5000;

DECLARE @FilePath VARCHAR(255), @SeriesName varchar(50), @SeriesId int;

DECLARE @Path varchar(100);

 

-- BULK INSERT can grab from a local or remote (UNC) path

SET @path = '\\132.132.0.5\c$\downloadedfiles\fred\FRED2_csv_2\data\';

-- We will build the BULK INSERT TSQL command dynamically, and then call via an EXEC command

DECLARE @sql NVARCHAR(4000);

 

-- Temporary Staging Table

IF OBJECT_ID('tempdb..#stgSeriesValue') IS NOT NULL

  TRUNCATE TABLE #stgSeriesValue;

ELSE

  CREATE TABLE #stgSeriesValue ([Date] date NOT NULL, Value decimal(28,10) NULL);

 

DECLARE db_cursor CURSOR FOR 

SELECT @path + [File], SeriesName, SeriesId FROM  dbo.Series WHERE SeriesId BETWEEN @Start AND @Finish ORDER BY SeriesId;

 

OPEN db_cursor  

FETCH NEXT FROM db_cursor INTO @FilePath, @SeriesName, @SeriesId 

 

WHILE @@FETCH_STATUS = 0  

BEGIN  

       Print @FilePath;

         -- First BulkLoad the series into a staging table containing Date and Value

         SET @sql = 'BULK INSERT #stgSeriesValue FROM ''' + @FilePath + ''' WITH ( FIRSTROW = 2, FIELDTERMINATOR ='','', ROWTERMINATOR =''\n'', MAXERRORS = 500 )';

         EXEC(@sql);

        

         -- Update a 2nd staged Table with the Series Id

         -- BULK INSERT doesn't let you append additional columns to a source file, so this is the best workaround I could come up with...

         INSERT INTO dbo.stgSeriesValueNonMemoryOpt (SeriesId, Date, Value) SELECT @SeriesId, Date, Value FROM #stgSeriesValue;

        

            -- Delete from the first staging table to prepare for the next series....

            DELETE FROM #stgSeriesValue;

 

       FETCH NEXT FROM db_cursor INTO @FilePath, @SeriesName, @SeriesId;

END  

 

CLOSE db_cursor;  

DEALLOCATE db_cursor;

 

-- Data is loaded in the staging table; now we call a stored procedure to merge everything into the destination table

--    Note: this stored proc makes use of the MERGE command introduced in SQL Server 2008.

EXECUTE [dbo].[spI_SeriesValueNonMemoryOpt] @Start, @Finish;

 

-- Clean Up staging tables

DROP TABLE #stgSeriesValue;

TRUNCATE TABLE dbo.stgSeriesValueNonMemoryOpt;

 

-- Get Performance Results

SET @tEnd = GETDATE();

SELECT DATEDIFF(second,@tStart,@tEnd) AS elapsed_seconds;
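
The merge procedure itself isn’t shown in this article, but a simplified sketch of the MERGE pattern it refers to might look like the following.  Here, dbo.SeriesValue is an assumed name for the destination table, and this is not the actual [dbo].[spI_SeriesValueNonMemoryOpt].

-- Simplified sketch of the staging-to-destination merge (assumed destination dbo.SeriesValue).
CREATE PROCEDURE dbo.spI_SeriesValue_Sketch @Start int, @Finish int
AS
BEGIN
    SET NOCOUNT ON;

    MERGE dbo.SeriesValue AS target
    USING (SELECT SeriesId, [Date], Value
           FROM dbo.stgSeriesValueNonMemoryOpt
           WHERE SeriesId BETWEEN @Start AND @Finish) AS source
       ON target.SeriesId = source.SeriesId
      AND target.[Date]   = source.[Date]
    WHEN MATCHED THEN
        UPDATE SET target.Value = source.Value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (SeriesId, [Date], Value)
        VALUES (source.SeriesId, source.[Date], source.Value);
END;
GO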

 

 

Listing 2 – Creating a non-durable Memory-Optimized table

CREATE TABLE [dbo].[stgSeriesValue]

(

      [SeriesId] [int] NOT NULL,

      [Date] [date] NOT NULL,

      [Value] [decimal](28, 10) NULL,

 

CONSTRAINT [stgSeriesValue_primaryKey] PRIMARY KEY NONCLUSTERED

(

      [SeriesId] ASC,

      [Date] ASC

)

)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY )

 

GO
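
One prerequisite worth noting: a database needs a MEMORY_OPTIMIZED_DATA filegroup (with at least one container) before any memory-optimized table can be created, even a SCHEMA_ONLY one.  A minimal setup sketch, with placeholder database and path names:

-- One-time setup sketch; 'FRED' and the container path are placeholders.
ALTER DATABASE [FRED]
    ADD FILEGROUP [FRED_mod] CONTAINS MEMORY_OPTIMIZED_DATA;

ALTER DATABASE [FRED]
    ADD FILE (NAME = N'FRED_mod_container', FILENAME = N'C:\Data\FRED_mod')
    TO FILEGROUP [FRED_mod];
GO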

 

 

Listing 3 – Calling a SQL Agent Job from T-SQL to achieve asynchronous operations
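
Because msdb.dbo.sp_start_job returns as soon as SQL Agent has been asked to start a job, rather than waiting for the job to finish, the three calls below kick off their jobs concurrently without blocking one another.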

 

EXEC msdb.dbo.sp_start_job N'Process_SeriesValue_01';

EXEC msdb.dbo.sp_start_job N'Process_SeriesValue_02';

EXEC msdb.dbo.sp_start_job N'Process_SeriesValue_03';
