A Data Purge Process

By purging the data that's no longer needed to meet legal and business requirements, companies can improve the overall daily performance of their SQL Server systems and more precisely estimate future storage and server needs.

Sanjay Joshi

February 11, 2011


To meet legal requirements, companies might need to retain SQL Server data for a certain length of time. Companies might also want to retain SQL Server data for a certain time period for their own business requirements. Companies can control when the data can be purged from the system as long as the boundaries enforced by the legal requirements aren't affected. By purging the data that's no longer needed to meet legal and business requirements, companies can improve the overall daily performance of their SQL Server systems and more precisely estimate future storage and server needs. Thus, ROI will improve for existing hardware, maintenance, and future hardware requisitions.

Related: 4 Ways to Purge Non-Partitioned Data

An application's software development life cycle (SDLC) documentation should detail how long its data needs to be available. In practice, however, data retention guidelines rarely make it into SDLC documentation unless a business function depends on that data. The picture is even bleaker for vendor-supplied software: unless there's a business function dependency, most vendor software simply accumulates data in history tables indefinitely.

DBAs mainly concentrate on daily server operations and upkeep, so they seldom have time to become familiar with every application and keep track of the data that's no longer needed to meet legal and business requirements. However, when you don't regularly purge this data, tables can snowball to a size that hurts performance and that can be purged only during a maintenance window. To avoid this, I created a purge process that automatically purges the data that's no longer needed to meet legal and business requirements. After explaining this process, I'll walk you through a sample implementation.

The Purge Process

The purge process moves data between and deletes data from three categories of data, or data sets:

  • The current data set, which contains data that needs to be available to users. Users can change or review the data.

  • The history data set, which contains data that can only be reviewed or aggregated by users. Users can't change the data.

  • The dormant data set, which contains data that falls outside the availability threshold but needs to be stored to meet legal or business requirements. This data is archived, so it can't be changed, reviewed, or aggregated by users.

The purge process is configurable and dynamic, so new categories or subcategories can be added to accommodate changes in legal and business requirements. For example, the current data set could be further classified into read-only data and updateable data. By hosting the read-only data on read-only file groups, you can improve SQL Server performance.
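Marking a filegroup read-only is a one-line change. The following is a minimal sketch for SQL Server 2005; the filegroup name ReadOnlyData is illustrative, and READONLY is the filegroup keyword on that release (later versions prefer READ_ONLY):

  -- Mark the filegroup that holds the read-only portion of the
  -- current data set as read-only (names are illustrative):
  ALTER DATABASE sqlMagSampleBankCurrent
    MODIFY FILEGROUP ReadOnlyData READONLY;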

A purge database and a stored procedure named purgeManager are used to control the purge process. The purge database stores the configuration information that purgeManager needs to purge data from the data sets.

When purgeManager is invoked, up to three purge tasks (one purge task for each data set) can execute. Whether or not a purge task executes is determined by the schedule (aka frequency) you set and the last time the purge task ran (aka last run date). This design allows for a rerun if necessary and makes the purge process re-entrant. Repeated executions (intentional or otherwise) won't have any negative effects.
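The download scripts contain the actual logic, but the re-entrancy check boils down to comparing the last run date against the task's frequency. Here's a minimal sketch for a daily task; the variable names and the purgeTaskId key column are my assumptions, not the exact names in the scripts:

  DECLARE @lastRun datetime, @purgeTaskId int;
  -- (in purgeManager these values come from the objectPurgeSchedule row)
  SELECT @purgeTaskId = 1, @lastRun = DATEADD(dd, -1, GETDATE());

  -- Run the task only if a full scheduling interval (daily, here) has
  -- elapsed since the last run; otherwise the call is a harmless no-op:
  IF DATEADD(dd, 1, @lastRun) <= GETDATE()
  BEGIN
      -- ...execute the purge task here, then record the new last run date:
      UPDATE sqlMagSamplePurge.dbo.objectPurgeSchedule
         SET lastRun = GETDATE()
       WHERE purgeTaskId = @purgeTaskId;  -- purgeTaskId is an assumed key column
  END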

Dynamic queries are used to purge the data. Some of these queries use the OUTPUT clause in a DELETE statement to move the data that's being purged. The purge is performed in batches, whose size you can configure. That way, you can avoid problems associated with long-running transactions.
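The general shape of such a batched purge is a loop that keeps deleting until no qualifying rows remain. The following sketch uses illustrative table names and assumes the source and target tables have identical column lists; it isn't a copy of the download scripts:

  DECLARE @batch int, @rowCnt int, @cutoffDate datetime;
  SELECT @batch = 1000, @rowCnt = 1, @cutoffDate = DATEADD(dd, -45, GETDATE());

  WHILE @rowCnt > 0
  BEGIN
      -- Delete at most @batch rows per statement and copy them to the
      -- target table in the same statement via the OUTPUT clause:
      DELETE TOP (@batch) dbo.SourceTable
        OUTPUT deleted.* INTO dbo.TargetTable
        WHERE trnDt < @cutoffDate;
      SELECT @rowCnt = @@ROWCOUNT;
  END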

The purge process captures any errors. When one occurs, it makes a note in the SQL Server error log and gracefully exits.
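The article doesn't show the exact error-handling code, but on SQL Server 2005 a TRY...CATCH block that writes to the error log and exits would look something like this sketch (the @dynamicSql placeholder and message text are illustrative):

  DECLARE @dynamicSql nvarchar(4000);
  SELECT @dynamicSql = N'SELECT 1';  -- placeholder for a generated purge statement

  BEGIN TRY
      EXEC (@dynamicSql);
  END TRY
  BEGIN CATCH
      DECLARE @errMsg nvarchar(2048);
      SELECT @errMsg = 'purgeManager error: ' + ERROR_MESSAGE();
      -- WITH LOG writes the message to the SQL Server error log
      RAISERROR(@errMsg, 16, 1) WITH LOG;
      RETURN;  -- exit gracefully
  END CATCH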

An Example

Using the three data sets I just mentioned, let's walk through an example. Suppose an international bank defines the following categories for its SQL Server 2005 platform based on its legal and business requirements:

  • Current data set: This data set contains data from bank transactions that occurred in the last 45 days.

  • History data set: This data set contains data from bank transactions that occurred in the last 13 months, excluding the most recent 45 days.

  • Dormant data set: This data set contains data from bank transactions that occurred in the past eight years, excluding the most recent 13 months, to comply with IRS rules. Unlike the current and history data, the dormant data doesn't need to capture all the transaction details. However, only US currency is allowed in the dormant data set, so when appropriate, the transaction amount is multiplied by the exchange rate during the purge process.

The bank decides to use the following purge schedule:

  • Every day, purge data older than 45 days from the current data set, moving it to the history data set.

  • The first day of every month, purge data older than 13 months from the history data set, moving it to the dormant data set.

  • Every day, purge data older than eight years from the dormant data set.

To demonstrate the purge process for this scenario, I created five scripts, which you can download by going to the top of this page and clicking the Download the Code Here button. Let's look at what the scripts do and how to customize them for your environment:

S1_createDB.sql. This script creates four databases: sqlMagSamplePurge (the purge database), sqlMagSampleBankCurrent (the database for the current data set), sqlMagSampleBankHistory (the database for the history data set), and sqlMagSampleBankArchive (the database for the dormant data set). In the sqlMagSamplePurge database, the script creates a table named objectPurgeSchedule. In the other three databases, it creates a table named BankTransaction.

All three BankTransaction tables include the "eyesOnly" column, which lets you simulate different data volumes for testing. This column has the nchar(n) data type and contains no meaningful data. You can change the column's size to control how many data rows fit on one data page. Before you run S1_createDB.sql, you might want to customize the size of the "eyesOnly" column to suit your environment or to simulate a data volume test.
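Based on the column names that appear in Web Listing 1, the BankTransaction tables look roughly like the following sketch; the data types and sizes here are my assumptions, so check S1_createDB.sql for the actual definitions:

  CREATE TABLE dbo.BankTransaction (
      trnDt       datetime      NOT NULL,  -- transaction date
      trnSequence int           NOT NULL,  -- transaction sequence number
      acNo        int           NOT NULL,  -- account number
      amount      money         NOT NULL,  -- transaction amount
      fxRate      decimal(9,4)  NOT NULL,  -- exchange rate to US dollars
      eyesOnly    nchar(100)    NULL       -- filler column for volume tests
  );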

S2_insertData.sql. This script creates sample data for all four tables. In the three BankTransaction tables, the sample data includes transaction dates that span beyond each data set's boundaries (e.g., the current data set includes transactions older than 45 days) so that the purge process can be adequately tested. (If you want to see an efficient way to manipulate dates, check out the date functions in the script.)
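One common SQL Server 2005 idiom for that kind of date arithmetic, which may or may not be the exact approach the script takes, is the DATEADD/DATEDIFF pair:

  DECLARE @n int;
  SELECT @n = 45;  -- illustrative offset in days

  -- Strip the time portion to get midnight of the current day:
  SELECT DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0) AS midnightToday;

  -- A transaction date @n days in the past, at midnight:
  SELECT DATEADD(dd, -@n, DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0)) AS pastDate;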

To create the sample data, S2_insertData.sql uses the @dataSize variable to create blocks of data. By default, the @dataSize variable is set to 5, which means the script creates five blocks of 49 rows for the current data set, five blocks of 21 rows for the history data set, and five blocks of 119 rows for the dormant data set. You can change the @dataSize variable's value to create a smaller or larger volume of sample data.

Besides populating the BankTransaction tables, S2_insertData.sql populates the objectPurgeSchedule table with configuration information that the purgeManager stored procedure will need to execute the three purge tasks. This information includes a batch size. When a purge task executes, multiple rows might qualify for purging. The value in the objectPurgeSchedule table's "batch" column controls the number of rows in one DELETE operation, helping to avoid problems that result from long-running operations (e.g., the log file growing too large, other processes not getting enough resources).

For each purge task, S2_insertData.sql sets the batch size to two rows because there aren't many rows or columns in the sample BankTransaction data. Plus, a small batch size allows multiple runs, which showcases the design of the purge process. In a production environment, the batch size needs to be fine-tuned to account for the number of rows and columns in the tables.

The objectPurgeSchedule table also includes the "lastRun" column, which specifies the last run date of each purge task. S2_insertData.sql sets the last run date to the day before the current date.
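Putting those pieces together, one configuration row might look like the following sketch; apart from the batch and lastRun columns named in the article, the column names and values are assumptions, so check S2_insertData.sql for the real layout:

  INSERT INTO sqlMagSamplePurge.dbo.objectPurgeSchedule
      (purgeTask, frequency, batch, lastRun)
  VALUES
      ('currentToHistory',           -- which purge task this row configures
       'daily',                      -- how often the task should run
       2,                            -- rows per DELETE batch
       DATEADD(dd, -1, GETDATE()));  -- last run date: yesterday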

S3_purgeMgr.sql. This script creates the purgeManager stored procedure. This stored procedure uses the information in the purge database (in this case, sqlMagSamplePurge) to perform the valid purge tasks. As mentioned previously, whether or not a purge task executes is determined by that task's schedule and last run date, which allows a rerun if necessary. To rerun a purge task, you simply update the lastRun column for that task and rerun the stored procedure.
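For example, to force the history purge to run again, you could push its last run date back and execute the procedure; the purgeTask filter column and value are illustrative, and I'm assuming purgeManager lives in the purge database:

  UPDATE sqlMagSamplePurge.dbo.objectPurgeSchedule
     SET lastRun = DATEADD(dd, -1, lastRun)
   WHERE purgeTask = 'historyToDormant';  -- illustrative task identifier

  EXEC sqlMagSamplePurge.dbo.purgeManager;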

With dynamic queries, debugging can sometimes be challenging, so purgeManager has an optional input parameter named @debug. When this input parameter is set to 1, the stored procedure will print the dynamic T-SQL statement before executing it.
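For example, assuming purgeManager is created in the purge database and @debug is its only optional parameter:

  -- Print each dynamic T-SQL statement before it runs:
  EXEC sqlMagSamplePurge.dbo.purgeManager @debug = 1;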

S4_execAndReview.sql. This script shows the three data sets before the purge, executes the purgeManager stored procedure, then shows the three data sets after the purge. By default, the purgeManager stored procedure's @debug input parameter is set to 0. If you want to see the dynamic queries before they execute, set @debug to 1 in S4_execAndReview.sql. Web Listing 1 shows an example of what the dynamic queries look like. I added the comments.

S5_cleanup.sql. This script cleans up by removing the four databases.
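In essence, S5_cleanup.sql boils down to dropping the four sample databases (the actual script may add existence checks first):

  DROP DATABASE sqlMagSampleBankCurrent;
  DROP DATABASE sqlMagSampleBankHistory;
  DROP DATABASE sqlMagSampleBankArchive;
  DROP DATABASE sqlMagSamplePurge;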

Purging Helps Performance

Using the example I presented as a template, you can implement the purge process in your environment. Because the process has a data-driven design, you can easily customize it. By regularly purging the data that's no longer needed to meet your company's legal and business requirements, you can help keep your SQL Server systems performing well.

Web Listing 1: Sample Dynamic Queries

-- Purge the data older than 45 days from the current data set,
-- moving it to the history data set.
SELECT @runNext = DATEADD(dd,1,@runLast)
DELETE TOP (@batch) sqlMagSampleBankCurrent.dbo.BankTransaction
  OUTPUT deleted.* INTO sqlMagSampleBankHistory.dbo.BankTransaction
  WHERE trnDt < @runNext - 45
SELECT @rowCnt = @@ROWCOUNT

-- Purge the data older than 13 months from the history data set,
-- moving it to the dormant data set.
SELECT @runNext = DATEADD(mm,1,@runLast)
DELETE TOP (@batch) sqlMagSampleBankHistory.dbo.BankTransaction
  OUTPUT deleted.[trnDt], deleted.[trnSequence], deleted.[acNo],
    deleted.[amount] * deleted.[fxRate], deleted.[eyesOnly]
  INTO sqlMagSampleBankArchive.dbo.BankTransaction
  WHERE trnDt < DATEADD(mm,-13,@runNext)
SELECT @rowCnt = @@ROWCOUNT

-- Purge the data older than 8 years from the dormant data set.
SELECT @runNext = DATEADD(dd,1,@runLast)
DELETE TOP (@batch) sqlMagSampleBankArchive.dbo.BankTransaction
  WHERE trnDt < DATEADD(yy,-8,@runNext)
SELECT @rowCnt = @@ROWCOUNT

 
