Synchronizing on Demand
Use forced replication to synchronize data in transactional replication
October 18, 2004
After the World Trade Center attacks of September 11, 2001, many organizations reconsidered how they manage and protect crucial business data. Maintaining a remote data center with a second set of databases and application servers has become common practice. Businesses have been willing to accept the cost of additional personnel, data centers, and changes in application design because they want to be able to switch operations to a different location quickly in the event of a disaster. For example, after the September 11 attacks, a large bank where I recently worked as a consultant with Microsoft Consulting Services began requiring that all crucial applications be able to function from either of two locations hundreds of miles apart with no more than 2 hours of downtime.
These tighter business requirements are a challenge to database architects. Most organizations' databases constantly grow, yet maintenance and downtime windows are kept to a minimum. Databases must be highly available and ready to function in distributed environments, and data in the primary (Publisher) and secondary (Subscriber) databases must stay synchronized at all times.
Several solutions support high availability for SQL Server 2000 databases in different locations, including log shipping, third-party hardware or software solutions such as geographical clusters or EMC's SRDF remote-storage replication solution, and transactional replication. In most high-availability solutions, the Subscriber is completely or partially unavailable during the data-synchronization process. For example, in log shipping, the Subscriber database is available for reading only when new transaction logs aren't being applied. In the expensive SRDF hardware solution, the Subscriber database is unavailable during normal operations; it serves only as a source for restoring data if the Publisher database goes down. Only transactional replication allows full use of the Subscriber database all the time. This solution is available in all SQL Server editions, so you can avoid extra expenditures on third-party software and still get the maximum use of the hardware you already have. That's why many companies choose transactional replication as their high-availability solution.
But transactional replication doesn't automatically solve the problem of data synchronization; it requires efficient business processes and a highly skilled DBA team to create and support the solution. Let's look at the high-availability problem the bank I worked with faced in implementing transactional replication and continuous data synchronization and see how I solved the problem by using a new method I call forced replication.
Confronting the Problem
Using replication for very large databases (VLDBs) isn't simple when a business requires the ability to shift operations to a secondary site on short notice. Database personnel must be sure that data is synchronized all the time, and they usually have only a small window of time to fix any problems that arise in a disaster situation.
Several possible scenarios in transactional replication might cause data on the Publisher and Subscriber to become out of sync, which could create problems for the business if a disaster occurred. For example, say the remote Distributor (the server that stores metadata and history data and temporarily stores replicated transactions) crashes and, for business reasons, the company can't immediately stop accepting new data on the Publisher (as is the case with our bank). The Publisher will continue to accumulate new data, and the databases on the Publisher and Subscriber will become out of sync. Or, a subscription might expire for some reason—for example, failure of a replication agent—and changes on the Publisher wouldn't be propagated to the Subscriber. Or, a user might inadvertently—or maliciously—run a DELETE command on the Subscriber and remove already propagated data. In such a case, the Subscriber would have fewer records than the Publisher. If you've used transactional replication, you've probably faced this situation at least once.
To avoid having unsynchronized data, the bank asked me to develop a mechanism of online data synchronization for the banking system that replicates two VLDBs with a combined size of approximately 500GB from a server in New York to a server in Delaware, as Figure 1 shows. Additionally, each site has reporting servers that contain identical reporting data—a subset of the data from the VLDBs. In an emergency, the bank has to be able to easily shift its data load from the New York server to the Delaware server and reverse the replication data flow between the two OLTP servers (i.e., from the Delaware server to the New York server). In the bank's original replication plan, a scheduled job runs every hour to compare the number of records in corresponding tables on the Publisher and Subscribers. Of course, the job algorithm takes delivery latency into account. If the job finds an inconsistency in the databases, the bank needs to immediately synchronize the data.
To accomplish this quick synchronization, I couldn't use conventional techniques such as applying publication snapshots, traditional backup and restore, or Data Transformation Services (DTS). Because a snapshot requires that you copy all the data, not just the part you need, preparing and applying snapshots for large databases by using SQL Server Enterprise Manager's Snapshot Agent can take hours or even days, and the bank doesn't have the luxury of time. The bank needs to synchronize only the data that has changed since the last time the scheduled job ran successfully—just a few hours of new information. Preparing a special differential snapshot of the affected tables, which have hundreds of thousands of rows, would be a waste of time and resources.
Traditional backup-and-restore technology wouldn't work either because it would replace the entire Subscriber database with the Publisher's information, which in the case of the reporting server would be inappropriate. The bank's reporting servers contain only a subset of tables replicated from the OLTP servers, along with additional tables for reporting. Another reason I couldn't use a typical restore from backup is that such a restore requires exclusive access to the Subscriber. If the Publisher accepted new data during the restore, the data would immediately be out of sync again. But the bank couldn't afford to shut down the Publisher.
We also couldn't use the DTS Import/Export Wizard. The wizard gives you the option of using a query to specify the data to transfer so that you can copy to the Subscriber a subset of the Publisher's data. But you have to recreate the destination tables or DTS will append the data you're restoring to the existing tables. Appending the rows might create duplicate records on the Subscriber if your query doesn't exactly define all the missing rows. In addition, with DTS, you need to define a query for each table, which could be a big task if you have to synchronize many tables. And the task becomes even less convenient if you have several Subscribers (as our bank has) and the wizard must run several times. The bank needed a new approach.
As a solution to the problem, I created and successfully implemented a methodology I call forced replication. This technique can synchronize the data between selected articles without stopping online operations on the Publisher and Subscribers, and you can schedule forced replication to run any time. You can apply this methodology to any database system that uses transactional replication, but it especially benefits applications that work with large databases (100GB or more) and that would require too much time for data synchronization with snapshots or other backup-and-restore methods.
What Is Forced Replication?
In solving the bank's replication problem, I found that the best tool for synchronizing data is replication itself, so I developed forced replication to synchronize on demand only the parts of the database that need updating. This technique is fast and doesn't disrupt user access to data. To use forced replication after a disaster occurs, I first verify that both the Publisher and the Subscriber are available again and the system's replication configuration is restored. For example, if a Distributor failed, I would make sure it's back online. Then, at a time that won't interfere with business processes, I run commands that resend the parts of the data that have changed since the last time the data was synchronized. These commands either insert missing data or replace existing data without duplicating it.
The forced-replication technique is based on default SQL Server replication behavior. SQL Server replicates updates that are issued against any column from the table's primary key (except an IDENTITY column) as a combination of DELETE and INSERT commands. SQL Server replicates updates against any other table columns simply as UPDATE commands. Let's look at how this behavior works.
Say the Pubs database includes a publication called Test that contains just one article, Table1. Listing 1 shows the code that creates Table1, which has a primary key defined on the integer column f1, the varchar column f2, and the datetime column f3. You want to synchronize Table1's data on the Publisher and all Subscribers. (Listing 2's code creates the Test publication, the Table1 article, and a subscription.) Assume that the Distributor, Publisher, and Subscriber servers are already configured. Subscribers might have different numbers of records or different data in some columns because they aren't synchronized yet.
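Listings 1 and 2 aren't reproduced in the text, but if you want to experiment before downloading them, the following simplified sketch is consistent with the description above. The varchar length, constraint name, and article options here are my assumptions, and a real script would also create the publication and subscription with sp_addpublication and sp_addsubscription:

CREATE TABLE dbo.Table1 (
  f1 int NOT NULL CONSTRAINT PK_Table1 PRIMARY KEY,  -- integer primary key column
  f2 varchar(10) NOT NULL,                           -- varchar column
  f3 datetime NOT NULL                               -- datetime column used later for filtering
)
GO
-- Add the article with custom CALL-format stored procedures (simplified; option values assumed)
EXEC sp_addarticle
  @publication = N'Test', @article = N'Table1',
  @source_owner = N'dbo', @source_object = N'Table1',
  @ins_cmd = N'CALL sp_MSins_Table1',
  @del_cmd = N'CALL sp_MSdel_Table1',
  @upd_cmd = N'CALL sp_MSupd_Table1'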
If you run the following simple UPDATE command on the Publisher
UPDATE Table1 SET f1=f1 WHERE f3 < '01/03/2004'
SQL Server doesn't change the existing data on the Publisher server but instead delivers the command to the Subscribers as two new stored procedure calls for DELETE and INSERT:
{CALL sp_MSdel_Table1 (1)}
{CALL sp_MSins_Table1 (1, 'AA', '01/03/2004')}
Listing 3 shows the code that creates these stored procedures, which are similar to those that SQL Server 2000 automatically generates if you use a publication creation wizard instead of running Listing 2's scripts to configure the Test publication.
If a record that you're replicating already exists on the Subscriber, the stored procedures will delete and immediately replace the existing record. If a record is missing, the stored procedures will insert it where it belongs in Table1. Note in Listing 3 that the custom stored procedure sp_MSdel_Table1 doesn't generate an error message if it doesn't find a corresponding record on the Subscriber (as SQL Server does by default); I removed that part of the code from the system-generated stored procedure to avoid creating an error message that would stop the Distribution Agent. This change is the only difference between the system-generated stored procedures and the custom ones I use for forced replication.
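Listing 3 isn't shown here either, but the shape of the modified delete procedure is easy to sketch. In this assumed version, the only change from the system-generated procedure is that the row-not-found check, and the error it raises, is left out so a missing Subscriber row doesn't stop the Distribution Agent:

CREATE PROCEDURE sp_MSdel_Table1
  @pkc1 int   -- primary key value of the row being deleted
AS
DELETE dbo.Table1
WHERE f1 = @pkc1
-- The system-generated procedure would follow the DELETE with a check along the lines of
-- IF @@rowcount = 0 RAISERROR(...) for the "row was not found at the Subscriber" error,
-- which would stop the Distribution Agent; that check is deliberately omitted here.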
To view how the Distributor delivers commands to the Subscriber, stop the Distribution Agent in Replication Monitor, as Figure 2 shows. Next, run the preceding UPDATE command against the Table1 article again. Then, on the Distribution server, run the following command in Query Analyzer:
EXEC distribution.dbo.sp_browsereplcmds
As Figure 3 shows, SQL Server transforms the UPDATE command on the Publisher into two CALL commands—one for a deletion and one for an insertion—on the Distributor.
Scenarios for Using Forced Replication
You can use forced replication to synchronize data from the Publisher to the Subscriber or from the Subscriber to the Publisher. In most cases, when data is out of sync, it's because the Publisher has more data than the Subscriber; in other words, INSERT or UPDATE commands on the Publisher haven't been propagated to the Subscriber. To synchronize the data, you simply need to run on the Publisher an UPDATE command against one of the primary key columns, replacing the column's value with an identical value. The UPDATE command will either delete and reinsert the records without duplicating them or insert missing records into the corresponding Subscriber tables. You can use a WHERE clause to restrict the amount of affected data. For example, as Figure 3 shows, you can update only records with f3 values before January 3, 2004 (f3 < '01/03/2004').
In rare cases, you might have more data on one or more Subscribers than you do on the Publisher. For example, if a user runs a malicious INSERT command on a Subscriber or if a DELETE statement on the Publisher wasn't propagated to the Subscribers, you'd need to remove the extra data from the Subscribers. To handle such situations, I added an extra datetime column called repl_time to each article on every Subscriber. This method might sound like it requires a lot of additions, but the cleanup itself is needed only in rare situations, and the setup happens just once: before configuring replication, you run the script that callout A in Listing 4 shows, which adds the extra datetime column to each image table (the table that's part of replication) on the Subscriber. You don't need to add the column to the Subscriber's own tables. Every time a record is inserted or updated on the Subscriber, the proper custom stored procedure (either for insert or update) sets the value in the repl_time column to the default value, getdate(). The code at callout B in Listing 4 shows the stored procedures sp_MSins_Table1 and sp_MSupd_Table1, which run on the Subscriber and change the value of repl_time. (Note that the insert and update stored procedures are different in Listing 4 than they are in Listing 3, but the sp_MSdel_Table1 stored procedure remains the same as it is in Listing 3.)
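Listing 4 itself comes with the article download, so the following is only a sketch of the idea; the column length, constraint name, and parameter names are assumptions:

-- Run once on each Subscriber image table before configuring replication (callout A's idea)
ALTER TABLE dbo.Table1 ADD repl_time datetime NOT NULL
  CONSTRAINT DF_Table1_repl_time DEFAULT (getdate())
GO
-- Subscriber-side insert procedure stamps repl_time for every replicated insert (callout B's idea)
CREATE PROCEDURE sp_MSins_Table1
  @c1 int, @c2 varchar(10), @c3 datetime
AS
INSERT INTO dbo.Table1 (f1, f2, f3, repl_time)
VALUES (@c1, @c2, @c3, getdate())
GO
-- Subscriber-side update procedure refreshes repl_time for every replicated update
CREATE PROCEDURE sp_MSupd_Table1
  @c1 int, @c2 varchar(10), @c3 datetime,
  @pkc1 int   -- old primary key value used to locate the row
AS
UPDATE dbo.Table1
SET f2 = @c2, f3 = @c3, repl_time = getdate()
WHERE f1 = @pkc1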
When you run on the Publisher UPDATE statements that replace a value with the same value—such as the preceding UPDATE statement that sets the value for column f1—all Subscriber records that have corresponding records on the Publisher will get new values in their repl_time columns. Subscriber records that don't match any records on the Publisher will keep the old values in this field. To remove any extra records from the Subscriber, you can run a DELETE command that includes the repl_time column in a WHERE clause. For example, if you run forced replication at 11:00 a.m. on August 27, 2004, the following command
DELETE Table1 WHERE repl_time < '2004-08-27 11:00:00'
will remove from the Subscriber the records that have repl_time values from before 11:00 a.m. on August 27, 2004.
Note that you can also create the repl_time column on the Publisher and insert default NULL values into that column during the data load. During replication, data in the repl_time column changes only on the Subscribers. This technique keeps the table structure identical across all servers. For information about how to maintain data integrity when tables are related through foreign key constraints, see the sidebar "Handling Referential Integrity."
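On the Publisher, the extra column is just a placeholder, so a one-line change such as the following (an assumed sketch, not the article's exact script) is enough:

ALTER TABLE dbo.Table1 ADD repl_time datetime NULL  -- stays NULL on the Publisher; only the Subscriber procedures populate it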
Using Forced Replication to Synchronize Large Tables
To implement forced replication, you need only two new stored procedures: one for small (or lookup) tables and another for large tables. Depending on your table's size, you might decide to run the UPDATE command either against selected rows (by using a WHERE clause) or against the entire table (by leaving out the WHERE clause). If your table is large (many thousands of rows), you should run the UPDATE command against a limited number of records to avoid long-running transactions and to decrease the chance of blocking other users.
If your table has columns that allow data partitioning, you can use those columns to replicate (and synchronize) only part of the table. In most cases, you can use a datetime column to filter the data. You can create a stored procedure, such as the sp_ForceRepl stored procedure that Listing 5 shows, to filter records for replication. The sp_ForceRepl stored procedure takes four parameters: TableName, BatchSize, StartDate, and EndDate. Let's take a look at how this stored procedure works. First, the code selects into the temporary table #Table1 all primary keys for records that satisfy the criteria you specify in the WHERE clause on the f3 column; in Listing 5, it selects records with values in column f3 between @sStartDate and @sEndDate. Then, it selects into the #Table1_Temp temporary table the number of primary keys from #Table1 that you specify in the @iBatchSize parameter and updates the records in Table1 that have those keys. At the end of Listing 5's loop, the code removes the processed primary keys from #Table1 and truncates the #Table1_Temp table in preparation for the next cycle. You can create a similar filtering process for any other large table.
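Because Listing 5 ships with the article download, the following is only a simplified sketch of its batching loop, hardcoded for Table1 rather than driven by the @TableName parameter and dynamic SQL; the SET ROWCOUNT batching is my assumption about the implementation:

CREATE PROCEDURE sp_ForceRepl_Table1_Sketch
  @iBatchSize int,
  @sStartDate datetime,
  @sEndDate   datetime
AS
SET NOCOUNT ON

-- Collect the primary keys of every row in the requested date range
SELECT f1 INTO #Table1
FROM dbo.Table1
WHERE f3 BETWEEN @sStartDate AND @sEndDate

CREATE TABLE #Table1_Temp (f1 int NOT NULL PRIMARY KEY)

WHILE EXISTS (SELECT * FROM #Table1)
BEGIN
  -- Move the next batch of keys into the working table
  SET ROWCOUNT @iBatchSize
  INSERT INTO #Table1_Temp (f1) SELECT f1 FROM #Table1
  SET ROWCOUNT 0

  -- The no-op update: replicated to the Subscribers as DELETE and INSERT pairs
  UPDATE t
  SET f1 = t.f1
  FROM dbo.Table1 t
  INNER JOIN #Table1_Temp k ON t.f1 = k.f1

  -- Drop the processed keys and reset the working table for the next cycle
  DELETE p
  FROM #Table1 p
  INNER JOIN #Table1_Temp k ON p.f1 = k.f1
  TRUNCATE TABLE #Table1_Temp
END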
Using Forced Replication to Synchronize Small or Lookup Tables
Sometimes, the tables you're working with are relatively small or can't be partitioned, as with lookup tables, which are usually much smaller than operational tables that you can partition by such criteria as ID or time. If you need to replicate a table that can't be partitioned, you can run against all the table's records an UPDATE statement that doesn't change the value in the selected column. Because forced replication implementation for lookups doesn't require the preliminary step of selecting records that satisfy given criteria (as with the sp_ForceRepl stored procedure in Listing 5), you can write a generic stored procedure that works against all lookup tables. Listing 6 shows the code that creates such a stored procedure, sp_ForceRepl_Lookup. Note that Listing 6's code runs one UPDATE command for an entire lookup table instead of breaking the process into several batches, as you would with a large table.
The code in Listing 7 creates in the msdb database a table called ForcedReplTblExclude that stores the names of the large tables from all the publications that you need to process separately by using batches. Then, sp_ForceRepl_Lookup loops through all tables except those that are listed in the ForcedReplTblExclude table, finds the first column in the selected tables' primary keys, and constructs and executes the UPDATE statements that synchronize the data.
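Listings 6 and 7 also come with the article download; the following outline captures the same idea, but the cursor, the INFORMATION_SCHEMA lookup, and the identity check are my guesses at the implementation details:

-- Exclusion list of large tables that sp_ForceRepl handles in batches (Listing 7's idea)
CREATE TABLE msdb.dbo.ForcedReplTblExclude (TableName sysname NOT NULL PRIMARY KEY)
GO
-- Generic no-op update for every remaining (lookup) table (Listing 6's idea)
CREATE PROCEDURE sp_ForceRepl_Lookup_Sketch
AS
SET NOCOUNT ON
DECLARE @sTable sysname, @sKeyCol sysname, @sSql nvarchar(4000)

DECLARE cTables CURSOR LOCAL FAST_FORWARD FOR
  SELECT name FROM sysobjects
  WHERE type = 'U'
    AND name NOT IN (SELECT TableName FROM msdb.dbo.ForcedReplTblExclude)

OPEN cTables
FETCH NEXT FROM cTables INTO @sTable
WHILE @@FETCH_STATUS = 0
BEGIN
  -- Find the first non-identity column of the table's primary key
  SET @sKeyCol = NULL
  SELECT TOP 1 @sKeyCol = kcu.COLUMN_NAME
  FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
  INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
    ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
  WHERE tc.TABLE_NAME = @sTable
    AND tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
    AND COLUMNPROPERTY(OBJECT_ID(@sTable), kcu.COLUMN_NAME, 'IsIdentity') = 0
  ORDER BY kcu.ORDINAL_POSITION

  IF @sKeyCol IS NOT NULL
  BEGIN
    -- No-op update against the key column, replicated as DELETE and INSERT pairs
    SET @sSql = N'UPDATE ' + QUOTENAME(@sTable)
              + N' SET ' + QUOTENAME(@sKeyCol) + N' = ' + QUOTENAME(@sKeyCol)
    EXEC (@sSql)
  END

  FETCH NEXT FROM cTables INTO @sTable
END
CLOSE cTables
DEALLOCATE cTables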
Putting It All Together
Now you're ready to implement forced replication. You need to follow several steps.
1. Verify that all foreign keys on the Publisher are configured with the NOT FOR REPLICATION option.
2. Add the repl_time column to each table on the Subscriber, and modify the replication stored procedures for the INSERT, UPDATE, and DELETE commands for all articles on the Subscriber. (To learn more about the options you can choose when configuring replication, see the Web sidebar "Additional Facts About Configuring Transactional Replication" at InstantDoc ID 44150.)
3. Create the ForcedReplTblExclude table to hold a list of all non-lookup tables (e.g., those that you can partition by datetime or ID).
4. Create the sp_ForceRepl stored procedure, which executes for each partitioned table an UPDATE command that replaces a value with the same value.
5. Create the sp_ForceRepl_Lookup stored procedure to handle forced replication for all other tables.
6. Create a DTS package, a data link file, and a configuration INI file, as the Web sidebar "Creating a DTS Package to Run Forced Replication" (InstantDoc ID 44149) describes.
7. Create a job that executes the DTS package, but don't schedule this job to run. The DBA should start it manually whenever data synchronization is needed; a minimal sketch of such a job follows this list.
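For the final step, an on-demand job might look like the following sketch. The job, server, and package names here are placeholders rather than the article's actual names, and your environment will dictate the dtsrun security options:

USE msdb
GO
EXEC sp_add_job @job_name = N'Forced Replication', @enabled = 1
EXEC sp_add_jobstep
  @job_name  = N'Forced Replication',
  @step_name = N'Run forced-replication DTS package',
  @subsystem = N'CmdExec',
  @command   = N'dtsrun /S MyServer /N ForcedReplPackage /E'   -- placeholder server and package names
EXEC sp_add_jobserver @job_name = N'Forced Replication', @server_name = N'(local)'
-- Note there's no sp_add_jobschedule call: the DBA starts the job manually with
-- sp_start_job 'Forced Replication' only when the hourly comparison job finds out-of-sync data.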
Having in your arsenal a tool that implements forced replication will make your life easier. Such a tool lets you quickly synchronize the data at different sites after a disaster has happened—but don't wait for disaster to strike. Use the scripts I provide in this article to implement a forced-replication solution in your development environment. There, you'll see the benefits of the tool because in a development environment, you're likely to have more replication interruptions than in your well-monitored, stable production environment. Having the tool ready for production will give your company a valuable, cost-saving technique for switching database operations to a secondary data center.