After the Crash
Backing up and restoring transactional replication setups
August 23, 2004
If your system must be highly synchronized but can tolerate a certain amount of latency as well as some data loss when conflicts arise, SQL Server 2000's transactional replication with queued updates (TRQU) can be a useful way to replicate data to multiple, always-on database servers. In the first two articles in this series about TRQU-"Queuing Up," December 2003 and "Looking Under the Hood," I showed how to set up TRQU and looked at how SQL Server replicates data and resolves data conflicts. In this article, the last in the series, I examine the intricacies of backing up and restoring TRQU setups.
The following steps, which Figure 1 shows, summarize the TRQU process. For the transactions that follow a forward path from Publisher to Subscriber:
1. SQL Server applies transactions at the Publisher server to the Publisher database log.
2. SQL Server specially marks for replication the transactions that involve the replicated articles (tables).
3. The Log Reader agent running at the Distributor server periodically reads the Publisher database log, picks up the transactions marked for replication, then applies them to a table in the Distributor database on the Distributor server.
4. The Distribution Agent then runs on the Distributor server (for a push subscription) and applies those transactions to the Subscriber database on the Subscriber server.
For the transactions that follow a reverse path from Subscriber to Publisher:
1. SQL Server records external transactions made by end users or end-user applications at the Subscriber database in the queue table in the Subscriber database on the Subscriber server.
2. The Queue Reader agent runs periodically on the Distributor server to pick up the transactions from the queue table on the Subscriber server, checks for and reconciles data conflicts, then applies the transactions to the Publisher database on the Publisher server.
In a basic transactional-replication scenario that uses SQL Server's default settings, the Log Reader agent reads a transaction that SQL Server has marked for replication in the Publisher database log. Then, after moving the transaction to the Distributor database, the Log Reader agent unmarks the transaction. Unmarking the transaction is a signal for SQL Server to remove the transaction from the log after the log backup's next run. This backup process is consistent and highly effective-until disaster strikes. Then, the quirks inherent in TRQU implementations cause anomalies that require creative problem-solving. (For suggestions about how to prepare your replication setup for a disaster, see "Preparing Disaster Scenarios.")
Publisher Database Crash
In most setups, the Log Reader agent runs in continuous mode and reads marked transactions from the log more frequently than SQL Server backs up the log. For example, let's say that two transactions that need replicating take place at the Publisher. During the next Log Reader agent run, the agent picks up these transactions quickly from the log, moves them to the Distributor, then unmarks them.
If the Log Reader agent run takes place at time A, these transactions typically remain in the Publisher's active database log until the next log backup takes place (time B). At time B, SQL Server backs up and truncates the transactions from the active Publisher log.
Problem. What happens when the Publisher database crashes between times A and B, and SQL Server can't back up the log? In this situation, SQL Server has copied the transactions to the Distributor and perhaps to the Subscriber, but not to the Publisher database's log backup. Therefore, if you restore the Publisher to the last good log backup, the Publisher database doesn't contain the transactions, even though SQL Server has copied these transactions to the Distributor and perhaps to the Subscriber-or will copy them to the Subscriber during the next Distribution agent run. This type of failure compromises the system's data consistency, and, depending on the nature of the production data, serious problems can arise.
Solution. SQL Server 7.0 and earlier releases offer no easy way around this problem. DBAs who work with these releases have to manage the problem more through access and timing control than through an integrated SQL Server mechanism. The good news is that SQL Server 2000 provides a feature for resolving this inconsistency easily. You merely set an option called sync with backup by running the following T-SQL command in the master database on the Publisher server:
EXEC master..sp_replicationdboption @PubDBName ,'sync with backup', 'true'
This setting tells the Log Reader agent to read only those marked transactions in the log that SQL Server has backed up. After the log backup is complete, SQL Server updates the log record by noting that the transactions have been backed up. Then, during its next run, the Log Reader agent reads the transactions and unmarks the transactions that were marked to denote they needed replication. The next log-truncation process removes those transactions from the Publisher database's active log. This small, clever feature maintains data consistency in the Publisher backup.
Distributor Database Crash
Now, let's say that the Log Reader agent has read the transactions and SQL Server has backed them up on the Publisher and deleted them from the Publisher log. At this stage, the transactions are on the Distributor in the Distributor database, and the Distribution agent will apply them to the Subscriber during the next Distribution agent run.
Problem. What happens when the Distributor database crashes before the Distribution agent can run and SQL Server hasn't backed up those transactions on the Distributor? Data inconsistency again exists-the Publisher has copies of the transactions, but the Distribution agent hasn't copied these transactions to the Subscriber. Restoring the Distribution database to the last good log backup won't retrieve those transactions because SQL Server never backed them up.
Solution. Using sync with backup again is the solution. This time, however, you need to set the option for the database on the Distributor server. When you set this option, SQL Server can't delete the transactions from the Publisher database log until it backs them up at the Distributor server as part of the Distributor database's log backup.
At this stage, the Log Reader agent has read the transactions, the Publisher log backup has backed up the transactions on the Publisher server, but SQL Server hasn't backed up the Distributor database log. Under these circumstances, SQL Server doesn't delete the transactions in the active Publisher database log until it has backed up the transactions on the Distributor. After the Distributor database's log backup is complete, SQL Server marks the transactions in the Publisher's active log. Then, the transactions are deleted from the Publisher database's active log during the next truncation process. Therefore, if the Distribution database crashes before SQL Server has backed up the transactions on the Distributor, data-consistency problems don't occur. After the Distributor database restore, the Log Reader agent picks up the transactions from the active Publisher log based on the transactionID.
Subscriber Database Crash
Now you know how you can prevent disaster when the Publisher and Distributor databases crash. However, the possibility of disaster doesn't end with those two scenarios.
Problem. Sometimes in a TRQU project, updates occur at the Subscriber as well as at the Publisher. What do you do when the Subscriber database crashes?
Solution. As long as the replicated transactions are stored in the Distributor database, you haven't lost anything. You can use the @min_distretention option of the sp_adddistributiondb stored procedure that sets the minimum transaction retention period for the Distributor database. The distribution-cleanup process ensures that the process doesn't delete transactions from the msrepl_commands table in the Distributor database that occur within that minimum retention period. So, if the Subscriber database crashes, you can restore it to the last good backup. And, assuming that your minimum transaction-retention period is more than the time elapsed since the last good backup of the Subscriber database, the replicated transactions are safe. The Distribution agent determines the state of the Subscriber database and starts reapplying transactions from the point to which the database has been restored. Therefore, you should consider using options such as sync with backup and minimum history retention when you estimate the log sizes and size of the database (e.g., for the msrepl_commands table at the Distributor, the Queue table at the Subscriber, or conflict tables at the Publisher and the Subscriber).
Queue Complications
The sync with backup solutions to the Publisher and Distributor database crashes work when the updates take place only at the Publisher. However, as I noted earlier, TRQU updates can take place simultaneously in more than one database. What happens in a TRQU scenario in which updates are applied simultaneously at the Subscriber and at the Publisher when the sync with backup option is on? Visualize the following scenario: An update takes place at the Subscriber and this transaction is recorded in the queue table on the Subscriber. The Queue Reader agent reads this transaction and applies it to the Publisher (at which stage SQL Server makes a log entry in the Publisher log). The Queue Reader agent then deletes the transaction from the queue table on the Subscriber, but before you can back up the Publisher log, the Publisher database crashes, and you can't back up the transaction log.
Problem. Unlike with the forward path (from Publisher to Subscriber), in which the transactions are preserved in the distribution database, no transactions are preserved in the reverse direction. This loss of data occurs because the Queue Reader deletes the transactions from the queue table after applying them to the Publisher.
Solution. Typically, if you want to minimize the loss of transactions in such a disaster scenario, you undo replication, get the failed SQL Server database up and running, then set up replication again with the roles reversed. The original Subscriber then becomes the Publisher, and the original Publisher becomes the new Subscriber.
However, even when the Publisher database has crashed, you can salvage your data if you can back up the Publisher log. To accomplish this backup, you need to set up the TRQU to allow the Log Reader to read transactions from the available Publisher database log (after the log backup), stop all the replication agents, then restore the Publisher database to the last backed-up log. After you restart all the agents, you can resume replication.
Prepare, Rehearse, and Shine
he scenarios for backing up and restoring replication setups that I've described in this article are only a few of the situations you might encounter. Remember that your work isn't done after you set up TRQU and get it working. You must also prepare thoroughly for disasters and rehearse the backup and restore plans so that if a production disaster does strike, your data is safe and quickly recoverable.
About the Author
You May Also Like