Exchange 2007 SP1's Standby Continuous Replication
Add site resilience to your business continuity options with this new feature
July 7, 2008
Standby continuous replication (SCR) is the big new feature in Microsoft Exchange Server 2007 SP1. SCR uses continuous log replication, or log shipping. You configure servers in a remote location (typically another data center) as targets to accept replicated transaction logs from source servers and to use the data to update local copies of mailbox databases. If catastrophe strikes the source server, you can use the copies of the mailbox databases to restore messaging service. The value of SCR is that it adds site resilience to the list of options you can consider when you plan for business continuity.
Because it builds on the concepts used by local continuous replication (LCR) and cluster continuous replication (CCR), setting up SCR should be familiar if you've worked with either of those Exchange 2007 features. The magic in SCR is the straightforward process for restoring service when a failure occurs. The procedures used to deploy and use SCR will vary greatly, depending on the unique circumstances of each company, such as the service level agreement that dictates how quickly mail service has to be restored following failure, but the general approach is fairly simple. First let's take a look at the prerequisites for implementing SCR, then I'll show you how to set it up and the steps to recovery in the event of a failure.
Planning for SCR
SCR adds to CCR by providing an extra level of resilience in case of a server failure. Exchange ships logs generated from the source server over the network to the target server, where the Microsoft Exchange Replication Service replays the logs to update a local copy of the database. To use SCR in your environment, you need to ensure all the prerequisites are met.
You can use standalone, single copy cluster (SCC), or CCR Mailbox servers as SCR sources. Think of SCC as “classic clustering” where a single set of data is shared by multiple servers. Storage groups (SGs) for SCR can hold only a single database, which is in line with Microsoft’s current design philosophy to focus on databases as the management object you're building high availability around rather than SGs that can hold multiple databases. Interestingly, a single source can have multiple targets: Exchange sends copies of logs to different points in the network, each of which replays the logs to build its own copy of the database. There's no hardcoded limit to the number of targets you can configure for a source, but every target you add increases the load on the source server to replicate logs, on the network to transport the logs, and on the target servers to replay the logs.
Microsoft recommends that you configure no more than four targets for a single source, but even four might be too many. You need to perform tests to verify the best setup for your environment before you proceed to deployment.
You can use a single server as the target for multiple SCR sources, which presents an interesting problem if you have SGs with the same name on different source servers. Consider the problem of sorting out databases if you replicate a similarly named database from four SGs called First Storage Group to a single target! This situation isn't a problem for Exchange because it uses globally unique identifiers (GUIDs) to identify SGs and databases; it's the administrator who will be confused. Best practice is therefore to ensure that the display names for SGs and databases are unique, perhaps by including the name of the server in the names that you assign to these objects. If you have objects with similar names, you can rename databases and SGs to create a more understandable management environment.
The same problem exists with the folder paths that you assign to SGs. SCR creates the same folder paths on the target server as are used for the SG on the source. Therefore, if you use D:\SG1 as the path for an SG on the source server, SCR creates D:\SG1 on the target. This scheme works well as long as you don't want to use the same paths for multiple SGs on different servers. There's no way for SCR to resolve the path conflict, so you can't replicate SGs from multiple source servers that use the same paths to a single target server. Clearly, some planning, and potentially some cleanup, is required for both SG names and folder paths before you deploy SCR.
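The name and path checks described above lend themselves to a quick pre-deployment audit. The following is a minimal sketch, not a Microsoft-supplied tool: the function name Find-ScrConflict is hypothetical, and it assumes input objects that expose Server, Name, and LogFolderPath properties, which is the shape that Get-StorageGroup returns. It checks only the log folder path; the system folder path could be checked the same way.

```powershell
# Sketch: flag storage groups that would collide on a single SCR target.
# Find-ScrConflict is a hypothetical helper name.
function Find-ScrConflict {
    param($StorageGroups)

    # Each check groups the input by one property (Group-Object ignores case).
    $checks = @(
        @{ Label = 'name';     Key = { [string]$_.Name } },
        @{ Label = 'log path'; Key = { [string]$_.LogFolderPath } }
    )

    foreach ($check in $checks) {
        $StorageGroups |
            Group-Object -Property $check.Key |
            Where-Object { $_.Count -gt 1 } |
            ForEach-Object {
                # List the servers that share the duplicated value.
                $servers = [string]::Join(', ', @($_.Group | ForEach-Object { [string]$_.Server }))
                "Duplicate {0} '{1}' on: {2}" -f $check.Label, $_.Name, $servers
            }
    }
}

# Example use in EMS (XYZ-MBX2 is a hypothetical second source server):
# Find-ScrConflict (@(Get-StorageGroup -Server 'XYZ-MBX1') + @(Get-StorageGroup -Server 'XYZ-MBX2'))
```

Any line of output identifies a name or path that needs to be changed before the SGs involved can share a target.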
SCR target servers must have the Mailbox server role installed, even if no mailboxes are present. A target can be either a standalone Mailbox server or a passive node in a failover cluster (but not a clustered Mailbox server). If you use a standalone server, you can't run LCR for any SGs on the server because Exchange uses the Replication Service to process incoming logs from the source server. Target servers must be in the same Active Directory (AD) domain as source servers (but can be in different sites), and they must have the same installation paths for Exchange binaries and files as the source server. If the paths don't match, the RecoverCMS option fails when you run the Exchange Setup program to activate the target.
The source and target servers must both be running the same version of Windows (e.g., Windows Server 2003 SP2, Windows Server 2008) as well as Exchange 2007 SP1. As new service packs and versions become available, you'll probably need to keep source and target servers at the same software level because Microsoft might upgrade the SCR code or something else in the Exchange Information Store that could cause an inconsistency that leads to a failure to recover data. Given that SCR is now part of Exchange and therefore forms part of backward compatibility testing, the likelihood of database mismatches is reasonably small, but it’s better to be paranoid about data than to run the risk that errors might occur.
You can find a checklist of these prerequisites in the sidebar, "Prerequisites for Exchange 2007 SP1's SCR."
Setting Up SCR
You have to manage SCR through Exchange 2007's Exchange Management Shell (EMS) because the current version of Exchange Management Console (EMC) doesn't include any controls for SCR. All the work necessary to set up replication to a target server and then to switch the database copy from target into production is manual: You must either perform the steps one by one through EMS or create a script containing all the necessary commands.
To enable the selected SG for SCR, you use the Enable-StorageGroupCopy cmdlet. For example, entering
Enable-StorageGroupCopy -id 'XYZ-MBX1\Super Critical Mailboxes' `
-StandByMachine 'SCRTargetServer' -ReplayLagTime 0.3:0:0 `
-TruncationLagTime 1.0:0:0
instructs Exchange to enable SCR for the Super Critical Mailboxes SG. XYZ-MBX1 is the source server and SCRTargetServer is the target. The ReplayLagTime parameter specifies the time that the Replication Service on the target server waits before it replays copied logs. The default lag time is 24 hours and the maximum you can set is 7 days. If the default value is used, for example, Exchange waits 24 hours after copying a log to a target server before replaying that log into the database copy. If a corruption occurs on the source server, you don't want it to replicate to the copy of the database, so the lag time gives you the opportunity to recognize that a problem exists and switch to the database copy before the corruption is replicated into it.
Waiting 24 hours before replaying transaction logs gives you a lot of protection against corruption, but you might want to have the database copy more up-to-date in case you need to use it. The example above set the parameter to three hours to accelerate log replay. If you want, you can set the lag time to 0.0:0:0 to force Exchange to replay the logs on the target server immediately after the Replication Service inspects and verifies them. This approach keeps the copy database very close to the original, but replication will transport corruption in almost real time to the copy. The trick is to set a time lag that satisfies your need to protect against corruption while keeping the database copy almost up-to-date. Defining best practice for time lags is likely to be a subject of much debate over the next few years. For now, companies seem to be settling on between three and six hours as the interval used in deployment.
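Lag values are written as days.hours:minutes:seconds, so it's easy to set minutes when you mean hours. The sketch below shows one way to sanity-check a value before passing it to Enable-StorageGroupCopy. The function name Test-ScrReplayLagTime is hypothetical, and the check relies on the standard .NET TimeSpan parser rather than on anything specific to Exchange.

```powershell
# Sketch: validate a ReplayLagTime string (days.hours:minutes:seconds).
# Test-ScrReplayLagTime is a hypothetical helper name.
function Test-ScrReplayLagTime {
    param([string]$Value)

    # Parse the value the same way .NET reads a TimeSpan.
    $span = [TimeSpan]::Parse($Value)
    $max  = [TimeSpan]::FromDays(7)    # the 7-day maximum mentioned above

    if ($span -lt [TimeSpan]::Zero -or $span -gt $max) {
        throw "Replay lag time '$Value' is outside the 0 to 7 day range."
    }
    $span
}

(Test-ScrReplayLagTime '0.3:0:0').TotalHours      # three hours
(Test-ScrReplayLagTime '0.0:3:0').TotalMinutes    # three minutes, not three hours
```

Printing TotalHours for the value you intend to use is a cheap way to confirm that the lag is what you expect before you enable the copy.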
The TruncationLagTime parameter defines how long the Replication Service on the source server waits before it truncates (i.e., removes) log files that the target server has successfully replayed into its database copy. Note that you can't change the value of these parameters (TruncationLagTime and ReplayLagTime) dynamically. If you need to change them, you first have to disable SCR for the SG with the Disable-StorageGroupCopy cmdlet:
Disable-StorageGroupCopy -id 'XYZ-MBX1\Super Critical Mailboxes' `
-StandByMachine 'SCRTargetServer'
Then you can re-enable SCR with the desired parameter values.
Enable-StorageGroupCopy creates a copy on the target server of the source SG, database, and closed transaction log files; the copy has the same data paths as the original. Exchange performs this work in the background; depending on network conditions, it might take a few minutes for the command to complete.
After it creates the necessary files and directories on the target server, Exchange begins log replication from the source server. The Replication Service on the target server accepts and inspects the new logs as they arrive to ensure they're valid. If a log fails inspection, the Replication Service attempts to copy it again. If the Replication Service fails to copy the log, it generates an error. The target server always waits to accumulate at least 50 log files before it replays the logs into its database copy. This behavior persists for the entire SCR process: The target server always remains 50 log files behind the source server. In effect, Exchange uses these 50 log files as a buffer to avoid the need to reseed a database following a failure. You can't change the number of logs from 50; this value is hardcoded. So, the actual replay lag time for replicated logs is ReplayLagTime or 50 logs, whichever is longer. If you configure the replay lag time to be more than an hour, you'll find that Exchange replicates more than 50 logs during that time (even on a server that's under a small load), so the replay lag time usually governs when Exchange replays the logs into the database copy.
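The "whichever is longer" rule can be expressed as a small calculation. This is a rough sketch under a stated assumption: you supply a measured log generation rate for the SG (the LogsPerHour parameter), which the article doesn't provide. The function name Get-ScrEffectiveReplayLag is hypothetical.

```powershell
# Sketch: estimate how far behind the source the target's replay runs.
# Get-ScrEffectiveReplayLag and -LogsPerHour are illustrative assumptions.
function Get-ScrEffectiveReplayLag {
    param([TimeSpan]$ReplayLagTime, [double]$LogsPerHour)

    if ($LogsPerHour -le 0) { throw 'LogsPerHour must be greater than zero.' }

    # Time the source needs to generate the fixed 50-log buffer.
    $bufferLag = [TimeSpan]::FromHours(50 / $LogsPerHour)

    # Replay waits for whichever condition takes longer to satisfy.
    if ($bufferLag -gt $ReplayLagTime) { $bufferLag } else { $ReplayLagTime }
}

Get-ScrEffectiveReplayLag -ReplayLagTime '0.3:0:0' -LogsPerHour 100   # ReplayLagTime governs
Get-ScrEffectiveReplayLag -ReplayLagTime '0.1:0:0' -LogsPerHour 10    # the 50-log buffer governs
```

On a busy server the configured lag almost always wins; the 50-log buffer matters mainly for quiet SGs or very short lag settings.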
If you use SCR to protect a new SG, 50 log files might not be available to copy. However, because log files in Exchange 2007 are limited in size to 1MB, you can force Exchange to begin replaying logs on the target server by generating more than 50MB of traffic to mailboxes in the source database. Either move some mailboxes to the source database or send messages large enough to generate the necessary traffic. When the target server accumulates 50 logs, the Store begins replaying the logs to update its database copy.
EMC doesn't display status information about SCR for SGs, so you use the Get-StorageGroupCopyStatus cmdlet to monitor the health of SCR and report on the status of the SG, as follows:
Get-StorageGroupCopyStatus -id 'XYZ-MBX1\Super Critical Mailboxes' `
-StandByMachine 'SCRTargetServer'
Figure 1 shows the output from this command. The fields reported here are:
SummaryCopyStatus—You'll typically see the status Healthy here, but you could see Suspended (an administrator has stopped SCR temporarily), Service Down (the Replication Service isn't running), or Failed (an error has occurred that requires administrative intervention).
CopyQueueLength—This field shows the number of logs waiting to be copied to the target.
ReplayQueueLength—This field shows the number of copied logs that are waiting to be replayed on the target server.
LastInspectedLogTime—This is the timestamp for the log that the Replication Service on the target server last inspected successfully.
You can get additional information by piping the output of Get-StorageGroupCopyStatus to the Format-List command. Figure 2 shows an example of expanded output for an SG. In this example, the source server has copied 10,146 logs to the target, and no logs are waiting to be copied. The target server has replayed only 10,096 logs into its copy of the database because of the forced 50-log delay, so the replay queue depth is 50.
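Because EMC offers no view of SCR health, a scheduled check built on these fields can be useful. The following is a minimal sketch that turns status objects into a one-line verdict. The function name Test-ScrCopyHealth and the copy queue threshold of 10 are assumptions for illustration; choose a threshold that fits your own log generation rate.

```powershell
# Sketch: evaluate objects that have SummaryCopyStatus and CopyQueueLength
# properties, as reported by Get-StorageGroupCopyStatus.
# Test-ScrCopyHealth and the default threshold are hypothetical.
function Test-ScrCopyHealth {
    param([int]$MaxCopyQueue = 10)

    process {
        if ("$($_.SummaryCopyStatus)" -ne 'Healthy') {
            # Suspended, Service Down, or Failed all need attention.
            "WARNING: copy status is $($_.SummaryCopyStatus)"
        }
        elseif ($_.CopyQueueLength -gt $MaxCopyQueue) {
            # Logs are piling up on the source faster than they are copied.
            "WARNING: $($_.CopyQueueLength) logs are waiting to be copied"
        }
        else {
            'OK'
        }
    }
}

# Example use in EMS:
# Get-StorageGroupCopyStatus -id 'XYZ-MBX1\Super Critical Mailboxes' `
#   -StandByMachine 'SCRTargetServer' | Test-ScrCopyHealth
```

The sketch deliberately ignores ReplayQueueLength, because a nonzero replay queue is normal given the replay lag and the 50-log buffer.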
Of course, standard monitoring procedures apply when you're using SCR; you need to check the application event logs for errors that Exchange reports for the Store or the Extensible Storage Engine (ESE) to ensure that you pick up any early signs that problems lurk in a database. SCR doesn't remove the need for good management practices.
5 Steps to Recovering SCR Mailboxes
With SCR protecting an SG, you can now investigate how to quickly recover from a failure that renders the source server inoperable. Without SCR, you could create a database on another server and use the database portability feature introduced in Exchange 2007 to change the mailbox configuration information in AD to point to the new database. The problem with this method is that the new database is just that: new. Users will be able to send and receive mail after the database comes online, but they'll have lost access to all the data that was in their original mailboxes.
With SCR, you have a copy of the source database on the target server, but no function exists to transform an SCR copy directly into a live database (although this could appear in a future version of Exchange). To make that copy the active database and restore user access to mailbox data, follow these steps:
Create a new SG and mailbox database on the target server. This process gives you stub objects in the Exchange configuration data in AD that you can later use to transfer mailbox configurations.
Delete the mailbox database that you just created.
Use the Move-StorageGroupPath and Move-DatabasePath commands to change paths stored in AD for your stub SG and database to the paths for the SCR logs and database.
Transfer the mailbox configuration data so that Exchange directs users to the SCR copy.
Test whether users can reconnect successfully to their mailboxes.
You can't perform any of these steps unless the Mailbox server role is present on the target server, which is why you must install the Mailbox role on any server you want to use as an SCR target.
Expect some data loss with SCR. After all, you don't go through a recovery exercise for fun: Something catastrophic makes the transfer necessary. However, given an appropriate replay lag time, the copy of the database on the target will be reasonably up-to-date, and the vast majority of user data will be available.
Recovery Process in Detail
Although the recovery operation requires only five logical steps, the devil is in the details, so you need to examine the commands and potential pitfalls you can encounter during a typical recovery operation. It's important to note that you can automate most of this work with Windows PowerShell, through EMS, to enable a faster recovery that isn't exposed to the potential of mistakes people make when working under stressful conditions—which are the usual conditions that exist when data centers experience problems.
The first step is to create the stub SG and mailbox database on the target server. Mount the database to ensure that everything is OK, then dismount it. You can perform these actions through EMC or EMS. In fact, you can prepare an SG and database well in advance of any problem, which lets you avoid doing this work under pressure when you need to recover a database. Here are sample EMS commands to create the SG and mailbox database, then mount and dismount the database:
New-StorageGroup -Name 'Recovery SG' -Server XYZ-Target-Server `
-SystemFolderPath 'L:\Recovery' -LogFolderPath 'L:\Recovery'
New-MailboxDatabase -Name 'Recovery MBX' -StorageGroup 'Recovery SG' `
-EdbFilePath 'L:\Recovery\Mailbox Database.edb'
Mount-Database -id 'Recovery MBX'
Dismount-Database -id 'Recovery MBX' -Confirm:$False
You can now delete all the files for the database you just created. Later, you'll update the configuration data in AD to move the paths for the stub database to point to the SCR copy.
An outage now occurs, and you can no longer access the source database. You need to instruct Exchange to disable SCR for the source database, and you need to make the target copy mountable by using the Restore-StorageGroupCopy command:
Restore-StorageGroupCopy -id 'Super Critical Mailboxes' `
-StandByMachine 'SCRTargetServer' -Force
Restore-StorageGroupCopy verifies that all necessary logs are available in the target copy location. If any log files are missing, the task tries to copy them from the source if the source is available. You use the -Force parameter to tell Restore-StorageGroupCopy that it should continue even if it can't contact the source. Restore-StorageGroupCopy gives an indication whether you should expect some data loss, but because the exact nature of that loss depends on the contents of missing transaction logs, it's hard to predict the impact on users. In most cases, the loss is likely to be just a few logs, which is an acceptable amount of loss to get a live database running.
You've now switched to the target database. You expect some data loss because the source server has probably not had the chance to copy some transaction logs to the target. The target copy is therefore incomplete, and because the Store hasn't committed all of the copied transactions into it, the database is in what is known as a dirty shutdown state. The Store can't mount a database if it's in dirty shutdown. Before you can mount the database on the target, you need to check its shutdown status by running Eseutil with the /mh switch:
eseutil /mh "L:\SG1\First Storage Group\Mailbox Database.edb"
As Figure 3 shows, Eseutil reports the database state as Dirty Shutdown. You can fix the problem by running Eseutil with the /r switch to recover whatever data is possible from available transaction logs and to prepare the database to be mounted:
eseutil /r E00 /l "L:\SG1\First Storage Group" /d "L:\SG1\First Storage Group"
For details about running Eseutil and what the various command-line switches mean, see the Microsoft article "Eseutil," technet.microsoft.com/en-us/library/aa998249(EXCHG.80).aspx. If a lossy condition exists (i.e., you've lost some bits of data), Eseutil tells you to run the utility again and include the /a switch to let Exchange mount the database even if some transaction logs are missing:
eseutil /r E00 /l "L:\SG1\First Storage Group" /d "L:\SG1\First Storage Group" /a
Microsoft introduced the Eseutil /a switch in Exchange 2007 for just this purpose. Eseutil now recovers all the data that it can from any transaction logs that are available but haven't been replayed into the database. If the load on the source server was low when the outage occurred—meaning that it wasn't replicating many logs to the target—you might have just the default 50-log buffer to replay. If the load was heavy or you use an extended lag time, there will be more logs to process. In either case, Eseutil should complete its process in a few minutes. If you run Eseutil /mh again, you should find that the database state is now Clean Shutdown, so you can proceed to switch the database to take the place of the stub database you created earlier.
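If you script the recovery, you'll want to read the shutdown state programmatically rather than by eye. The sketch below pulls the State value out of the text that Eseutil /mh writes. It assumes the header dump contains a line of the form "State: Dirty Shutdown" or "State: Clean Shutdown", and the function name Get-EseDatabaseState is hypothetical.

```powershell
# Sketch: extract the database state from "eseutil /mh" output lines.
# Get-EseDatabaseState is a hypothetical helper name.
function Get-EseDatabaseState {
    begin { $state = $null }
    process {
        # Keep the first line that looks like "State: <value>".
        if ($state -eq $null -and $_ -match '^\s*State:\s*(.+?)\s*$') {
            $state = $matches[1]
        }
    }
    end { $state }
}

# Example use on the target server:
# eseutil /mh "L:\SG1\First Storage Group\Mailbox Database.edb" | Get-EseDatabaseState
```

A script can then run the Eseutil /r step only when the returned value isn't Clean Shutdown, and stop with an error if the state is still dirty afterward.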
To make the switch, use the Move-StorageGroupPath and Move-DatabasePath commands to change the path for the SG (the log files) and the stub database to the location of the recovered database and its logs. You can use the -ConfigurationOnly parameter to instruct Exchange to update only its configuration data in AD; no physical files are moved. You use the Set-MailboxDatabase command to update the properties of the mailbox database object. This action tells Exchange that you have replaced the file in much the same way as if you had restored the database from a backup. Here's the sequence of commands you need to enter:
Move-StorageGroupPath -id 'Recovery SG' `
-SystemFolderPath 'L:\SG1\First Storage Group' `
-LogFolderPath 'L:\SG1\First Storage Group' `
-ConfigurationOnly -Confirm:$False
Move-DatabasePath -id 'Recovery MBX' `
-EdbFilePath 'L:\SG1\First Storage Group\Mailbox Database.edb' `
-ConfigurationOnly -Confirm:$False
Set-MailboxDatabase -id 'Recovery SG\Recovery MBX' -AllowFileRestore $True
You should now be able to mount the recovered database:
Mount-Database -id 'Recovery SG\Recovery MBX'
With the database mounted, you can update the configuration data for the mailboxes in the original database so that Exchange points to the recovered database. You do this by piping the output of the Get-Mailbox command (which you use to fetch details of all user mailboxes in the original database) to the Move-Mailbox command, specifying the -ConfigurationOnly switch, as follows:
Get-Mailbox -Database 'XYZ-MBX1\Critical Mailboxes' |
Where {$_.ObjectClass -NotMatch '(SystemAttendantMailbox|ExOLEDbSystemMailbox)'} |
Move-Mailbox -TargetDatabase 'SCRTargetServer\Recovery MBX' `
-ConfigurationOnly
The Where filter in this example serves to exclude system mailboxes from the move. Because Exchange needs to update only AD (no mailbox data moves), this operation should complete quickly. You can then use Outlook Web Access (OWA) to log on to a mailbox in the recovered database to check that the mailbox is available. OWA is easier to use for this purpose than Outlook because you don’t have to configure a profile for the mailbox and you can use OWA on any PC that has a suitable Web browser. Data from missing transaction logs is lost, but the vast majority of user folders and items will be intact.
AD replication can delay users from connecting to their mailboxes because the configuration updates you've made must be replicated throughout the organization before you can guarantee that every Client Access server knows where user mailboxes are now located. Outlook 2007 clients should update their profiles automatically through the Autodiscover service, and Outlook 2003 clients can update their profiles if the old server is still online. If the server is offline—which is likely following a serious data center outage—users have to update their profiles manually to point to the server that now hosts their mailbox.
Of course, it's unlikely that a recovery server will be capable of supporting the same kind of user load as a well-maintained production server, but it will be able to provide mail connectivity until you fix the problem that caused the data center outage and bring the original server or a replacement online. At that point, you can move the mailboxes back from the recovery server and re-establish SCR.
An Excellent Addition to Exchange
It takes time for any new high availability technology to be implemented and most Exchange administrators haven't yet had the opportunity to fully understand SCR in production. Given the chance to learn—by working through a few disasters!—you'll better understand how this technology really works when it's put under pressure. But for now the signs are promising that SCR is an excellent addition to the Exchange administrator’s armory.