Restoring a Broken Exchange Cluster
When your Exchange Server/Microsoft Cluster Server configuration is irretrievably broken, here are two procedures you can use to restore the cluster.
April 4, 2000
Choosing the best recovery procedure
Editor's Note: Only experienced administrators should implement the procedures this article describes. Microsoft Product Support Services (PSS) might not support the one-node cluster recovery procedure.
What do you do when your entire Microsoft Exchange Server/Microsoft Cluster Server (MSCS) configuration is irrevocably broken? For example, you might have destroyed the shared-disk configuration or damaged the data on the disks, or both local server hard disks might have failed. Or what do you do when one or all Exchange databases are corrupted or when your subsystem disk is full and the Exchange databases use all the disk space—that is, when you need to restore Exchange databases? In these circumstances, you can use one of the recovery procedures I describe here.
Choosing the Best Method
Two restore procedures exist. You can rebuild a two-node MSCS, then restore the Exchange databases on it. Or you can rebuild a one-node MSCS, perform the recovery steps you need to run Exchange services, then rebuild the other node after you've successfully restored Exchange services on the first node.
The two-node cluster recovery procedure is the safer of the two. You can request Microsoft support if something doesn't work during the recovery steps, whereas PSS might not help you if you have trouble restarting Exchange in a one-node configuration after a database restore. The main disadvantage of the two-node procedure is that it requires duplicate software installations, which delays the Exchange database restore process. For this reason, I recommend using this method when you have an undamaged two-node cluster configuration or when you're rebuilding the cluster outside of production hours.
Conversely, when you have to minimize your Exchange downtime, you can use the one-node cluster recovery procedure. With this approach, you don't need duplicate installations on cluster nodes to run Exchange services. This procedure is especially useful in situations in which mail and messaging is a crucial service. Moreover, you must use this method when you're temporarily using only one server in the cluster configuration or when you're rebuilding just one cluster node that is irrevocably broken. However, as you'll see, you must be very careful when rebuilding the second node; don't attempt a one-node recovery unless you have considerable experience with clusters.
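The trade-off between the two procedures can be summarized as a small decision function. This is only a sketch of the guidance above; the function name and its three Boolean inputs are hypothetical labels for the conditions the text describes.

```python
def choose_recovery_procedure(minimize_downtime: bool,
                              only_one_server_available: bool,
                              experienced_with_clusters: bool) -> str:
    """Return 'one-node' or 'two-node' following the guidance in the text."""
    # With a single usable server, the one-node procedure is the only option.
    if only_one_server_available:
        return "one-node"
    # Otherwise choose one-node only when downtime matters and the
    # administrator has considerable cluster experience.
    if minimize_downtime and experienced_with_clusters:
        return "one-node"
    # Default to the safer procedure that Microsoft can support.
    return "two-node"
```

For example, an experienced administrator rebuilding outside production hours would call choose_recovery_procedure(False, False, True) and get the two-node procedure.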
To simplify the steps, I assume that you must rebuild an MSCS that runs only Exchange services; I don't discuss recovery steps for other cluster-aware services. If you have additional services, add recovery steps for them to the procedures I describe. To cluster Exchange, you must be running Windows NT Server, Enterprise Edition (NTS/E) with at least Service Pack 4 (SP4) and Exchange Server 5.5, Enterprise Edition (Exchange 5.5/E) with at least SP2—Microsoft recommends SP3.
Two-Node Cluster Recovery Procedure
To restore a two-node cluster, you must make preparations, install a new cluster with the same cluster name, install Exchange Server in an appropriate temporary configuration, and restore the appropriate Exchange databases (i.e., both the Directory Store and the Information Store, or IS). For more information about installing an Exchange cluster server, refer to Dennis Lundtoft Thomsen, "Tips for Clustering Exchange Successfully," April 2000.
Make preparations. Before you begin the recovery, you must have these elements in place:
The minimal Exchange database backup. In other words, you need a regular online backup or an offline backup that contains the dir.edb, pub.edb, and priv.edb files of the broken Exchange cluster server. These databases must be consistent; that is, a 1013 error (Database corrupted) must not appear in the event log before you make the backup. You can ensure that the Directory and IS are consistent by running the DS/IS Consistency Adjuster. (For more information about this process, see Tony Redmond, "The Infamous DS/IS Consistency Adjuster," May 1998.)
A running PDC/BDC of the NT domain where the Exchange Service Account Admin resides. Otherwise, you must recover this PDC before you start the restore procedure.
A working RAID configuration for the array where the shared-disk system resides. If you've lost the RAID configuration, you must rebuild it before you restore the Exchange cluster. You don't need to configure the subsystem disk (e.g., RAID level, number of disks) exactly as it was before the failure; the only requirement is that you have enough shared-disk space to install Exchange Server and restore its databases.
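Before starting, it can help to confirm that the backup listing really contains the three database files named in the first prerequisite. The following is a minimal sketch, assuming you can export the backup catalog as a list of file paths; the function name and the sample paths are hypothetical.

```python
REQUIRED_DATABASES = ("dir.edb", "pub.edb", "priv.edb")

def missing_database_files(backup_file_names):
    """Return the required database files that the backup listing lacks."""
    # Compare on the base name only, ignoring case, because NT file
    # names are not case-sensitive.
    present = {name.replace("\\", "/").rsplit("/", 1)[-1].lower()
               for name in backup_file_names}
    return [db for db in REQUIRED_DATABASES if db not in present]
```

An empty result means all three files are in the listing. The check says nothing about whether the databases are consistent; that still requires the event log review described above.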
Install MSCS. To begin the restore, you need to install MSCS with the same cluster name as the failed cluster. Follow these steps for a two-node restore:
Rebuild a two-node MSCS with the same configuration (i.e., NT domain role and network name) as the broken cluster.
Update each cluster node to the same NT service pack you were running when you made the last backup.
If you use backup software other than NT Backup, install the backup software on each node and apply any fixes that were present when you made the backup.
Install Exchange Server in a temporary configuration. To ensure that your Exchange configuration will match the configuration in the production site, you must rebuild Exchange Server so that it's identical to the broken server's configuration. Here's what you do:
Install Exchange on this cluster with the same organization name, site name, server name, and Exchange service NT account SID as the broken Exchange Server. (Using these same names and SID account is essential for successfully restoring the cluster, but these elements aren't related to the cluster technology.)
Install any additional Exchange services (e.g., Event Service, Outlook Web Access—OWA) you had on your broken cluster. (Note that clusters don't officially support OWA, although some administrators have been able to implement it.)
Install the connectors (e.g., X.400, the Internet Mail Service—IMS) that were on the broken Exchange cluster. You don't need to configure these connectors exactly as they were on the broken server.
If the broken Exchange cluster had cluster-aware antivirus software, install and configure it as it was on the broken server.
Update the Exchange cluster server to the Exchange service pack that was running when you made the last backup.
Select one resource (e.g., the Microsoft Exchange System Attendant) in the Exchange cluster group. Select the resource properties, enter the active node in the Possible owners box, and delete the inactive node from the list. This setting, which Screen 1 shows, will eliminate the possibility that the Exchange cluster group will fail over to the inactive node, in case a problem (e.g., a power outage) occurs while you're restoring Exchange on the active node.
From the Cluster Administrator program, select the option to take all Exchange services offline.
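The first step in this list depends on four values matching the broken server exactly. A short sketch of that comparison follows. The field names, the dictionary layout, and the sample values in the usage note are hypothetical, and the values are compared exactly on the cautious assumption that case and spacing matter.

```python
IDENTITY_FIELDS = ("organization", "site", "server", "service_account_sid")

def identity_mismatches(broken, rebuilt):
    """Return the identity fields whose values differ between the two servers."""
    # A missing field counts as a mismatch unless it is missing on both sides.
    return [field for field in IDENTITY_FIELDS
            if broken.get(field) != rebuilt.get(field)]
```

If you record the broken server's values as {"organization": "Contoso", "site": "Paris", ...} and the rebuilt server's values in the same shape, any field the function returns must be corrected before you restore the databases.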
Install the Exchange databases (online backup procedure). If you have an online backup tape, follow these steps to restore the Exchange cluster:
Using Cluster Administrator, bring the Exchange System Attendant online. Select System Attendant, right-click, and select Bring online.
Catalog the backup sets on your backup tape, and select the Directory Service (DS) and IS databases that you want to restore.
Select the options not to overwrite the DS and IS transaction logs and not to restart services after the restore has completed.
When the restore has completed, check that the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeDS\Restore in Progress Registry key is present on the active node. During an online restore, the backup program adds this key, which the process requires to recover from an online restore.
Reset the Registry checkpoint for the Microsoft Exchange Directory Service. Registry checkpointing is a procedure that keeps the Registry settings on both nodes consistent. When you change a Registry key that is replicated between nodes, that change is replicated to the inactive node.
Registry checkpointing is an important step in restoring an Exchange cluster. When you change a Registry key related to a cluster resource while the resource is offline, Cluster Administrator rolls those changes back out of the Registry before it starts the service. As part of this rollback, the cluster manager in cluster.exe deletes the Restore in Progress key that the restore process creates. If that happens, you must repeat the restore operation to recreate the key, which is a time-consuming process. To preserve Registry changes, you must reset the Registry checkpoint for the cluster resource by following these steps:
From Cluster Administrator, highlight the appropriate service resource.
Select File, Properties, then go to the Registry Replication tab, which Screen 2 shows.
Highlight the Registry key, and choose Modify.
From the resulting dialog box, select the whole key, copy it to the Paste buffer, and click Cancel to close the box.
Click Remove to remove this key.
Click Apply to apply this change. (This step is essential.)
To re-add the key, click Add, then paste into the input box the value you saved.
Click OK twice to close the dialog boxes.
(For more information about Registry checkpointing, see the Microsoft article "Registry Replication in Microsoft Cluster Server" at http://support.microsoft.com/support/kb/articles/q174/0/70.asp.)
Bring the DS online, and wait for the recovery steps to complete.
As you did for the DS, check that the IS restore operation has created the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\Restore in Progress Registry key, as Screen 3 shows.
Reset the Registry checkpoint for the Microsoft Exchange Information Store resource by following the procedure in step 5.
Bring the resource online and wait until the recovery steps have completed. The IS Service will restore the log files starting from the low number and ending with the high number in the key cited in step 7. Watch the NT application event log, which reports which log file the restore is committing to the store.
When the recovery is complete, restart all other Exchange services.
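The two checks for the Restore in Progress key in the steps above can be scripted rather than done by eye in a Registry editor. The sketch below assumes the key sits under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services followed by the service name (MSExchangeDS or MSExchangeIS); the function names are hypothetical, and the existence check runs only on Windows because it relies on Python's winreg module.

```python
SERVICES_ROOT = r"SYSTEM\CurrentControlSet\Services"

def restore_in_progress_subkey(service_name):
    """Build the HKEY_LOCAL_MACHINE subkey path for a service's Restore in Progress key."""
    return SERVICES_ROOT + "\\" + service_name + r"\Restore in Progress"

def restore_in_progress_exists(service_name):
    """Return True if the key exists on the local node (Windows only)."""
    # Imported here so that the path helper still works on other platforms.
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                            restore_in_progress_subkey(service_name)):
            return True
    except FileNotFoundError:
        # OpenKey raises this when the subkey is absent.
        return False
```

Run restore_in_progress_exists("MSExchangeDS") on the active node after the restore and before you bring the resource online; a False result suggests the key was never created or has already been rolled back.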
Install the Exchange databases (offline backup procedure). To restore the Exchange databases from an offline backup, follow these steps:
Delete all existing data in the DSADATA and MDBDATA directories.
Restore the offline version of dir.edb to the DSADATA folder and priv.edb and pub.edb to the MDBDATA folder. If you're putting logs and the database on separate volumes, you'll have two folders with the same name, so be sure you put the files in the database folder. Also, restore only the files you need to replace the files you're having problems with; for example, you don't want to overwrite a good priv.edb.
Using Cluster Administrator, bring the DS online.
Open a command window, and set the _CLUSTER_NETWORK_NAME_ environment variable to the network server name that Exchange uses. For example, if your cluster name is ExcClu1, you run the command
C:\> set _cluster_network_name_=excclu1
From the exchsrvr\bin directory, run the isinteg -patch command by executing the code
exchsrvr\bin> isinteg -patch
The isinteg -patch command ensures that new objects you create in the IS don't have the same GUID as other objects in your organization.
Reset the Registry checkpoint for the IS resource. (Step 5 in the previous section explains this procedure.)
Using Cluster Administrator, bring the IS back online.
Now you can restart all other Exchange services.
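The environment variable and the isinteg call in the offline procedure belong together: the variable has to be set in the same session that runs the tool. The following sketch prepares both for a scripted run. The function name is hypothetical, and the -patch spelling of the switch is an assumption about how the command is typed at the prompt.

```python
import os

def isinteg_patch_invocation(cluster_network_name, base_env=None):
    """Return (argv, env) for running isinteg against a clustered Exchange server."""
    # Copy the environment so the caller's own variables are left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # The offline procedure sets this variable to the cluster network name
    # before running isinteg.
    env["_CLUSTER_NETWORK_NAME_"] = cluster_network_name
    return ["isinteg", "-patch"], env
```

You could pass the result to subprocess.run(argv, env=env, cwd=...) with the working directory set to the server's exchsrvr\bin folder, which keeps the variable scoped to that one command.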
When services are running, perform the following steps to complete the recovery:
Check that the mailboxes' primary NT account is associated with the right NT account.
To be sure that no inconsistencies exist between the DS and the IS, perform a DS/IS consistency adjustment, choosing to correct both the private and the public store and all inconsistencies. Depending on the number of objects in the IS, the consistency adjustment might take a long time to complete.
Clear the active node setting you chose in step 6 of the "Install Exchange Server in a temporary configuration" section.
One-Node Cluster Recovery Procedure
To restore a one-node cluster, you need to rebuild a one-node MSCS, install Exchange in a temporary configuration, and perform the recovery steps you need to run Exchange Services. Then, while Exchange is running, you add and configure the second node in the MSCS.
Rebuild the first node, and install Exchange. To rebuild the MSCS, configure only one node of the cluster (while the second node is turned off). Follow the steps in the two-node cluster recovery section, but omit all software installation steps for the second cluster node.
To restore the Exchange database, follow the steps in the two-node cluster recovery section. When you complete the recovery steps, the first node is running Cluster Services, including all Exchange services, and the second node is turned off. You must be sure that the first node is running correctly because the cluster manager will replicate the first node's configuration to the second node that you're going to rebuild. To do so, the cluster manager uses cluster membership and replication of the Registry keys related to cluster-aware services.
In this configuration, check that you have only one possible owner of all cluster resources. The Registry replication settings are the same as in the two-node cluster configuration.
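The single-owner check is easy to get wrong by eye when the group contains many resources. Here is a minimal sketch, assuming you have copied each resource's Possible owners list from Cluster Administrator into a dictionary; the function name, the resource names, and the node names are hypothetical.

```python
def resources_not_pinned(possible_owners, active_node):
    """Return the resources whose possible-owner list is not exactly the active node."""
    expected = [active_node.lower()]
    # Node names are compared without regard to case.
    return sorted(name for name, owners in possible_owners.items()
                  if [owner.lower() for owner in owners] != expected)
```

With {"Exchange System Attendant": ["NODE1"], "Exchange Information Store": ["NODE1", "NODE2"]} and an active node of NODE1, the function returns the Information Store resource, which still lists the second node and needs to be corrected.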
Rebuild the inactive node. When the first node is running, you need to rebuild the second node. Here's what you do:
Reinstall NTS/E on the second node with the same NT configuration (i.e., NT domain) that you had before the disaster. The second server doesn't have the Clusterdisk driver installed because it doesn't have MSCS yet. This driver, which you install with the cluster software, regulates access to the disk shared between cluster nodes so that only one node can access a shared physical disk at a time. While you're rebuilding the second node and before you've installed MSCS, that node can access the shared disk even if the other node is performing I/O on the shared volumes, and it can seriously damage the logical configuration. If this access happens, you must run chkdsk /f on the active node to correct the problem. Therefore, when you're installing NT, you must choose the correct system drive (i.e., the local hard disk) on which to install the system files, without disturbing the shared disk with any kind of I/O.
The safest way to protect the shared drives in this special and temporary configuration is to disconnect the shared disk array's cables from the second node, if the hardware configuration permits. (The MSCS documentation on the second NTS/E CD-ROM provides more information.) If you can't disconnect the cables, turn off the active node and the shared disk array while you're installing NT and the appropriate service packs on this server.
Install the MSCS on this node (choose Join to existing cluster in the Cluster Setup program), and install the last NT service pack that was running on the active node.
Run Cluster Administrator, and choose a cluster resource (e.g., the Microsoft Exchange System Attendant resource) that belongs to the Exchange group resource. (Exchange must run in a resource pool called a group, which contains at least an IP address, a disk, and a network name.) Set the active node as the only possible owner for this resource. This setting prevents the cluster from failing over to the second node during the remaining steps; a failover at this time would be unsuccessful because you haven't installed Exchange services locally yet.
Install Exchange on the second node, selecting Update node in the Exchange Setup program.
Install the last Exchange service pack that you were using for the Exchange cluster server.
Remove the Possible owner setting you selected in step 3, and verify that all resources have both the first and the second node as possible owners.
Configure the failover/failback setting on the Exchange resource group.
If you use additional cluster services (e.g., antivirus, backup tools), install them on this node.
Your Last Resource
In some situations, you can't avoid rebuilding an Exchange cluster server, even if you carefully manage your cluster environment every day. These procedures will help you avoid errors in case of a disaster.