Using Database Repair for Disaster RecoveryUsing Database Repair for Disaster Recovery
Salvage data from corrupt databases
August 17, 2011
Every DBA will see corruption at some point in his or her career. Hopefully when you do, you'll have a comprehensive set of backups that let you perform a fast, targeted set of restores with minimal downtime and data loss. Alternatively, if the most important business requirement is to minimize downtime, you might fail over to a synchronously updated, redundant system.
But what if you can't do either? What if your backups are corrupt, your backups have been failing for months and no one noticed, or the piecemeal restore that you want to perform isn’t possible because you’re using the SIMPLE recovery model? There have been many articles and blog posts written about using backups to recover from corruptions (see my blog category Backup/Restore for lots of useful links), but very few about what to do when you don’t have backups or when your restore operations fail. In this article I’m going to walk you through an example of using the repair functionality built into DBCC CHECKDB, plus older backups, to attempt to salvage some data.
One thing to keep in mind is that there are an infinite number of scenarios involving database corruption, and at every stage in the repair process something could crop up that you haven’t seen before. It’s not possible to cover every eventuality in this article, but it will give a good grounding in the techniques used when performing a repair operation.
Scenario Setup
To practice these repair techniques, I created a simple example database that you can download. The database is called SQLMag2010 and simulates a simple sales database. I created the database, performed a full database backup, corrupted the database, and then performed another full database backup. The corruption is on one of the data pages in the sales table’s clustered index. Figure 1 shows the timeline of this scenario.
Figure 1: Timeline of a disaster scenario
The WinZip archive contains the two database backups. To set up the scenario, simply download and unzip the archive and restore the corrupt version of the database using the following code:
RESTORE DATABASE SQLMag2010FROM DISK = N'C:SQLskills CorruptSQLMag2010.bak'WITH REPLACE;GO
This backup restores without errors because the backup wasn't performed using the WITH CHECKSUM option. This option (introduced in SQL Server 2005) lets backup and restore operations test page checksums on pages in the database being backed up and restored, but without this option, no checks are performed. In production, it’s advisable to use this option when performing backups.
Also note that the database was created using SQL Server 2005, but you can restore it on SQL Server 2008 R2 and SQL Server 2008, and everything I describe in this article works the same in all three versions.
Using Database Repair
After you restore the database, run the following
query against the sales table to expose the corruption:
SELECT SUM (salesAmount)FROM SQLMag2010.dbo.salesWHERE salesAmount < $1.00;GO
The error that's returned is shown in Web Figure 1.
Web Figure 1
In the case of a real corruption in your database, you should have some kind of monitoring in place so that you’re alerted when a user or application hits a high-severity error similar to the one shown in Web Figure 1. For example, you can create a SQL Agent alert based on the error severity. (See "Easy Monitoring of High-Severity Errors: Create Agent Alerts," for more information about how to do so.) You don’t want to rely on users notifying you or having to periodically look through the SQL Server error log or Windows event log to see these errors. Also, the sooner you know that you have corruption, the more likely you can recover with minimal downtime and data loss.
Generally, the first thing to do when you encounter corruption is determine how much corruption you’re dealing with. Only one tool will do that for you: DBCC CHECKDB. You need to determine the extent of the corruption so that you can figure out what your options are in terms of restore or repair (or potentially just fail over and deal with the corruption offline).
In any disaster recovery situation, you want only information about the corruptions, and you want to know about all the corruptions, so instruct DBCC to provide all this information using the following command:
DBCC CHECKDB (SQLMag2010) WITH ALL_ ERRORMSGS, NO_INFOMSGS;GO
The results from this command are shown in Web Figure 2. (Note that I removed some of the output to save space in the article. When you run it, you’ll see the full output.)
Web Figure 2
From the errors, it looks like a single page is damaged. At this point, you might choose to examine the page using the DBCC PAGE command (which, although undocumented, is perfectly safe on production systems—I wrote that code, and I told customers the same thing when I worked at Microsoft). Use the following code to produce a dump of the whole page:
DBCC TRACEON (3604);DBCC PAGE (SQLMag2010, 1, 158, 2);GO
The trace flag is necessary to return the DBCC output to your current connection; otherwise, it goes to the error log. You can read more about using the DBCC PAGE command in many posts on my blog.
The best way to repair this corruption with minimal downtime and zero data loss is to perform a single-page piecemeal restore (introduced in SQL Server 2005), starting with the most recent full database backup and then using the most recent working differential backup and all required transaction log backups to bring that page up to the same time as the rest of the database.
In this contrived scenario, however, you don't have any backups other than a single full database backup that was done prior to many other changes (e.g., it’s last week’s backup). As a result, there’s no way to perform a piecemeal restore because you won’t be able to bring the older page version from that backup up to the same time as the rest of the database without transaction log backups. This means you’re going to have to use database repair. As long as the corruptions that DBCC CHECKDB discovered are repairable, the last line of output will tell you what repair option you need to use.
In this example, you’re going to need to use REPAIR_ALLOW_DATA_LOSS. This means that the corruption(s) can be repaired only by what I call “delete-what’s-broken-and-fix-up-all-the-structural-linkages.” When I wrote the repair code for SQL Server 2005, as a part of rewriting all of DBCC CHECKDB, I had to come up with the simplest, fastest, and 100 percent correct way of repairing the various types of corruptions. In most cases, this turned out to be to delete the corrupt structure in the database and then ensure that all the other structures in the database were corrected to take account of the deletion.
By specifying the REPAIR_ALLOW_DATA_LOSS option, you’re acknowledging that DBCC CHECKDB will have to delete some data to perform the repair. There’s no point trying to use REPAIR_FAST or REPAIR_REBUILD if the DBCC CHECKDB output says you need REPAIR_ALLOW_DATA_LOSS—it won’t be able to fix the corruptions if you specify a lower repair level.
At this point, you have no choice but to perform a repair operation. The following are the steps you should take to perform this operation:
Back up the corrupt database so that you have a fall-back option in case anything goes wrong. In this scenario we’ve already got that backup, but in production people often forget to back up the database at this stage just in case.
Put the database into single-user mode (a requirement for using any repair option with DBCC CHECKDB).
Run the repair operation.
Verify that the corruption has been completely repaired by running DBCC CHECKDB again to ensure there are no further corruption messages.
Take any further recovery steps to validate the data in the database. These steps are discussed in more detail later in the article.
Once you’re satisfied with the state of the database, return it to multiuser mode and let the workload resume.
You might find there are some corruptions that DBCC CHECKDB can’t repair. (For more information, see "CHECKDB From Every Angle: Can CHECKDB repair everything?"). You can tell if this is the case by examining the end of the DBCC CHECKDB output: If the stated number of corruptions fixed is equal to the stated number of corruptions found, then DBCC CHECKDB repaired all the errors. If you find there are some corruptions that can’t be repaired, it might be possible to manually repair them, but the details are beyond the scope of this article. See my blog category Repair at SQLskills.com for examples.
Step 4 is necessary because there might be more corruptions that DBCC CHECKDB couldn’t previously find. This can happen if a particular corruption prevents DBCC CHECKDB from deeply examining a portion of the database—once the corruption is repaired, a subsequent deeper examination might further reveal corruptions that must be repaired. I call this phenomenon corruption masking.
For our scenario, the following is the code used to perform the repair operation:
ALTER DATABASE SQLMag2010 SET SINGLE_USER;GODBCC CHECKDB (SQLMag2010, REPAIR_ALLOW_DATA_LOSS)WITH ALL_ERRORMSGS, NO_INFOMSGS;GO
And this produces the output shown in Web Figure 3 (again, curtailed slightly to save space). From the output in Web Figure 3, you can see that DBCC CHECKDB fixed all the corruptions it knew about, and if you run DBCC CHECKDB again, you’ll see that there were no masked corruptions.
Web Figure 3
Step 5 in the list is to perform any further recovery actions because after running any REPAIR_ALLOW_DATA_LOSS operation, you need to be aware that some data was deleted. There are three substeps you might need to take at this point:
Check the consistency of any foreign key relationships that involved the table(s) affected by the repair operation. DBCC CHECKDB doesn’t take constraints into account when performing repairs. You can use the DBCC CHECKCONSTRAINTS command to check your defined foreign key relationships. You’ll also want to verify the logic of your triggers to see if any additional modifications should be made. In general, you need to check all of your business logic surrounding the data that was deleted.
Reinitialize any replication subscribers that were subscribing to publications from the table(s) affected by the repair operation. All repairs are fully logged, but are specifically marked not for replication because of the physical nature of the operations being performed, so the transactional replication log reader job won’t replicate them. Any data deleted (or rows inserted into nonclustered indexes) by the repair won’t cause the same actions to occur on the replication subscribers, so they must be reinitialized. Similarly, merge replication triggers won’t fire because of repair operations, so merge replication subscribers must be reinitialized.
Determine which data was deleted by the repair operations, if any.
Depending on the complexity of the database being repaired, it might be possible to retrieve some data from an older backup, but you need to know what has been lost. DBCC CHECKDB doesn’t tell you which data was deleted (mostly because the deleted data was on corrupt database pages that couldn’t be processed), so you’ll need some way of comparing the postrepair data with a known good copy of the prerepair data. The tablediff tool that ships with SQL Server can do this for each table (see MSDN's tablediff Utility web page for more information).
You might also be able to programmatically determine what has been deleted. For instance, if the table that was repaired has an identity column, and table rows aren’t deleted, then any missing block of rows should represent those deleted by the repair operation. It just so happens that this is exactly the case in our example scenario.
To identify the missing block of rows in this example, you can use the code in Listing 1. You can then restore an older backup and determine if any of the missing rows are present in the restored, older copy of the database. If so, you can make a business decision to copy them into the production database using an INSERT SELECT statement, being sure to bear in mind, of course, that there’s no guarantee that the salvaged rows from the older database are up-to-date because additional modifications could have occurred against these rows after the backup was performed. However, at least you might be able to recover some of the deleted data and have a base on which to investigate what other changes might have occurred.
-- Start of the missing range is when a row does not have a plus-1 neighbor.SELECT MIN (A.salesID + 1)FROM SQLMag2010.dbo.sales AS AWHERE NOT EXISTS ( SELECT B.salesID FROM SQLMag2010.dbo.sales AS B WHERE B.salesID = A.salesID + 1);GO-- End of the missing range is when a row does not have a minus-1 neighborSELECT MAX (A.salesID - 1)FROM SQLMag2010.dbo.sales AS AWHERE NOT EXISTS ( SELECT B.salesID FROM SQLMag2010.dbo.sales AS B WHERE B.salesID = A.salesID - 1);GO
The code in Listing 2 can be used to perform the data copy into the production database (of our example), replacing **MIN** and **MAX** with the values found from running the code in Listing 2.
SET IDENTITY_INSERT SQLMag2010.dbo.sales ON;SET NOCOUNT OFF;GOINSERT INTO SQLMag2010.dbo.sales (salesID, customerID, salesDate, salesAmount)SELECT copy.*FROM SQLMag2010Copy.dbo.sales AS copyWHERE copy.salesID > **MIN** - 1 AND copy.salesID < **MAX** + 1;GOSET IDENTITY_INSERT SQLMag2010.dbo.sales OFF;GO
As you can see, using database repair is by no means as straightforward as using a sequence of RESTORE operations, but it’s not a difficult task, either—as long as you’ve practiced in advance.
Preparing for a Disaster
It's critical to have a comprehensive SQL Server disaster recovery strategy and to have it integrated into the business-wide IT disaster recovery strategy. It’s also critical that the strategy encompasses as many eventualities as possible and that it’s tested as often as possible. All easier said than done—see "Importance of having a good disaster recovery plan” and “Importance of testing your disaster recovery plan."
At the very least, you should practice performing common restore operations regularly so that you aren’t struggling with the syntax of the RESTORE command during a real disaster and you have some level of confidence that your backups do actually restore. But to be really prepared for a disaster, and to be able to stay calm when backups don’t work, practice dealing with corrupt databases using tools other than RESTORE. It will do wonders for your confidence level and could end up helping to save you and your business from a lot of problems when your first line of defense against disaster fails.
Over the past few years I’ve created a collection of corrupt databases you can use to practice performing repair operations. See my blog post "Conference corruption demo scripts and example corrupt databases" for more details. Lastly, if you ever find yourself having to use database repair, make sure that when the dust settles and the crisis is over, you go back and revisit your backup strategy to work out why it failed in the first place. As always, if you have any questions, drop me a line at [email protected].
About the Author
You May Also Like