Don’t Confuse High Availability with Disaster Recovery
June 29, 2011
I recently spoke at Windows IT Pro’s Keeping Mission-Critical Applications Available online event.
My session, High Availability Isn’t Disaster Recovery, covered enough topics and concepts that I wanted to follow up with a blog post of additional resources and links for organizations and IT pros who are managing or contemplating HA solutions, to help ensure they’re not exposed to common flaws and misconceptions about how HA compares to DR.
If you didn’t have the opportunity to attend any of the sessions in this event, they’ll be available (free) for on-demand viewing for roughly three months, starting roughly the day after the event wraps up.
Otherwise, what follows is a set of additional resources, links, and content that anyone who is dealing with HA solutions should skim.
The Importance of RPOs and RTOs
In my presentation I stressed the importance of RPOs and RTOs – something I’ve already blogged about in my post on SQL Server Recovery Time Objectives and Recovery Point Objectives.
Stated simply, if you haven’t defined RPOs and RTOs, then even your best efforts in responding to a disaster are still likely to meet with failure, because management has probably been working under the (crazy) assumption that everything was protected and that there wasn’t really much opportunity for down-time or data loss. As such, RPOs and RTOs serve as a ‘reality check’ for management who might not give much thought to the true cost and scope of data loss and down-time, while also expressly quantifying acceptable levels of data loss and down-time per system, database, and so on.
Or, in even simpler terms: RPOs and RTOs are essential not only as a CYA mechanism, but as a means for helping define and set expectations – something that’s essential to tackle any time we’re talking about the potential for disaster.
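To make that more concrete, here’s a minimal sketch (with a purely hypothetical table name, columns, and values) of one way to record agreed-upon RPOs and RTOs per database so that they’re documented somewhere queryable instead of buried in an email thread:

```sql
-- Hypothetical example: a simple, queryable place to record agreed-upon
-- RPO/RTO targets. Table name, columns, and values are illustrative only.
CREATE TABLE dbo.RecoveryObjectives (
    DatabaseName sysname       NOT NULL PRIMARY KEY,
    RPO_Minutes  int           NOT NULL, -- maximum acceptable data loss
    RTO_Minutes  int           NOT NULL, -- maximum acceptable down-time
    SignedOffBy  nvarchar(128) NOT NULL, -- who in management agreed to these numbers
    LastReviewed date          NOT NULL
);

INSERT INTO dbo.RecoveryObjectives
VALUES (N'SalesDB',     15,   60,  N'CFO',        '2011-06-01'),
       (N'ReportingDB', 1440, 480, N'BI Manager', '2011-06-01');
```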
Confusing HA with Disaster Recovery
Another common problem I run into as a SQL Server consultant is IT pros who mistakenly assume that because they’ve implemented vast (and expensive) High Availability solutions to ensure the redundancy and fault-tolerance of their systems and data, they’re protected from disaster.
I’ve written on this before – for SQL Server Magazine: Confusing High Availability with Disaster Preparedness.
As such, a large portion of my presentation was focused on demonstrating ways in which disasters can ‘defeat’ High Availability solutions. Specifically, it looked at how two kinds of corruption, logical corruption and physical corruption, can defeat many HA solutions.
Logical Corruption
Simply stated, logical corruption (something that I also like to call ‘soft’ corruption, because it’s caused either by software glitches or bugs in your applications or by soft-skinned air-breathers (or ‘meatbags’)) is something like DROPping or TRUNCATEing a table, or running an UPDATE against a table without a WHERE clause; it’s anything that quickly renders your data BAD. The problem with HA solutions is that they don’t care about the state of your data; they’re simply tasked with making it available and redundant. But highly available, redundant data that’s bad data is still busted. In other words, it’s still a disaster.
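As a trivial (and purely hypothetical) illustration of how easily logical corruption happens, and how little an HA solution can do about it once it’s been faithfully replicated everywhere:

```sql
-- A classic bit of 'soft' corruption: an UPDATE with no WHERE clause.
-- Table, column, and key values are hypothetical.
UPDATE dbo.Orders
SET OrderStatus = 'Cancelled';  -- oops: meant to add WHERE OrderID = 12345

-- Mirroring, clustering, and friends will happily keep this bad data highly
-- available on every redundant copy. One cheap habit that helps: wrap ad-hoc
-- changes in an explicit transaction and sanity-check the row count first.
BEGIN TRANSACTION;
    UPDATE dbo.Orders
    SET OrderStatus = 'Cancelled'
    WHERE OrderID = 12345;
    SELECT @@ROWCOUNT AS RowsAffected;  -- expect 1; if not, ROLLBACK
-- COMMIT TRANSACTION;  (or ROLLBACK TRANSACTION;)
```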
Physical Corruption
Unlike logical corruption, which is almost always caused by software or humans, physical corruption happens at the IO subsystem level: SQL Server asks the OS to write data (for your data files OR even for your log file) to disk, and the disk, controllers, drivers, or something else mangles that request and leaves gibberish on disk instead of well-persisted zeroes and ones.
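SQL Server can at least be configured to catch this kind of mangling when pages are read back. As a rough sketch (database name is hypothetical), it’s worth verifying that page checksums are enabled and keeping an eye on the suspect_pages table in msdb:

```sql
-- Verify that page checksums are enabled so SQL Server can detect pages that
-- were mangled by the IO subsystem when they're read back in. (CHECKSUM is the
-- default for databases created on SQL Server 2005+, but databases upgraded
-- from older versions may still be using TORN_PAGE_DETECTION or NONE.)
SELECT name, page_verify_option_desc
FROM sys.databases;

ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;  -- hypothetical name

-- Pages that fail a checksum (or throw 823/824 errors) get recorded here:
SELECT * FROM msdb.dbo.suspect_pages;
```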
Third Party Log Reader Agents and Dealing with Logical Corruption
As I pointed out in my presentation, dealing with logical corruption can be a SERIOUSLY ugly task. High Availability solutions just don’t help with most logical corruption, unless you’re willing to potentially sacrifice a substantial amount of data in many cases.
I touched upon some of the ugliness involved in using backups (creating a copy of your non-disaster’d database and trying to copy values out of that copy back into production, an approach that’s tedious and error prone), along with an overview of how third-party log reader agents work, a while back in a competitive review article for SQL Server Magazine.
Unfortunately, the problem with third-party log reader agents (tools that ‘magically’ read SQL Server log files and give you the ability to generate scripts that roll back unwanted operations while ensuring that desired operations are rolled forward) is that they’re terribly hard to create.
Kalen Delaney touched upon this a bit in a post for SQL Server Magazine where she asked Can the Transaction Log Tell us What Happened? She then followed up on this on her own blog, where comments from other DBAs and SQL gurus helped point out just how hard it is to actually READ the log. And that’s because, for all intents and purposes, SQL Server log files are ‘proprietary’ and ‘internal’ storage mechanisms that are subject to change.
As such, vendors who attempt to offer the ability to read log files and offer DEPENDABLE information from them have to spend a decent amount of time working closely with Microsoft to stay abreast of changes and ensure that their applications work correctly. This, in turn, incurs some serious costs and overhead for products that everyone should buy – but which (strangely) not everyone seems to purchase or deploy.
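If you want a sense of why reading the log dependably is so hard, you can poke at it yourself with the undocumented (and unsupported) fn_dblog function; this is just an exploratory sketch, and the output is definitely not something you’d want to hand-build a recovery script from:

```sql
-- Undocumented and unsupported: fn_dblog exposes raw log records for the
-- current database. Fine for poking around, but the format can change between
-- versions and builds - which is exactly why dependable log readers are so
-- hard to build and maintain.
SELECT TOP (100)
       [Current LSN], Operation, Context, [Transaction ID], AllocUnitName
FROM sys.fn_dblog(NULL, NULL)  -- NULL, NULL = the full active log
WHERE Operation IN (N'LOP_DELETE_ROWS', N'LOP_MODIFY_ROW');
```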
At any rate, to my knowledge there are only TWO vendors today that provide log reader agents/solutions. I’ve used both and love them both, even though they come with different price-points (because those different price-points are indicative of different features, options, etc.). And those solutions are:
ApexSQL Log – which offers a bevy of features (and I recommend that you take a look at ApexSQL Recovery Studio as well – as it also includes ApexSQL Recover) and
Quest’s Toad for SQL Server – which actually bundles a log-reader agent into this SSMS replacement as an effective ‘freebie’ that makes Toad for SQL Server a phenomenal value (as I pointed out in my recent review).
Dealing with Physical Corruption
In my presentation, I also covered techniques for addressing physical corruption. Effectively, handling these kinds of problems in HA environments takes a couple of interesting turns or twists.
Depending upon which HA solution you’re using, it’s possible that SQL Server will automatically detect physical corruption and transparently attempt to correct it. For example, because database mirroring keeps a full-blown, non-corrupt copy of the data on the mirror, when corruption is encountered on the principal it’s possible (in some cases, as there are caveats) for mirroring to simply ‘ask’ the mirror for a clean copy of the corrupted page(s), which can then be swapped in without missing a beat. Similar capabilities exist with SQL Server Denali’s ‘Always On’ features as well. For more info on how this works with mirroring (in SQL Server 2008 and above), along with LIMITATIONS and caveats, make sure to see Books Online for Automatic Page Repair During a Database Mirroring Session.
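If you’re using mirroring on SQL Server 2008 or later, a quick way to see whether any of these automatic page-repair attempts have occurred is the corresponding DMV (a simple sketch; see Books Online for the column details):

```sql
-- Page-repair attempts made by database mirroring (SQL Server 2008 and above).
SELECT *
FROM sys.dm_db_mirroring_auto_page_repair;

-- In Denali's 'Always On' availability groups, the equivalent view is
-- sys.dm_hadr_auto_page_repair.
```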
In cases where there’s no ‘magical fix’ (or even when there is one and it runs into limitations/constraints), the most important technique for addressing physical corruption is early detection, because corruption that goes undetected only becomes more ensconced, to the point where it becomes MUCH harder to remove.
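In practice, early detection mostly boils down to running DBCC CHECKDB on a regular schedule (and actually reviewing the output); a minimal sketch against a hypothetical database:

```sql
-- Run consistency checks on a regular schedule (SQL Agent job, maintenance
-- plan, etc.) so corruption is caught while you still have clean backups
-- to repair it from. Database name is hypothetical.
DBCC CHECKDB (YourDatabase) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- On tight maintenance windows, a lighter-weight compromise:
-- DBCC CHECKDB (YourDatabase) WITH PHYSICAL_ONLY, NO_INFOMSGS;
```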
And, in cases where you’re not automatically protected, you then need to KNOW how to deal with corruption when it occurs, meaning that you need to know how to run RESTORE operations with the PAGE option when deemed necessary. Likewise, you need to know how to assess the impact of physical corruption when it manifests (in many cases it can be CLEANLY and EASILY addressed without ANY down-time, but you have to know when this is the case).
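For reference, the general shape of a page-level restore looks something like the following sketch; database, page, and backup file names are hypothetical, and Books Online covers the full requirements, ordering, and caveats:

```sql
-- 1. Identify the damaged page(s) - from 823/824 error messages, DBCC CHECKDB
--    output, or msdb.dbo.suspect_pages - and restore just those pages from a
--    clean full backup.
RESTORE DATABASE YourDatabase
    PAGE = '1:4567'  -- file_id:page_id of the damaged page
    FROM DISK = N'X:\Backups\YourDatabase_Full.bak'
    WITH NORECOVERY;

-- 2. Restore any subsequent log backups WITH NORECOVERY, then back up and
--    restore the tail of the log to bring the repaired page(s) current.
BACKUP LOG YourDatabase TO DISK = N'X:\Backups\YourDatabase_Tail.trn';
RESTORE LOG YourDatabase
    FROM DISK = N'X:\Backups\YourDatabase_Tail.trn'
    WITH RECOVERY;
```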
To learn more about detecting and dealing with physical corruption, I recommend that you check out my free presentation on SQL Server Disaster Recovery and Prevention (which should be available until roughly July 25th), where I’ve outlined best practices for detecting, assessing, and addressing physical corruption. Likewise, I’ve also made a blog post available that covers SQL Server Disaster Recovery and Prevention Resources.
Best Practices
Finally, as part of my presentation I ‘harped’ upon common best practices that I reiterate over and over again to clients and during presentations. Specifically, I touched upon the fact that many HA solutions or initiatives (as well as DR plans) tend to forget about the importance of environment, meaning that keeping an application up requires more than just ensuring that you have your data and systems to run it on. Or, more specifically, if you haven’t addressed issues with connection strings, permissions, certificates, paths, firewall ports, and other details prior to initiating an HA failover or a DR recovery operation, then you have no idea whether things will work as needed.
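Permissions are a classic example of this: server-level logins don’t travel with a database, so after a failover you can easily end up with ‘orphaned’ database users even though all of your data arrived intact. A quick check along these lines (a sketch, run in the failed-over database) is worth baking into failover run-books:

```sql
-- Database users whose SID has no matching login on this server become
-- 'orphaned' after a failover, and applications start failing authorization
-- even though the data is fine. Run this in the failed-over database.
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp ON dp.sid = sp.sid
WHERE dp.type = 'S'  -- SQL-authenticated users
  AND sp.sid IS NULL
  AND dp.name NOT IN (N'guest', N'dbo', N'INFORMATION_SCHEMA', N'sys');

-- The older, simpler report for SQL-authenticated users:
EXEC sp_change_users_login @Action = 'Report';
```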
As such, I touched upon the need to not only test HA/DR solutions up front for initial validation, but to regularly re-test these plans/operations as well. Otherwise you run the VERY serious risk that a subtle environment change somewhere has busted your HA or DR solution in a way that will be unpleasant to deal with in the case of a true disaster.
Likewise, as I regularly point out, if you’re REGULARLY testing HA failover and DR plans, then you’re REGULARLY building familiarity and skill with the commands, operations, and tasks necessary to recover from a disaster – something that’s worth its weight in gold when an actual disaster strikes (because there’s no worse time to learn how to restore a database, for example, than when your database is down, end-users are ticked, and your manager is standing over your shoulder).
And, of course, in order to make sure that Disaster Recovery operations go as smoothly as possible in both regular testing and disaster scenarios, you need to clearly document your HA and DR plans as granularly and succinctly as possible.
For more information on these topics, I highly recommend Merrill Aldrich’s (blog | twitter) post on All Those Extra Bits We Need for DR, which touches upon environmental and other concerns.