Using the Cloud for Disaster Recovery
If your IT infrastructure goes down, barely a company on earth could keep running, and reliance on IT is only increasing. The cloud can offer a solution.
March 14, 2017
What’s the most important part of a company today? The sales team? Management? Developers? Who knows, but the reality is that no one is irreplaceable, and while there could be short-term pain if a key team member left, they can be replaced. If your IT infrastructure goes down, however, there is barely a company on earth that could keep running, and companies are only increasing their reliance on IT. It is for this reason that every company either has Disaster Recovery (DR) processes in place or is looking at enabling DR. But DR is expensive: datacenters, connectivity, servers, cooling, licenses, power and much more, all for something that you hope is never used. That’s an expensive seat belt! What would be great is some kind of service where you only pay if you actually use it; you know, something consumption based…
All jokes aside, it is because most cloud services are consumption based that they are so attractive for DR purposes. There are certain ongoing costs, such as storage, licensing and perhaps even some minimal compute (VMs) for key workloads, but typically you only pay for the majority of the workloads if you actually have to perform a failover, either live or as a test.
I’m most familiar with the Microsoft cloud services, so I want to look at how they could be used for typical environments. A key point is not to pick one all-encompassing solution; rather, pick the right solution for each workload and then look at how to automate the DR failover process. Before looking at DR to the cloud, first consider how DR would be performed on-premises between locations.
Firstly, a discovery is performed to identify the various systems that are critical to the business, what those systems depend upon (as those dependencies have to be equally protected) and the various availability requirements, including how long each system can be down (the Recovery Time Objective, or RTO) and how much data can be lost (the Recovery Point Objective, or RPO). It’s important to have realistic numbers for availability. The first inclination may be “it has to be available in 30 seconds and I can’t lose any data,” but the reality could be that achieving that costs $500k a year, whereas a 15-minute downtime and 5 minutes of data loss would only cost the company $5k in lost revenue. Not a great investment.
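To make that tradeoff concrete, a quick back-of-the-envelope comparison is enough. The sketch below just uses the illustrative figures above and assumes one disaster per year.

```powershell
# Back-of-the-envelope DR cost comparison using the illustrative figures above.
# The one-disaster-per-year frequency is an assumption for the example.
$aggressiveDrCostPerYear = 500000   # near-zero RTO/RPO solution
$relaxedLossPerDisaster  = 5000     # 15 minutes down, 5 minutes of data lost
$disastersPerYear        = 1

$expectedAnnualLoss = $relaxedLossPerDisaster * $disastersPerYear
"The aggressive solution costs $($aggressiveDrCostPerYear - $expectedAnnualLoss) more per year than the loss it avoids."
```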
There are numerous methods to protect workloads, listed here in a rough order of preference:
1. Application-level replication, such as SQL AlwaysOn, domain controller multi-master replication, Exchange Database Availability Groups and so on. With these technologies, you have full visibility into the state of replication, control of the replication, and the applications are aware of any failovers.
2. In-guest replication or hypervisor-level replication, which may optionally hook into VSS to give periodic application-consistent recovery points (point-in-time instances of the protected workload that can be restored to instead of the most recent state).
3. Storage-level replication.
4. Restoring from a backup.
There is also another option that is often overlooked, and where it falls in the order of preference really depends on the workload. Some workloads have no state. Consider a farm of 50 IIS web servers: they just serve requests, and the actual data is stored in a database. I don’t care about the state of the web servers. Therefore, rather than replicate them, perhaps I have a master image of an IIS server or a PowerShell DSC configuration, and in the event of a disaster I just quickly create a new 50-server farm from the template. This avoids any replication and gets me up and running as quickly as I can trigger some automation to create 50 new instances.
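As an illustration, a minimal DSC configuration for such a stateless web node could look like the sketch below; the configuration name and feature list are placeholders for whatever a real farm would need.

```powershell
# A minimal sketch of a DSC configuration for a stateless IIS node.
# The configuration name and feature set are illustrative; a real farm
# would also lay down site content, certificates and so on.
Configuration StatelessWebNode {
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node 'localhost' {
        WindowsFeature IIS {
            Name   = 'Web-Server'
            Ensure = 'Present'
        }
        WindowsFeature AspNet45 {
            Name   = 'Web-Asp-Net45'
            Ensure = 'Present'
        }
    }
}

# Compile to a MOF that each newly created instance can apply
StatelessWebNode -OutputPath .\StatelessWebNode
```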
Additionally, the costs of these approaches are very different, which is magnified when using the consumption-based cloud as the target. Sure, application-level replication is best, but it requires an application instance to be running at the target to actually receive the data, whereas other replication technologies may only replicate the instance’s storage.
Now, with the cloud there are different types of service available. There are VMs through Infrastructure as a Service (IaaS) and platform services through Platform as a Service (PaaS); however, PaaS would require the applications to be written for it, which likely is not the case for many organizations today but will become more prevalent with Azure Stack and similar offerings. Then there is Software as a Service (SaaS), such as Office 365. Adopting something like Office 365 in the middle of a disaster is not practical; moving to Office 365 in advance, however, removes that service as a concern during a disaster. If Exchange, SharePoint and so on are moved to the cloud, then during a disaster you not only don’t have to worry about those services, but they also become vital communication services that can be relied upon during the DR process.
Now, let’s focus on IaaS to provide VMs in the cloud. The same options used on-premises apply to the cloud, but the cost differences are amplified. For application-level replication, a VM must be running in Azure to receive the replicated data, which means incurring compute charges and any associated storage, but it gives the best fidelity and control over the replication. This is a good option for business-critical, tier-1 databases where the additional cost is justified by complete control of and visibility into the replication, the fastest failover and the lowest data loss.
Next, replication technologies can be leveraged. With Azure this means Azure Site Recovery (ASR), which provides hypervisor-level replication for Hyper-V VMs and in-guest replication for ESX VMs and physical systems. To use this technology an ASR license is required, charged per protected OS instance per month (it can also be purchased as part of Operations Management Suite), plus the storage; however, there are no compute charges unless you actually perform a failover, at which point compute charges are incurred until the Azure-based resources are deprovisioned.
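Once workloads are protected, the replication state can be checked from PowerShell. The sketch below assumes the AzureRM Site Recovery cmdlets and a vault named ContosoVault, which is a hypothetical name.

```powershell
# A minimal sketch: list the protected items in an ASR vault and their replication health.
# 'ContosoVault' is a hypothetical vault name.
$vault = Get-AzureRmRecoveryServicesVault -Name 'ContosoVault'
Set-AzureRmRecoveryServicesAsrVaultContext -Vault $vault

$fabric    = Get-AzureRmRecoveryServicesAsrFabric | Select-Object -First 1
$container = Get-AzureRmRecoveryServicesAsrProtectionContainer -Fabric $fabric | Select-Object -First 1

Get-AzureRmRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $container |
    Select-Object FriendlyName, ProtectionState, ReplicationHealth
```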
Storage-level replication could be utilized through devices such as StorSimple, an appliance that uses Azure as an additional storage tier for the least-accessed data, plus storage for complete snapshots of all content. In a disaster, a virtual StorSimple appliance can be fired up in Azure to expose the data. The ongoing costs are simply the storage in Azure.
Then there is the option of just spinning up VMs from a template where state is not a concern, which is where Azure Resource Manager templates and VM Scale Sets excel. In 5 minutes I can spin up a 100-node IIS farm with no prior costs other than the storage to host the single template image, as the sketch below shows.
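Assuming a pre-authored ARM template that defines a VM scale set built from the master IIS image (the resource group, file and parameter names below are hypothetical), recreating the farm is a single deployment:

```powershell
# A minimal sketch: deploy an ARM template that defines a VM scale set built from
# the master IIS image. Resource group, template file and parameter names are hypothetical.
New-AzureRmResourceGroup -Name 'DR-WebFarm-RG' -Location 'West US 2'

New-AzureRmResourceGroupDeployment `
    -ResourceGroupName 'DR-WebFarm-RG' `
    -TemplateFile '.\webfarm-vmss.json' `
    -TemplateParameterObject @{ instanceCount = 100; sourceImageId = '<resource ID of the master image>' }
```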
That is a lot of options, and right now it may seem daunting: I’m using various technologies, so in an actual disaster my recovery plan would be complex, with lots of steps across many different systems. This is where the Azure Site Recovery recovery plan object helps. In advance, a recovery plan is created that specifies the order of VM failover; PowerShell can be called via Azure Automation; integration with SQL AlwaysOn is available; and it’s even possible to add a manual wait step if I need to go pull some big lever during a real disaster. Building the plan takes time, but during the actual disaster a single action triggers the orchestration of the complete failover (and the failback later on), which is critical during a stressful true disaster where humans can’t be expected to run through pages of manual steps. The same technology is used to test failover, with the DR instances spun up on an isolated network so they don’t interfere with production.
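To give a sense of how little is left to do at failover time, here is a minimal sketch of triggering a test failover of a recovery plan into an isolated virtual network; the vault, plan and network names are hypothetical and the AzureRM Site Recovery cmdlets are assumed.

```powershell
# A minimal sketch: run a test failover of an ASR recovery plan into an isolated VNet.
# Vault, plan and network names are hypothetical.
$vault = Get-AzureRmRecoveryServicesVault -Name 'ContosoVault'
Set-AzureRmRecoveryServicesAsrVaultContext -Vault $vault

$plan         = Get-AzureRmRecoveryServicesAsrRecoveryPlan -Name 'Tier1-DR-Plan'
$isolatedVnet = Get-AzureRmVirtualNetwork -Name 'DR-Test-VNet' -ResourceGroupName 'DR-Network-RG'

$job = Start-AzureRmRecoveryServicesAsrTestFailoverJob `
    -RecoveryPlan $plan `
    -Direction PrimaryToRecovery `
    -AzureVMNetworkId $isolatedVnet.Id

# Monitor the orchestration until it completes
Get-AzureRmRecoveryServicesAsrJob -Job $job
```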
There is a lot to think about and a lot more options, including monitoring hybrid environments, managing images, when to use PaaS, containers, and how users will actually get to services during a disaster. The list goes on. The good news is we’ll be covering all of this at IT/Dev Connections 2017, so I hope to see you there!
So what are you waiting for? Join me at IT/Dev Connections 2017! This year I am managing the Cloud and Datacenter track as the Track Chair. To see what I have planned for this track, read: ITDC 2017: What to Expect from the Cloud and Datacenter Track.
IT/Dev Connections runs October 23-26, 2017, in San Francisco.