Insight and analysis on the information technology space from industry thought leaders.

Disaster Recovery Strategies for a Disaster-Prone World

Downtime is costly, but a well-planned high availability (HA) and disaster recovery (DR) strategy can minimize disruptions.

Industry Perspectives

February 21, 2025

By Philip Merry, SIOS Technology Corp.

High availability (HA) is the ability to shift an application, workload, or service to one or more secondary servers and quickly resume operation after a fault, failure, or disaster. Disaster recovery (DR) places secondary servers at a geographic distance to protect against sitewide, regional, or cloud availability zone disasters. Efficient, reliable HA/DR is vital for maintaining uptime and continuity of services because the alternative, downtime, is expensive.

Downtime Is Costly

According to a 2024 annual survey by Information Technology Intelligence Consulting, 91% of medium and large enterprises say the average cost of an hour of downtime can reach $300,000 in lost revenue, lost or diminished productivity, penalties associated with SLA violations, and costs associated with response and remediation. For some organizations, the same hour of downtime can cost as much as $4 million.

In addition, there are soft costs associated with increased customer churn due to loss of brand confidence, increased customer acquisition costs, reputation management, and the toll incident response takes on the people called on when things go wrong. Murphy's Law assures us that things that can go wrong will go wrong and usually at the most inconvenient times — even for those organizations that have chosen to outsource their IT infrastructure and other business-critical services. Just ask customers of CrowdStrike, Microsoft Azure, Atlassian, AWS, Rogers Communications … you get the idea.

Have a Plan

HA/DR should be an integral part of enterprise infrastructure from the outset, not an afterthought. That is why implementation must be an intentional pursuit, and to get there, you need a plan.

When you begin building a plan for achieving HA/DR, the first step is to set specific, measurable goals and define quantifiable targets like the desired level of uptime and allowable annual downtime.

Your organization's risk tolerance will dictate actual thresholds, but for this discussion let's define HA/DR as delivering a minimum of 99.99% (four nines) of uptime. That's the equivalent of roughly 52.6 minutes of downtime per year. You should also set clear recovery time objectives (RTO) for how fast you need to restore operations. Lastly, consider your optimal recovery point objective (RPO), which measures how much data loss you can tolerate in a downtime incident.
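To make the uptime arithmetic concrete, here is a minimal sketch that converts an uptime target into a yearly downtime budget. It assumes a 365-day year, and the function names are illustrative rather than part of any product. The cost helper simply multiplies outage length by an hourly figure that you supply.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def downtime_cost(minutes: float, hourly_cost: float) -> float:
    """Estimate the cost of an outage of the given length (illustrative only)."""
    return minutes / 60 * hourly_cost


if __name__ == "__main__":
    # Compare three common uptime targets.
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {budget:.2f} minutes of downtime per year")
```

Under these assumptions, four nines leaves a budget of about 52.56 minutes per year. Each additional nine cuts the budget by a factor of ten, which is why the target you choose drives so much of the cost and design of an HA/DR plan.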

Establishing defined targets benefits you by allowing you to prioritize resource allocation for protecting the systems and services on which your business-critical operations depend and avoid the trap of attempting to protect all assets equally (leaving high-value systems disproportionately vulnerable).

Identification of operational priorities also helps you recognize and understand the role of software in your operational environment, including which applications need to be running, how software can help protect and maintain operations, how software is affected in the event of a disaster, and the processes required to maintain certain software products.

Leverage Automation

Another benefit comes from understanding where automation can complement or replace human intervention. Looking at your organization's current disaster response plan, ask yourself what roles and responsibilities your people have and whether analytics can support them. Can some of those tasks be handled automatically, freeing that person to do something more important to the process?
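As one illustration of a task that can be handed to automation, the sketch below decides when to start a failover based on consecutive failed health checks, instead of waiting for a person to notice an outage. The class name, the threshold of three, and the idea of feeding it boolean check results are all assumptions made for this example. A real deployment would rely on the monitoring built into its clustering software.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Signal a failover after a run of consecutive failed health checks."""

    failure_threshold: int = 3  # hypothetical default; tune per system
    consecutive_failures: int = 0
    failed_over: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when failover should start."""
        if self.failed_over:
            return False  # already failed over, nothing more to trigger
        if healthy:
            self.consecutive_failures = 0  # a good check resets the streak
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


# Simulated check results: one brief blip, then a sustained failure.
checks = [True, False, False, True, False, False, False]
monitor = FailoverMonitor()
triggered_at = [i for i, ok in enumerate(checks) if monitor.record(ok)]
```

Requiring several consecutive failures is a deliberate trade-off. It avoids failing over on a transient blip, at the cost of a slightly longer detection time that has to fit inside your RTO.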

Once you have identified your operational priorities and opportunities for applying automation to improve response and recovery, it is easier to adopt the means to achieve "disaster recovery in-depth." Disaster recovery in-depth is a strategy for adopting the right technologies and processes for providing a level of protection appropriate for all the systems your organization relies on.

Among those processes is an HA/DR solution that complements your organization's IT architecture and has the flexibility to accommodate a variety of applications and respond to different contingencies (application crashes, data center loss, network issues, cloud service outages) to reliably deliver operational continuity. HA/DR requires hardware and software redundancy, plus staff redundancy to avoid "knowledge silos." Staff redundancy simply means making sure more than one person on your IT team has the information and experience needed to manage, maintain, and run the systems you rely on. Without it, your risk is too high. If the one person who holds that knowledge leaves your organization or is unavailable when needed, the chances of an error or delay in disaster response increase.

Follow Best Practices

There is no blueprint that every organization can use to establish disaster recovery in-depth, but there are some best practices to consider, including:

Design with high availability in mind.
Choose solutions that meet your use case.
Plan for geographic spacing (mirror systems in one region with a standby system in another region).
Identify and eliminate single points of failure (don't protect your applications but ignore your DNS server).
Document everything (application names and versions, vendor contacts, operational protocols, etc.).
Communicate all pertinent information and expectations thoroughly and effectively between teams.
Automate wherever possible.
Test, test, test.
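The single-point-of-failure check in the list above can be partly automated once dependencies are documented. The following is a minimal sketch, assuming you maintain a simple inventory of what each service depends on and how many instances of each component exist. The service and component names are made up for illustration.

```python
def unprotected_dependencies(
    services: dict[str, list[str]],
    instance_counts: dict[str, int],
) -> dict[str, list[str]]:
    """Map each service to the dependencies that have fewer than two instances."""
    report = {}
    for service, dependencies in services.items():
        # A component missing from the inventory counts as zero instances.
        weak = sorted(d for d in dependencies if instance_counts.get(d, 0) < 2)
        if weak:
            report[service] = weak
    return report


# Hypothetical inventory: the application tier is redundant, DNS is not.
services = {
    "order-app": ["app-server", "database", "dns-server"],
    "reporting": ["database"],
}
instance_counts = {"app-server": 2, "database": 2, "dns-server": 1}

report = unprotected_dependencies(services, instance_counts)
```

Here the report flags `order-app` because it depends on a lone `dns-server`, even though its application and database tiers are protected. An instance count is only a rough proxy for redundancy, so treat the output as a prompt for review, not as proof of resilience.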

Finally, if yours is an organization that relies heavily on a single cloud infrastructure provider, consider moving to a hybrid or multi-cloud architecture as a part of your high availability design. Think of it like diversifying your 401(k) rather than investing in a single stock. If you invest everything in Acme Widgets and that company fails, your assets could be wiped out. Cloud diversification avoids a potentially disastrous single-point-of-failure scenario and allows you to adopt a technology like SANless clusters to achieve high availability in a cloud-dependent environment.

Be Ready When Disasters Strike

SANless clustering solutions support seamless failover for your mission-critical applications in a multi-cloud or hybrid environment by keeping the local storage on each node synchronized through real-time, block-level replication (synchronous or asynchronous). Used in conjunction with clustering software, such as Windows Server Failover Clustering, SANless clusters function like traditional SAN-based storage hardware (without the resource drain). They also have the flexibility needed to configure nodes within geographically distributed data centers or in cloud availability zones.
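The choice between synchronous and asynchronous replication maps directly onto RPO. The toy model below is a sketch for intuition only. It is not how any particular clustering product is implemented, and the class and method names are invented for this example. It counts how many written blocks would be missing on the standby if the primary failed at a given moment.

```python
class ReplicatedVolume:
    """Toy model of block-level replication from a primary to a standby node."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary: list[bytes] = []
        self.replica: list[bytes] = []
        self.pending: list[bytes] = []  # written locally, not yet replicated

    def write(self, block: bytes) -> None:
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the write completes only once the replica has it.
            self.replica.append(block)
        else:
            # Asynchronous: the write completes now and is shipped later.
            self.pending.append(block)

    def flush(self) -> None:
        """Ship all queued blocks to the replica (asynchronous mode)."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def blocks_lost_on_failover(self) -> int:
        """Blocks on the primary that the replica does not have yet."""
        return len(self.primary) - len(self.replica)


sync_volume = ReplicatedVolume(synchronous=True)
for block in (b"a", b"b", b"c"):
    sync_volume.write(block)

async_volume = ReplicatedVolume(synchronous=False)
async_volume.write(b"a")
async_volume.write(b"b")
async_volume.flush()      # the first two blocks reach the replica
async_volume.write(b"c")  # the primary fails before this block is shipped
```

In this model the synchronous volume never has unreplicated blocks, while the asynchronous one is exposed to whatever is still queued. Across long distances the added write latency of synchronous replication can be significant, which is one reason asynchronous replication is often paired with geographically distant DR nodes.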

It is a disaster-prone world, and you've got too much riding on your IT infrastructure to risk a catastrophic incident resulting from an avoidable single point of failure. With solid planning and the adoption of a disaster recovery in-depth strategy, you can achieve high availability and disaster recovery for your critical operations.

About the author:

Philip Merry is a Customer Experience Software Engineer at SIOS Technology Corp., with a deep understanding of SAP and Linux technologies. He is passionate about creating innovative solutions in the technology industry by utilizing his expertise in software engineering, cloud architecture, networking, and system administration. He enjoys exploring new technologies and innovative projects and constantly seeks to expand his expertise in his field. Philip holds a bachelor's degree in computer science from Clemson University.
