Insight and analysis on the information technology space from industry thought leaders.

How IT Leaders Can Assess and Evaluate Risk to Mitigate IT Outages

IT outages are inevitable, but IT leaders can implement strategies to greatly reduce their risk and recover more quickly when they occur.

Industry Perspectives

September 19, 2024


By Scott Willson, xtype

IT failures are inevitable, but their impact can be minimized with the right strategies in place. As technology becomes increasingly integral to business operations, IT leaders must be prepared to assess and mitigate risks effectively to prevent costly outages. According to Uptime Institute's 2024 outage analysis, between 10 and 20 "high-profile IT outages or data center events" occur every year, signaling that there is no way to sidestep such incidents.

The cost of IT outages extends beyond the immediate financial loss. Reputational damage is a lingering consequence of high-profile outages. AT&T's shares fell 2.4% during an outage in February. In the global outage in July that affected major industries and household names, CrowdStrike, the cybersecurity company at the center of the issue, lost $25 billion of its market value within days of the incident, faced lawsuits, and received an award for the "most epic fail."

Fortunately, risk is a topic that resonates at the board level, underscoring the importance of building resilience from leadership down through all critical functions. While it will likely never be possible to prevent all outages, there are several tactics IT leaders can deploy to significantly reduce risk surrounding outages and to rebound more quickly from unpreventable incidents.


Assessing Risk at Every Touchpoint

Uptime Institute's study highlights several causes of outages, including reliance on third-party providers, system and software issues such as configuration errors, data synchronization, and human error. This indicates that downtime can be triggered from almost any point in a company's system, emphasizing the need for IT leaders to scrutinize every potential point of failure.

Creating a risk assessment matrix can help prioritize threats based on their likelihood and impact, enabling IT leaders to allocate resources more effectively and focus on the most critical areas first.
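As an illustration, a likelihood-and-impact matrix can be reduced to a few lines of Python. This is a minimal sketch, not a prescribed method: the 1-to-5 scales, the rating thresholds, the example risks, and the names `Risk`, `rating`, and `prioritize` are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """One row of a hypothetical risk assessment matrix."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    def __post_init__(self):
        # Reject values outside the assumed 1-5 scales.
        for value in (self.likelihood, self.impact):
            if not 1 <= value <= 5:
                raise ValueError("likelihood and impact must be between 1 and 5")

    @property
    def score(self) -> int:
        # Classic matrix scoring: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rating(score: int) -> str:
    """Map a score to a band. The thresholds are illustrative only."""
    if score >= 15:
        return "high"
    if score >= 8:
        return "medium"
    return "low"


def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Example entries, invented for illustration.
risks = [
    Risk("Third-party provider failure", likelihood=3, impact=5),
    Risk("Configuration error in release", likelihood=4, impact=4),
    Risk("Backup restore not tested", likelihood=2, impact=5),
    Risk("Expired internal certificate", likelihood=2, impact=2),
]

for risk in prioritize(risks):
    print(f"{risk.score:>2}  {rating(risk.score):<6}  {risk.name}")
```

The ordered output gives a starting point for deciding where to spend resources first; the scales and thresholds should be set by whoever owns the assessment.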

Here is a seven-point approach that can help IT leaders evaluate their preparedness for downtime and spot risks:

  • Infrastructure: Audit all potential touchpoints within your IT ecosystem, including hardware, software, network connections, and third-party integrations.

  • Applications: Conduct thorough testing of all applications, including updates and patches.

  • Data: Implement and maintain robust data backup and recovery systems to safeguard critical information and ensure business continuity.

  • Human factors: Train and upskill your team members, focusing on best practices and security protocols to minimize human-induced risks.

  • CI/CD pipelines: Audit your pipelines to ensure that risk-reduction policies such as code scans and test automation are enforced. Also ensure that the same automation mechanics used in production are used in lower environments; this proves the quality of delivery pipelines before they are used in production.

  • Vet open source code: MIT News states, "Ninety-six percent of the computer programs used by major industries include open source software." Anyone can contribute to open source software, including hackers and state-sponsored actors. Treat open source components with skepticism until they have earned credibility.

  • Not all patches are alike: Do not apply patches blindly. Every software patch should go through a risk-reward assessment to determine the appropriate window for deployment. Ideally, patches should receive the same rigor your code changes receive as they move through your CI/CD pipelines.


Other factors to consider are historical performance, known vulnerabilities, and dependency on external vendors.

Risk assessments are not one-time measures. Implementing a schedule of regular, comprehensive assessments ensures you remain proactive rather than reactive, minimizing the chances of unforeseen outages.
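The CI/CD pipeline item in the list above lends itself to an automated check. The sketch below assumes each environment's pipeline can be exported as an ordered list of step names; the step names, the `REQUIRED_STEPS` set, and the `audit_pipelines` function are hypothetical and not tied to any particular CI/CD product.

```python
# Steps assumed to be mandatory in every environment (illustrative names).
REQUIRED_STEPS = {"code_scan", "automated_tests", "approval"}


def audit_pipelines(pipelines: dict[str, list[str]],
                    production: str = "production") -> list[str]:
    """Return human-readable findings for a set of pipeline definitions.

    Two checks are made per environment:
      1. every required step is present, and
      2. lower environments run the same steps, in the same order,
         as the production pipeline.
    """
    findings = []
    prod_steps = pipelines[production]
    for env, steps in pipelines.items():
        missing = REQUIRED_STEPS - set(steps)
        if missing:
            findings.append(f"{env}: missing required steps {sorted(missing)}")
        if env != production and steps != prod_steps:
            findings.append(f"{env}: steps differ from {production}")
    return findings


# Example pipeline definitions, invented for illustration.
pipelines = {
    "dev": ["build", "automated_tests", "deploy"],
    "staging": ["build", "code_scan", "automated_tests", "approval", "deploy"],
    "production": ["build", "code_scan", "automated_tests", "approval", "deploy"],
}

findings = audit_pipelines(pipelines)
for finding in findings:
    print(finding)
```

Here the `dev` pipeline is flagged twice: it skips the scan and approval steps, and it does not match production. Running a check like this on a schedule fits the point that assessments should be recurring rather than one-time.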

Why Monitoring and Automation Are Key to Detecting Issues Early in IT Deployments

Giving developers full visibility into production environments is an often underestimated measure for preventing major outages. Real-time, multi-environment visibility ensures that developers are not left in the dark about where problems originate. Investing in such monitoring technology provides a level of control that allows for the integration of crucial safety steps such as approvals, scans, and automated testing, so issues are identified and addressed early in the development cycle.

For example, a faulty plugin deployed to production can trigger a major issue and a prolonged code freeze. To prevent such occurrences, it's crucial to ensure non-production instances closely mirror production environments. While many developers rely on cloning to achieve this, it's a tedious process. Instead, controlled instance synchronization keeps all environments aligned and significantly reduces the risk of unexpected issues.
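One way to see whether a non-production instance has drifted from production is to compare their settings key by key. The following is a minimal sketch of a drift check only, not of the controlled synchronization described above. It assumes each instance's plugin versions and settings can be exported as a flat dictionary; the keys, values, and the `find_drift` helper are hypothetical.

```python
MISSING = "<missing>"  # placeholder for a key absent from one environment


def find_drift(production: dict, other: dict) -> dict:
    """Return {key: (production_value, other_value)} for every mismatch."""
    drift = {}
    # Walk the union of keys so settings missing on either side are reported.
    for key in sorted(production.keys() | other.keys()):
        prod_value = production.get(key, MISSING)
        other_value = other.get(key, MISSING)
        if prod_value != other_value:
            drift[key] = (prod_value, other_value)
    return drift


# Example exports, invented for illustration.
production = {"plugin_a": "2.1.0", "plugin_b": "1.4.2", "feature_flag_x": "on"}
staging = {"plugin_a": "2.1.0", "plugin_b": "1.5.0"}

drift = find_drift(production, staging)
for key, (prod_value, staging_value) in drift.items():
    print(f"{key}: production={prod_value} staging={staging_value}")
```

A non-empty result means tests run in staging are not exercising what production runs, which is the situation in which a faulty plugin can slip through.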

Another key lesson from the CrowdStrike incident is the importance of automating the software delivery pipeline, from moving code through testing to deployment and maintenance. Automation helps developers catch issues that might be overlooked in manual reviews, providing a safety net that protects the integrity of the code before it reaches production.

Communication's Role in Minimizing the Impact of IT Outages

During an outage, timely and transparent communication can help manage expectations and reduce panic. Response protocols should be integrated into your comprehensive disaster recovery strategy to ensure seamless coordination across all teams.

Once the situation is clearly defined, your initial response should inform all stakeholders about the issue's status, the steps to resolve it, and the expected timeline for resolution. The severity of the outage determines the appropriate spokesperson to provide trustworthy information. High-profile incidents typically necessitate communication from the most senior leadership. This communication approach serves dual purposes: It maintains stakeholder trust in the brand during a challenging period and keeps all parties aligned in their response efforts.

Communicate with empathy toward your customers as well. A poorly handled incident can quickly escalate into a major catastrophe. Delivering the right messages in both initial and ongoing responses gives customers and other stakeholders confidence that you are managing the situation effectively and can handle future crises.

Post-incident, conduct a detailed analysis and share learnings with your team. Effective communication does not end with resolution; the goal is to fully understand what went wrong and how to prevent similar incidents in the future.

Planning for Recovery and Business Continuity

Business resilience relies on pre-emptive frameworks designed to mitigate potential threats and vulnerabilities. IT leaders must be prepared for adverse events, and a disaster recovery (DR) plan is essential for minimizing downtime when such events occur.

In the aftermath of an outage, a recovery plan is the first resource for restoring an organization to normal or near-normal operations. While it falls under the IT team's purview, it is a critical component of the broader business continuity (BC) plan, which accounts for financial, legal, reputational, and regulatory considerations.

A robust business impact analysis forms the backbone of any effective recovery strategy. Other elements of a recovery plan include a thorough risk analysis and vulnerability assessment, an inventory of assets, a clear definition of roles and responsibilities, a communication plan, and activation guidelines.

Creating a business continuity and disaster recovery (BCDR) plan is, however, only the beginning. You have to continuously test that the plan works and build muscle memory across relevant teams so they can handle incidents effectively. An effective BCDR strategy must be kept up to date, especially when the organization undergoes significant changes or reaches key milestones.

No organization is immune to IT disruptions, but with a solid recovery strategy in place, an organization can withstand and quickly rebound from outages.

About the author:

Scott Willson is Head of Product Marketing at xtype.
