What's Causing Cloud Outages? A Network Manager's Guide

From fat-finger errors to fishing boats, here are the leading reasons cloud outages at AWS, Microsoft, and others are a growing network resilience challenge.

As enterprises rely more and more on cloud services to meet their network infrastructure, compute, data storage, and security needs, cloud computing outages have a significant impact on operations.

Many believe (or hope) that moving services to the cloud will eliminate some issues. After all, you would assume cloud providers use the latest technologies, employ staff with expertise in those technologies, and build in lots of redundancy.

Unfortunately, cloud outages have a lot in common with their data center counterparts. Many occur due to human error, power outages, malicious acts, Mother Nature, or plain bad luck.

What's Causing Cloud Outages?

There are several common culprits behind cloud outages. Over the last few years, we have seen examples of each, and all have had a significant impact on the enterprises using the affected services. Here are some of the top problems that keep recurring.

Configuration mistakes

We're in the age of graphical user interfaces (GUIs) and automation. Yet many critical IT chores, such as deploying a new server, provisioning storage for an application, or setting up new routing tables, are still done manually via command-line interfaces (CLIs). As one would expect, that can lead to configuration mistakes.
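One common way to reduce the risk of manual mistakes is to run a proposed change through an automated check before it is applied. The following is a minimal sketch of that idea, not a description of any provider's tooling: the function name `validate_route_change`, the 20% removal threshold, and the example prefixes (taken from the documentation address ranges) are all assumptions made for illustration.

```python
import ipaddress


def validate_route_change(current, proposed, protected, max_removed_fraction=0.2):
    """Return a list of reasons to reject a routing change (empty means OK).

    Hypothetical guardrail: the threshold and checks are illustrative
    assumptions, not a vendor or provider implementation.
    """
    current_nets = {ipaddress.ip_network(p) for p in current}
    proposed_nets = {ipaddress.ip_network(p) for p in proposed}
    problems = []

    # Check 1: flag changes that remove an unusually large share of routes.
    removed = current_nets - proposed_nets
    if current_nets and len(removed) / len(current_nets) > max_removed_fraction:
        problems.append(f"removes {len(removed)} of {len(current_nets)} routes")

    # Check 2: every protected address (for example, a DNS server) must
    # still be covered by at least one route after the change.
    for addr in protected:
        ip = ipaddress.ip_address(addr)
        if not any(ip in net for net in proposed_nets):
            problems.append(f"no remaining route covers protected address {addr}")

    return problems


if __name__ == "__main__":
    issues = validate_route_change(
        current=["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"],
        proposed=["192.0.2.0/24"],
        protected=["203.0.113.53"],  # assumed DNS server address
    )
    for issue in issues:
        print("REJECT:", issue)
```

A check like this does not replace peer review or staged rollouts, but it can catch the kind of sweeping change that is easy to type and hard to undo.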

That is often the case with cloud outages. One such mistake, a routing protocol configuration issue, caused a six-hour outage of Facebook, Instagram, Messenger, WhatsApp, and Oculus VR. As we wrote at that time: "The outage was the result of a misconfiguration of Facebook's server computers, preventing external computers and mobile devices from connecting to the Domain Name System (DNS) and finding Facebook, Instagram, and Whatsapp."

Essentially, the routes to Facebook's networks were no longer being advertised through the Border Gateway Protocol (BGP), which prevented traffic destined for those networks from being routed properly. Resolving the problem was more challenging than normal because the failure interrupted not only communication between routers but also DNS traffic and all applications.

The problem here was that everything ran over the same network. As a result, IT staff could not correct the problem remotely because they could not reach the impacted systems. Making matters worse, IT staff were locked out of facilities because the access control system also ran over that same network.
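The shared-fate problem described above can be checked for ahead of time if the dependencies are written down. The sketch below is one rough way to do that, under stated assumptions: the function name `shared_fate`, the system names, and the network names (`backbone`, `oob_mgmt`) are hypothetical and are not drawn from Facebook's actual architecture.

```python
def shared_fate(dependencies, recovery_systems):
    """Find networks whose failure would hit production and recovery together.

    dependencies: dict mapping each system name to the set of networks it needs.
    recovery_systems: set of system names used to fix outages
        (remote admin tools, building access control, console servers).
    Returns {network: sorted list of recovery systems that share it with
    at least one production system}.
    """
    risks = {}
    for recovery in recovery_systems:
        for network in dependencies.get(recovery, set()):
            # Does any non-recovery (production) system rely on this network?
            shares_with_production = any(
                network in nets
                for name, nets in dependencies.items()
                if name not in recovery_systems
            )
            if shares_with_production:
                risks.setdefault(network, set()).add(recovery)
    return {net: sorted(names) for net, names in risks.items()}


if __name__ == "__main__":
    # Illustrative inventory only; all names are assumptions.
    deps = {
        "dns": {"backbone"},
        "web_apps": {"backbone"},
        "remote_admin_tools": {"backbone"},
        "door_badge_system": {"backbone"},
        "console_servers": {"oob_mgmt"},
    }
    recovery = {"remote_admin_tools", "door_badge_system", "console_servers"}
    for network, systems in shared_fate(deps, recovery).items():
        print(f"{network}: recovery systems at risk -> {', '.join(systems)}")
```

In this made-up inventory, the remote admin tools and the badge system are flagged because they depend on the same network as production, while the console servers on a separate out-of-band management network are not.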

Read the rest of this article on Network Computing.

About the Author(s)

Salvatore Salamone

Managing editor, Network Computing

Salvatore Salamone is the managing editor of Network Computing. He has worked as a writer and editor covering business, technology and science; written three business technology books; and served as an editor at IT industry publications including Network World, Byte, Bio-IT World, Data Communications, LAN Times and InternetWeek.

Network Computing

Network Computing, a sister site to ITPro Today, provides community members with in-depth analysis on new and emerging infrastructure technologies, real-world advice on implementation and operations, and practical strategies for improving their skills and advancing their careers. Its community is a trusted resource for IT architects and engineers who must understand business requirements as well as build and manage the infrastructures to meet those needs.
