What's Causing Cloud Outages? A Network Manager's Guide
From fat-finger errors to fishing boats, here are the leading reasons cloud outages at AWS, Microsoft, and others are a growing network resilience challenge.
As enterprises increasingly rely on cloud services to meet their network infrastructure, compute, data storage, and security needs, cloud computing outages have a significant impact on operations.
Many believe (or hope) that moving services to the cloud will eliminate some of these issues. After all, you would assume cloud providers use the latest technologies, employ staff with expertise in those technologies, and build in plenty of redundancy.
Unfortunately, what we find is that cloud outages have a lot in common with their data center outage counterparts. Many occur due to human error, power outages, malicious acts, Mother Nature, or plain bad luck.
What's Causing Cloud Outages?
There are several common culprits behind cloud outages. Over the last few years, we have seen examples of each, and all have had a significant impact on the enterprises using the affected services. Here are some of the top problems that keep recurring.
Configuration mistakes
We're in the age of graphical user interfaces (GUIs) and automation. Yet, many critical IT chores like deploying a new server, provisioning storage for an application, or setting up new router tables are done manually via command line interfaces (CLIs). As one would expect, that can lead to configuration mistakes.
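To make that failure mode concrete, here is a minimal sketch, hypothetical and built only on Python's standard ipaddress module, of the kind of pre-deployment sanity check that can catch a fat-fingered prefix before it reaches a production router. The route entries and next hops are invented for illustration.

```python
import ipaddress

# Hypothetical, manually entered static routes (destination prefix -> next hop).
# The second entry contains a fat-finger typo: /240 is not a valid prefix length.
candidate_routes = {
    "10.20.0.0/16": "192.168.1.1",
    "10.30.0.0/240": "192.168.1.1",
}

def validate_routes(routes):
    """Return (prefix, error) pairs for entries that fail basic sanity checks."""
    errors = []
    for prefix, next_hop in routes.items():
        try:
            ipaddress.ip_network(prefix, strict=True)   # rejects malformed prefixes
            ipaddress.ip_address(next_hop)              # rejects malformed next hops
        except ValueError as exc:
            errors.append((prefix, str(exc)))
    return errors

if __name__ == "__main__":
    for prefix, error in validate_routes(candidate_routes):
        print(f"Rejected route {prefix}: {error}")
```

A check like this catches only the most obvious slips; subtler configuration mistakes still make it into production.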
That is often the case with cloud outages. One such mistake caused a six-hour outage of Facebook, Instagram, Messenger, WhatsApp, and Oculus VR due to a routing protocol configuration issue. As we wrote at that time: "The outage was the result of a misconfiguration of Facebook's server computers, preventing external computers and mobile devices from connecting to the Domain Name System (DNS) and finding Facebook, Instagram, and Whatsapp."
Essentially, Facebook's BGP route advertisements were withdrawn, so traffic destined for Facebook's networks could no longer be routed properly. Resolving the problem was more challenging than normal because the misconfiguration interrupted not only communication between routers, but also DNS traffic and all applications.
The problem here was that everything ran over the same network. As a result, IT staff could not remotely correct the problem because they could not access the impacted systems. And making matters worse, IT staff were locked out of facilities because their access control system also ran over the same network.
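From outside the affected network, an outage like this first shows up as names that simply stop resolving. The sketch below is a rough illustration of that symptom, not Facebook's actual tooling: a small probe built on Python's standard socket module that reports whether a few domains still resolve through the system resolver.

```python
import socket

# Domains to probe from outside the affected network (illustrative list).
DOMAINS = ["facebook.com", "instagram.com", "whatsapp.com"]

def resolves(domain: str) -> bool:
    """Return True if the system resolver returns at least one address for the domain."""
    try:
        return len(socket.getaddrinfo(domain, 443)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved, e.g., records withdrawn
        # or the resolver itself unreachable.
        return False

if __name__ == "__main__":
    for domain in DOMAINS:
        print(f"{domain}: {'resolves' if resolves(domain) else 'DNS lookup failed'}")
```

Running a probe like this from a network that does not depend on the monitored infrastructure is what lets it keep reporting even when the provider's own management plane is unreachable.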