A Deep Dive into the Recent Microsoft Cloud Services Outage

Configuration changes and DNS issues have caused multiple major outages over the last two years, and more are likely to come.

Last month's global disruption of Microsoft cloud services, including Azure, Teams, and Outlook, was the latest in an increasingly common series of cloud outages. In this case, the cause was a seemingly innocuous WAN router update gone wrong. But it underscores a point we've made repeatedly about the fragility of the world's communications infrastructure.

In this latest incident, which lasted about two and a half hours, millions of users experienced network connectivity issues when trying to access Microsoft's cloud-hosted services. In a post-mortem explaining what happened, Microsoft noted: "a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address for each new router, and integration into the IGP (Interior Gateway Protocol, a protocol used for connecting all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, a protocol used for distributing Internet routing information into Microsoft’s WAN) routing domains."

It further noted that the company has an SOP (standard operating procedure) when making such changes. The SOP details a four-step process that includes testing the change in a network emulator; testing the change in a lab setting; a review documenting these first two steps, as well as roll-out and roll-back plans; and a safe deployment approach that only allows access to one device at a time to limit impact if there are any issues once an update is started.
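The fourth step, touching only one device at a time, is the easiest to picture in code. What follows is a minimal sketch of the idea, not Microsoft's tooling: the function `staged_rollout`, the `RolloutResult` type, the router names, and the three callbacks are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)
    halted: bool = False


def staged_rollout(
    devices: Iterable[str],
    apply_change: Callable[[str], None],
    post_check: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> RolloutResult:
    """Apply a change to one device at a time, stopping at the first failed post-check."""
    result = RolloutResult()
    for device in devices:
        apply_change(device)
        if post_check(device):
            result.updated.append(device)
            continue
        # Post-check failed: undo this device and leave the remaining ones untouched.
        roll_back(device)
        result.rolled_back.append(device)
        result.halted = True
        break
    return result


# Hypothetical in-memory "network" used only to exercise the sketch.
config = {"router-a": "old", "router-b": "old", "router-c": "old"}


def apply_change(device: str) -> None:
    config[device] = "new"


def roll_back(device: str) -> None:
    config[device] = "old"


def post_check(device: str) -> bool:
    # Assumption for the demo: the change misbehaves on router-b.
    return device != "router-b"


result = staged_rollout(list(config), apply_change, post_check, roll_back)
print(result.updated, result.rolled_back, result.halted)
# prints: ['router-a'] ['router-b'] True
```

In this toy run the failure on the second router halts the rollout, so the third router is never touched. A per-device loop like this limits impact only when a change's effect stays on the device it is applied to.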

Unfortunately, the SOP was changed before the scheduled update. Microsoft noted: "Critically, our process was not followed as the change was not re-tested and did not include proper post-checks per steps one through four. This unqualified change led to a chain of events that culminated in the widespread impact of this incident."
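One way to make "changed but not re-tested" hard to miss is to tie qualification to the exact content of a change rather than to its ticket or name. The sketch below illustrates that idea under stated assumptions: the `ChangeGate` class, the `fingerprint` helper, and the placeholder procedure text are hypothetical, and nothing in the post-mortem says Microsoft uses such a mechanism.

```python
import hashlib


def fingerprint(change_text: str) -> str:
    """Return a SHA-256 digest that identifies the exact text of a change."""
    return hashlib.sha256(change_text.encode("utf-8")).hexdigest()


class ChangeGate:
    """Allows deployment only for change text that has been recorded as qualified."""

    def __init__(self) -> None:
        self._qualified = set()

    def record_qualification(self, change_text: str) -> None:
        # Called after emulator testing, lab testing, and review have passed.
        self._qualified.add(fingerprint(change_text))

    def may_deploy(self, change_text: str) -> bool:
        # Any edit to the text yields a new digest, so it must be re-qualified.
        return fingerprint(change_text) in self._qualified


# Placeholder procedure text; these are not real router commands.
tested = "step 1: set router address\nstep 2: join IGP\nstep 3: join BGP"
modified = tested + "\nstep 4: purge IGP database"

gate = ChangeGate()
gate.record_qualification(tested)
print(gate.may_deploy(tested), gate.may_deploy(modified))
# prints: True False
```

The gate does not judge whether the added step is safe. It only refuses to treat the earlier sign-off as covering text that was never tested.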

What happened? The change added a command to purge the IGP database. However, Microsoft noted that the command behaves differently on routers from different manufacturers. "Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute across all IGP joined routers, ordering them all to recompute their IGP topology databases."
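The difference in scope is what turned a single-router task into a network-wide event. As a rough illustration only, the sketch below models that difference. The post-mortem does not name the manufacturers, so `vendor-a`, `vendor-b`, and `vendor-c` are placeholders, and `PurgeScope` and `affected_routers` are hypothetical names.

```python
from enum import Enum
from typing import Dict, Iterable, Set


class PurgeScope(Enum):
    LOCAL = "local"              # command acts only on the router where it is run
    DOMAIN_WIDE = "domain-wide"  # command is executed by every IGP-joined router


# Placeholder vendors: two limit the command locally, a third does not.
VENDOR_SCOPE: Dict[str, PurgeScope] = {
    "vendor-a": PurgeScope.LOCAL,
    "vendor-b": PurgeScope.LOCAL,
    "vendor-c": PurgeScope.DOMAIN_WIDE,
}


def affected_routers(
    target: str,
    vendor_of: Dict[str, str],
    igp_members: Iterable[str],
) -> Set[str]:
    """Return the routers that recompute their IGP database when the purge runs on target."""
    scope = VENDOR_SCOPE[vendor_of[target]]
    if scope is PurgeScope.LOCAL:
        return {target}
    return set(igp_members)


# Hypothetical four-router IGP domain.
vendor_of = {"r1": "vendor-a", "r2": "vendor-b", "r3": "vendor-c", "r4": "vendor-a"}
igp_members = list(vendor_of)

print(sorted(affected_routers("r1", vendor_of, igp_members)))
# prints: ['r1']
print(sorted(affected_routers("r3", vendor_of, igp_members)))
# prints: ['r1', 'r2', 'r3', 'r4']
```

Running the same command on `r3` instead of `r1` changes the affected set from one router to the whole domain, which is why touching one device at a time could not contain the impact here.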

Read the rest of this article on Network Computing.

About the Author(s)

Salvatore Salamone

Managing editor, Network Computing

Salvatore Salamone is the managing editor of Network Computing. He has worked as a writer and editor covering business, technology and science; written three business technology books; and served as an editor at IT industry publications including Network World, Byte, Bio-IT World, Data Communications, LAN Times and InternetWeek.
