A Deep Dive into the Recent Microsoft Cloud Services Outage
Configuration changes and DNS issues have caused multiple major outages over the last two years, and more are almost certainly on the way.
Last month's global disruption of Microsoft cloud services, including Azure, Teams, and Outlook, was the latest in an increasingly common string of cloud outages. In this case, the cause was a seemingly routine WAN router update gone wrong. The incident underscores a point we have made repeatedly: the world's global communications infrastructure is fragile.
In this latest incident, which lasted about two and a half hours, millions of users experienced network connectivity issues when trying to access Microsoft's cloud-hosted services. In a post-mortem explaining what happened, Microsoft noted: "a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address for each new router, and integration into the IGP (Interior Gateway Protocol, a protocol used for connecting all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, a protocol used for distributing Internet routing information into Microsoft’s WAN) routing domains."
It further noted that the company has an SOP (standard operating procedure) for making such changes. The SOP details a four-step process: testing the change in a network emulator; testing the change in a lab setting; a review documenting these first two steps, as well as roll-out and roll-back plans; and a safe deployment approach that allows access to only one device at a time, limiting the impact if problems arise once an update starts.
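The fourth step, touching only one device at a time, can be illustrated with a minimal Python sketch. This is not Microsoft's tooling: the function name staged_rollout and the three callbacks (apply_change, health_check, roll_back) are hypothetical stand-ins for whatever automation an operator actually uses.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    devices: List[str],
    apply_change: Callable[[str], None],
    health_check: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change to one device at a time (hypothetical helper).

    Stops at the first device that fails its post-check, rolls that
    device back, and leaves every remaining device untouched.
    Returns (devices updated successfully, device that failed or None).
    """
    updated: List[str] = []
    for device in devices:
        apply_change(device)
        if not health_check(device):
            # Post-check failed: undo this device and halt the rollout.
            roll_back(device)
            return updated, device
        updated.append(device)
    return updated, None
```

The point of the design is that a bad change is caught after the first device rather than after the whole fleet. It only helps, though, if the change itself stays confined to the device it is applied to.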
Unfortunately, the SOP was changed before the scheduled update. Microsoft noted: "Critically, our process was not followed as the change was not re-tested and did not include proper post-checks per steps one through four. This unqualified change led to a chain of events that culminated in the widespread impact of this incident."
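One way to guard against this kind of gap is to tie every sign-off to the exact content of the change, so that any later edit invalidates the earlier testing. The Python sketch below is only an illustration under that assumption. The class ChangeRequest, the step names, and the placeholder command strings are all hypothetical and do not describe Microsoft's process tooling.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical names for the pre-deployment steps of the SOP.
REQUIRED_STEPS = ("emulator_test", "lab_test", "review")


def fingerprint(commands: List[str]) -> str:
    """Hash the exact command list so any edit produces a new value."""
    return hashlib.sha256("\n".join(commands).encode("utf-8")).hexdigest()


@dataclass
class ChangeRequest:
    commands: List[str]
    # Maps each completed step to the fingerprint it was performed against.
    signoffs: Dict[str, str] = field(default_factory=dict)

    def sign_off(self, step: str) -> None:
        if step not in REQUIRED_STEPS:
            raise ValueError(f"unknown step: {step}")
        self.signoffs[step] = fingerprint(self.commands)

    def is_qualified(self) -> bool:
        """True only if every step was signed off against the current commands."""
        current = fingerprint(self.commands)
        return all(self.signoffs.get(step) == current for step in REQUIRED_STEPS)


change = ChangeRequest(commands=["<update router IP address>"])  # placeholder text
for step in REQUIRED_STEPS:
    change.sign_off(step)
qualified_before_edit = change.is_qualified()

change.commands.append("<purge IGP database>")  # late edit, never re-tested
qualified_after_edit = change.is_qualified()
```

In this sketch, qualified_before_edit is True and qualified_after_edit is False, so a deployment gate that checks is_qualified() would refuse the modified change until it had been tested again.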
What happened? The change added a command to purge the IGP database. However, Microsoft noted that the command behaves differently on routers from different manufacturers. "Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute across all IGP joined routers, ordering them all to recompute their IGP topology databases."
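The difference in scope can be made concrete with a small Python model. This is a sketch only: the post-mortem excerpt does not name the manufacturers, so vendor_a, vendor_b, and vendor_c are hypothetical labels, and affected_routers is an illustrative helper rather than a real router API.

```python
from enum import Enum
from typing import Dict, Set


class PurgeScope(Enum):
    LOCAL = "local"              # command affects only the router it runs on
    DOMAIN_WIDE = "domain-wide"  # command reaches every router in the IGP domain


# Hypothetical vendor labels; the real manufacturers are not named in the source.
PURGE_SCOPE_BY_VENDOR: Dict[str, PurgeScope] = {
    "vendor_a": PurgeScope.LOCAL,
    "vendor_b": PurgeScope.LOCAL,
    "vendor_c": PurgeScope.DOMAIN_WIDE,
}


def affected_routers(target: str, igp_domain: Dict[str, str]) -> Set[str]:
    """Return the routers that recompute their IGP database.

    igp_domain maps each router name to its vendor label.
    """
    scope = PURGE_SCOPE_BY_VENDOR[igp_domain[target]]
    if scope is PurgeScope.LOCAL:
        return {target}
    return set(igp_domain)


# A toy three-router domain with one router from each hypothetical vendor.
domain = {"madrid-1": "vendor_c", "paris-1": "vendor_a", "london-1": "vendor_b"}
local_impact = affected_routers("paris-1", domain)
wide_impact = affected_routers("madrid-1", domain)
```

Here local_impact contains only paris-1, while wide_impact contains all three routers. The same command, run one device at a time as the SOP requires, still reaches the whole domain when the target router is of the third type, which is why a device-by-device rollout alone did not contain the failure.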