Anticipating The Perfect Storm of Impossible Events

Jesse Robbins of Opscode says that resiliency is a function of culture, as well as engineering. "You cannot learn the lessons of failure without experiencing it," said Robbins. That can be a difficult message for IT operations teams that view downtime as an enemy to be avoided at all costs.

Data Center Knowledge

February 20, 2012


Jesse Robbins is a trained fireman. He also has managed some of the world's largest Internet infrastructures. Robbins says the lessons of fire readiness can be applied to building reliable systems.

"You cannot learn the lessons of failure without experiencing it," said Robbins, the co-founder and Chief Community Officer at Opscode. "That's why we do fire drills."

In a keynote at last week's Cloud Connect conference, Robbins said that resiliency is a function of culture, as well as engineering. That can be a difficult message for IT operations teams that view downtime as an enemy to be avoided at all costs - the Voldemort ("He Who Must Not Be Named!") of the data center.

"Failure happens," said Robbins. "You just have to design for it. Every organization must learn that this is going to occur."

'Unpredictable' Outages

Robbins, who earned the moniker "The Master of Disaster" during his time on the infrastructure team at Amazon, points to incident reports as evidence of the prevailing sentiment about outages.

"If you search for 'outage post-mortem' on the Internet, what you find is people talking about how crazy and impossible the outage was," he said. "It is always a perfect storm of impossible events. 'We could never have known that there was this one latency defect.' "

The way to prevail is to assume that this type of hard-to-predict defect will eventually materialize, and to go looking for it rather than being surprised when it reveals itself in a crisis. One way to accomplish this is "resilience engineering" featuring fault injection - introducing unexpected events to see how the system responds. The most famous example of this is the Chaos Monkey used by the Netflix engineering team.
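To make the idea of fault injection concrete, here is a minimal Python sketch. It is not how Chaos Monkey or GameDay work internally; every name in it (`with_fault_injection`, `call_with_retry`, `fetch_status`, `InjectedFault`) is hypothetical and chosen only for illustration. It assumes a service call that can be wrapped so that a chosen fraction of calls fail on purpose, and a caller that retries.

```python
import random


class InjectedFault(Exception):
    """Raised by the fault injector in place of a real failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Call func, retrying on injected faults up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


def fetch_status():
    # Hypothetical stand-in for a call to a real dependency.
    return "ok"


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the drill is repeatable
    flaky_fetch = with_fault_injection(fetch_status, failure_rate=0.3, rng=rng)
    survived = 0
    for _ in range(1000):
        try:
            call_with_retry(flaky_fetch, attempts=3)
            survived += 1
        except InjectedFault:
            pass  # all retries hit an injected fault
    print(f"{survived} of 1000 requests survived injected faults")
```

Running a drill like this against a test environment shows whether the retry logic actually absorbs failures, before a real outage forces the question.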

Robbins uses a similar approach, called GameDay, to help teams "create resiliency through destruction."

"It's a function of people first, then technology," he said. "We depend on services that should never fail. These exercises cause people to prepare. You build confidence in your responsibility to respond to failure. This is the difference between companies that succeed at scale on the web and those that don't."

Automation is a critical component of engineering for resiliency, said Robbins. Opscode makes configuration management software that allows organizations to automate large portions of their infrastructure.

"We have a bunch of manual processes which we need to automate," said Robbins. "This is really the future."


