Who Does and Doesn't Need Chaos Engineering?
Think of chaos engineering principles as a way to enhance reliability management.
As companies face intensifying pressure to maximize the reliability and performance of their software systems, chaos engineering has become a popular way to find potential problems before they cause outages. But should chaos engineering be the foundation for your reliability management strategy? Keep reading to find out.
What Is Chaos Engineering?
Chaos engineering is a reliability management technique in which engineers deliberately try to make systems fail in order to assess their resilience.
In other words, when you perform chaos engineering, you deliberately inject error conditions or failures into your systems to see how the systems respond. You might deliberately overwhelm an application with traffic to simulate an unexpected spike in demand, or disconnect a cluster of servers to emulate what would happen if a cloud data center suddenly went offline.
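To make the idea of injecting failures concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular chaos engineering tool works: `fetch_price`, `with_faults`, `fetch_with_retry` and `run_experiment` are hypothetical names, and the "service" is just a local function. The sketch makes a fraction of calls fail on purpose, then measures how often a simple retry-and-fallback mechanism still ends up returning a degraded response.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during the experiment."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a call to a downstream service.
    return {"item": item, "price": 10}


def fetch_with_retry(fetch, item, attempts=3):
    """The resilience mechanism under test: retry, then fall back."""
    for _ in range(attempts):
        try:
            return fetch(item)
        except InjectedFault:
            continue
    # Every attempt failed, so return a degraded response.
    return {"item": item, "price": None}


def run_experiment(failure_rate, calls=1000, seed=42):
    """Return the fraction of calls that ended in a degraded response."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_faults(fetch_price, failure_rate, rng)
    degraded = sum(
        1 for _ in range(calls)
        if fetch_with_retry(flaky, "widget")["price"] is None
    )
    return degraded / calls


if __name__ == "__main__":
    for rate in (0.1, 0.5, 0.9):
        print(f"failure rate {rate}: degraded fraction {run_experiment(rate)}")
```

With a failure rate of 0.5 and three attempts, roughly one call in eight should come back degraded, since all three attempts must fail. A real experiment would inject faults into actual infrastructure rather than a wrapped function, but the shape is the same: introduce a controlled failure, then observe what users would have seen.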
What Problems Do Chaos Engineering Principles Solve?
The big idea behind chaos engineering principles is that by proactively stress-testing your applications, services and infrastructure, you can find and address weak points that you would not otherwise notice until an unplanned disruption occurred. You can find and manage chaotic events in a planned and proactive way, rather than merely waiting for something to go wrong and only then reacting.
Chaos Engineering vs. Performance Testing and Synthetic Monitoring
In some ways, chaos engineering is just a trendy term for practices that teams have been using for years.
Chaos engineering principles are similar to (indeed, some would call them a form of) performance testing, in which engineers place a load on IT resources and evaluate how they respond. Chaos engineering is also comparable to synthetic monitoring, a form of testing that uses simulated transactions to assess application reliability and performance.
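For comparison, the load-and-measure pattern behind performance testing can be sketched in a few lines of Python. This is a rough illustration, not a replacement for a real load-testing tool: `handle_request` is a hypothetical stand-in for the system under test, and the request count and concurrency level are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(n):
    # Hypothetical stand-in for the system under test.
    time.sleep(0.001)
    return n * 2


def timed_call(n):
    """Call the handler once and return how long it took, in seconds."""
    start = time.perf_counter()
    handle_request(n)
    return time.perf_counter() - start


def load_test(requests=200, concurrency=20):
    """Send requests from a pool of worker threads and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(requests)))
    return {
        "count": len(latencies),
        "p50": latencies[len(latencies) // 2],
        "p95": latencies[int(len(latencies) * 0.95)],
        "max": latencies[-1],
    }


if __name__ == "__main__":
    print(load_test())
```

The sketch reports latency percentiles rather than an average because a small number of very slow requests is often the first sign that a system is struggling under load.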
Some folks would contend that chaos engineering is distinct from performance testing and synthetic monitoring because chaos engineering is about more than just running formal tests. For example, it could take the form of deliberately pulling the plug on a server or DDoS-ing your own network, scenarios that you could not realistically simulate using test automation frameworks.
When Should You Use Chaos Engineering Principles?
If your IT team already has a healthy reliability management strategy in place that is based on foundational techniques like performance testing and synthetic monitoring, adding chaos engineering to your routine can be a good way to augment it. Chaos engineering principles may help you identify issues that your testing routines are not catching. Simply planning chaos engineering workflows can also be helpful as a means of stepping back to think through the fundamentals of your systems and the reliability weak points inherent in them.
That said, if you aren't yet doing basic performance testing, synthetic monitoring and application performance management, you'll want to start with those practices before jumping into chaos engineering. Think of chaos engineering principles as a way to enhance reliability management, rather than the first step toward building reliable systems.
How Hard Is It to Implement Chaos Engineering?
In most cases, adding chaos engineering to your reliability management strategy doesn't require a major overhaul of your IT environment. You may want to adopt a tool designed to execute chaos engineering tests, such as Gremlin, Chaos Monkey or the chaos engineering tools being rolled out by cloud vendors for use on their platforms. But, otherwise, you can stick with your existing tool sets and application architectures.
Note, too, that chaos engineering principles can be applied to any type of environment. Whether you're running apps on-prem, in a single cloud, in multiple clouds or in a hybrid cloud, you can take advantage of chaos engineering.
The Limitations of Chaos Engineering Principles
Chaos engineering principles have received a lot of attention during the past decade. Indeed, it can be easy to become lost in the hype without appreciating the limitations of chaos engineering, including:
Limited coverage of problems: Chaos engineering reveals only the problems that you design it to reveal. It doesn't magically uncover every weakness lurking in your environment. If you forget this, you can end up with a false sense of confidence.
Doesn't solve problems: Chaos engineering can help you find problems, but it doesn't resolve them. That's a separate process. And while it may be fun to brag about how you ran cool chaos engineering tests, those tests don't actually achieve much unless they result in meaningful improvements to your systems.
Doesn't pinpoint problem sources: Detecting a problem through chaos engineering doesn't necessarily mean you'll fully understand the problem or know its root cause, especially in modern, distributed environments where multiple layers of infrastructure host dozens of microservices. You'll need to do extra work, probably using methods like distributed tracing, to get to the root of an issue.
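To illustrate the last limitation, the Python sketch below shows in miniature why tracing helps. It is a toy model under stated assumptions: the `Trace` class, the `inventory` and `checkout` services and the `first_failure` helper are all hypothetical, and a real system would rely on a tracing library rather than a hand-rolled list of spans. Each service records a span on a shared trace, so after an injected fault you can see which hop failed first instead of only seeing that checkout broke.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects one span per service touched by a single request."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, service, ok, detail=""):
        self.spans.append({"service": service, "ok": ok, "detail": detail})


def inventory(trace, broken):
    # Hypothetical downstream service; `broken` simulates an injected fault.
    if broken:
        trace.record("inventory", False, "injected fault: database unreachable")
        raise RuntimeError("inventory unavailable")
    trace.record("inventory", True)
    return 3


def checkout(trace, broken_inventory=False):
    # Hypothetical front-end service that depends on inventory.
    try:
        stock = inventory(trace, broken_inventory)
    except RuntimeError:
        trace.record("checkout", False, "upstream call failed")
        return None
    trace.record("checkout", True)
    return stock


def first_failure(trace):
    """Return the earliest failed span, or None if every span succeeded."""
    for span in trace.spans:
        if not span["ok"]:
            return span
    return None


if __name__ == "__main__":
    trace = Trace()
    checkout(trace, broken_inventory=True)
    print(trace.trace_id, first_failure(trace))
```

Without the trace, the only visible symptom is a failed checkout. With it, the earliest failed span points at the inventory service, which is where the investigation should start.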
Learn More about Chaos Engineering
Chaos engineering was made popular largely through Netflix, whose experience using chaos engineering to manage its own services makes for an informative read. You can also check out the Chaos Engineering book by Casey Rosenthal and Nora Jones for a truly deep dive into the topic.