Chaos Engineering tests large, complex cloud-based systems by injecting extreme and turbulent conditions, with the goal of building more resilient systems

Advances in large-scale, distributed software systems are changing the game for software engineering.  Now organizations of all sizes are leveraging the power of the cloud by hosting their data, applications, and services in shared data centers.

 

Despite their extraordinary uptime, services from AWS and Azure can and do fail. Data centers lose power. Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. These outcomes, compounded by rare but disruptive real-world events that affect production environments, make distributed systems inherently chaotic.

 

Chaos Engineering is “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” You want to push your cloud infrastructure to its theoretical limits in order to understand how it will react to heavy traffic or unexpected failures.

 

Chaos engineering takes the complexity of that system as a given and tests it holistically by simulating extreme, turbulent, or novel conditions and observing how the system responds and performs. What happens if a disk server suddenly goes down, or if network traffic suddenly spikes because of a DDoS attack? What happens if both happen at the same time? Once an engineering team has that data, it can use the feedback to redesign the system to be more resilient.
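The question of combined failures can be made concrete with a toy capacity model. The sketch below is only an illustration under assumed numbers: the `Scenario` class, the node count, the per-node capacity, and the traffic figures are all hypothetical and do not describe any real system. It shows how a system might survive each fault on its own and still degrade when both occur together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A hypothetical turbulent condition to inject into the toy model."""
    name: str
    nodes_down: int = 0
    traffic_multiplier: float = 1.0


def served_fraction(scenario, nodes=4, capacity_per_node=300, baseline_rps=600):
    """Fraction of incoming requests the toy system can serve.

    All numbers are assumptions chosen for illustration only.
    """
    capacity = max(nodes - scenario.nodes_down, 0) * capacity_per_node
    demand = baseline_rps * scenario.traffic_multiplier
    return min(1.0, capacity / demand)


SCENARIOS = [
    Scenario("steady state"),
    Scenario("one node down", nodes_down=1),
    Scenario("traffic spike", traffic_multiplier=1.8),
    Scenario("node down + spike", nodes_down=1, traffic_multiplier=1.8),
]


def run_experiments(scenarios=SCENARIOS):
    """Evaluate every scenario and report the served fraction for each."""
    return {s.name: round(served_fraction(s), 3) for s in scenarios}


if __name__ == "__main__":
    for name, fraction in run_experiments().items():
        print(f"{name:20s} served={fraction:.1%}")
```

With these assumed numbers, the model absorbs a lost node or a traffic spike individually, but serves only about 83% of requests when both hit at once. A real experiment would measure this on the live system rather than compute it from a formula.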

 

The goal of Chaos Engineering is to learn how to design and build apps and services that can withstand havoc while still delivering a good customer experience. Every chaos experiment ultimately aims to expose weaknesses before they turn into service outages or openings for cyberattacks.

 

Insights gained from a chaos experiment should inform IT decision making about cloud architecture. For example, resources may be scaled differently, or firewall rules may be changed to close potential vulnerabilities.

 

Chaos experiments also give companies a chance to see how their staff will react in a time of crisis. With cloud-based systems, unplanned fire drills happen all the time because of issues with hardware, software, or networking. Chaos experiments can help identify bottlenecks and problem points in any incident response process.

 

Back in 2010, Netflix was one of the first businesses to build its entire product offering around a cloud-based infrastructure. The company deployed its video streaming technology in data centers around the world in order to deliver content at high speed and quality. But Netflix engineers realized that they had little control over the back-end hardware they were using in the cloud. Thus, Chaos Engineering was born.

 

The first chaos tool that Netflix built was called Chaos Monkey, and it had a simple purpose. The tool would randomly select a server node within the company’s cloud platform and completely shut it down. The idea was to simulate the kind of random server failures that happen in real life. Netflix believed that the only way to be prepared for hardware issues was to initiate some deliberately.
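The core idea of picking a random node and shutting it down can be sketched in a few lines. This is not Netflix's implementation: the `fleet` dictionary, the `fake_terminate` callback, and the function names are hypothetical stand-ins for a real cloud API, and the whole thing runs against in-memory data only.

```python
import random


def pick_victim(instances, rng=random):
    """Choose one running instance at random, or None if none are running."""
    running = sorted(name for name, state in instances.items() if state == "running")
    return rng.choice(running) if running else None


def unleash_monkey(instances, terminate, rng=random, dry_run=True):
    """Select a random running instance and terminate it unless dry_run is set.

    `terminate` is a caller-supplied callback; in a real setup it would wrap
    a cloud provider's API, which this sketch does not attempt to model.
    """
    victim = pick_victim(instances, rng)
    if victim is None:
        return None
    if not dry_run:
        terminate(victim)
    return victim


# Hypothetical in-memory fleet used in place of real servers.
fleet = {"api-1": "running", "api-2": "running", "api-3": "stopped"}


def fake_terminate(name):
    """Stand-in for a real shutdown call: just flips the recorded state."""
    fleet[name] = "terminated"


if __name__ == "__main__":
    victim = unleash_monkey(fleet, fake_terminate, dry_run=False)
    print(f"terminated: {victim}")
    print(fleet)
```

Defaulting to `dry_run=True` is a deliberate choice in this sketch: the caller has to opt in before anything is actually shut down, which keeps the blast radius of an accidental run at zero.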
