Advances in large-scale, distributed software systems are changing the game for software engineering. Now organizations of all sizes are leveraging the power of the cloud by hosting their data, applications, and services in shared data centers.
Despite their extraordinary uptime, services from AWS and Azure can—and do—fail. Data centers lose power. Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.
Chaos Engineering is “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” You want to push your cloud infrastructure to its theoretical limits in order to understand how it will react to heavy traffic or unexpected failures.
Chaos engineering takes the complexity of that system as a given and tests it holistically by simulating extreme, turbulent, or novel conditions and observing how the system responds and performs. What happens if a disk server suddenly goes down, or if network traffic suddenly spikes because of a DDoS attack? What happens if both happen at the same time? Once an engineering team has that data, it can use the feedback to redesign the system to be more resilient.
The goal of Chaos Engineering is to help you understand how to design and build apps and services that can withstand havoc while still delivering a good customer experience. Avoiding service outages and cyberattacks is the key goal of any chaos experiment.
The analysis gained from a chaos experiment should inform IT decision making about cloud architecture. For example, resources may be scaled differently, or firewall rules may be changed to shore up potential vulnerabilities.
Chaos experiments also offer a great opportunity for companies to see how their staff react in a time of crisis. With cloud-based systems, fire drills happen all the time due to issues with hardware, software, or networking. Chaos experiments can help identify bottlenecks and problem points in any incident response process.
Back in 2010, Netflix was one of the first businesses to build their entire product offering around a cloud-based infrastructure. They deployed their video streaming technology in data centers around the world in order to deliver content at a high speed and quality level. But what Netflix engineers realized was that they had little control over the back-end hardware they were using in the cloud. Thus, Chaos Engineering was born.
The first chaos tool Netflix built was called Chaos Monkey, and it had a simple purpose. It would randomly select a server node within the company’s cloud platform and completely shut it down. The idea was to simulate the kind of random server failures that happen in real life. Netflix believed that the only way they could be prepared for hardware issues was to initiate some themselves.
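To make the idea concrete, here is a minimal sketch of what a Chaos Monkey-style tool does. This is not Netflix’s actual implementation; the instance list and the terminate step are placeholders for a real cloud provider API.

```python
import random

# A minimal sketch of the Chaos Monkey idea -- not Netflix's implementation.
# The instance IDs and the terminate step are hypothetical placeholders.
instances = ["i-0a1b", "i-1c2d", "i-2e3f", "i-3g4h"]

def terminate(instance_id):
    """Placeholder: a real tool would call the cloud provider's API here."""
    print(f"Terminating {instance_id} to simulate a random server failure")

victim = random.choice(instances)   # pick one node at random
terminate(victim)                   # shut it down and watch how the system copes
```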
Chaos Engineering Principles and Practices
A precondition to starting Chaos Engineering is defining “normal” for your app so that any deviations can be detected by your monitoring tools. Define a “steady state” as some measurable output of the system that indicates normal behavior. Metrics can help: quantify the “additional cost or lost revenue per minute of downtime,” and gather qualitative feedback on how downtime impacts the customer experience. If you don’t have this data today, you have some homework to do.
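As a concrete illustration, the sketch below checks a single steady-state metric against a baseline. The metric, the sample values, and the three-standard-deviation threshold are illustrative assumptions, not prescriptions.

```python
import statistics

# Hypothetical per-minute measurements pulled from your monitoring tool
# (e.g. requests served per second during normal operation).
baseline_rps = [1180, 1205, 1190, 1220, 1198]   # "steady state" sample
observed_rps = 1102                              # value seen during the experiment

mean = statistics.mean(baseline_rps)
stdev = statistics.stdev(baseline_rps)

# Flag a deviation if the observed value drifts more than 3 standard
# deviations from the steady-state mean. The threshold is an assumption,
# not a universal rule -- tune it to your own metrics.
if abs(observed_rps - mean) > 3 * stdev:
    print(f"Deviation from steady state: {observed_rps} vs mean {mean:.0f}")
else:
    print("System is within its defined steady state")
```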
Every chaos experiment should begin with a hypothesis, where the team asks what might happen if their cloud-based platform experienced an issue or outage. Then a test should be designed with as small a scope as possible while still providing helpful analysis.
Once you’ve done that, you can start running experiments. The most popular types include:
- Introduce variables that reflect real-world events: servers that crash, hard drives that malfunction, network connections that are severed, and so on. Injecting a fault into the stack during normal traffic exposes what happens when you lose anything from a single service to an entire Availability Zone.
- Inject a request failure to artificially introduce latency, exceptions, or other abnormal behavior from components as they process a modified request (see the sketch after this list).
- Send an overwhelming number of requests to your app or service to see how resilient it is. As with a DoS or DDoS attack, the requests involved may be valid or faulty.
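The sketch below illustrates the request-failure style of experiment referenced above: it wraps a stand-in downstream call with injected latency and random exceptions so you can observe how callers behave. The failure rate, delay, and function names are assumptions for illustration only.

```python
import random
import time

# Hypothetical fault-injection wrapper: adds latency and random exceptions
# to a downstream call so you can observe how callers cope. The probabilities
# and delays are illustrative, not recommendations.
FAILURE_RATE = 0.1      # 10% of calls raise an error
ADDED_LATENCY_S = 0.5   # half a second of extra latency per call

def call_downstream(payload):
    """Stand-in for a real service call."""
    return {"status": "ok", "echo": payload}

def chaotic_call(payload):
    time.sleep(ADDED_LATENCY_S)              # request-level latency injection
    if random.random() < FAILURE_RATE:       # request-level failure injection
        raise ConnectionError("injected fault: downstream unavailable")
    return call_downstream(payload)

if __name__ == "__main__":
    successes = failures = 0
    for i in range(20):
        try:
            chaotic_call({"request_id": i})
            successes += 1
        except ConnectionError:
            failures += 1
    print(f"{successes} calls succeeded, {failures} failed under injected faults")
```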
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
An essential principle of chaos engineering is minimizing the blast radius – the potential impact of the experiment you’re running. Instead of testing on every host in our production environment, we can start with a single host in our test environment. This allows us to grow our confidence in the system and our understanding in lockstep with scale and the potential risk.
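One way to keep the blast radius small in practice is to stage the experiment, starting with a single test host and widening the target set only once the steady state holds. The hostnames, stage sizes, and `run_experiment` placeholder below are hypothetical.

```python
import random

# Hypothetical rollout plan for widening an experiment's blast radius.
# Host names and stage sizes are made up for illustration.
TEST_HOSTS = ["test-host-01"]
CANARY_HOSTS = ["prod-host-01"]
PROD_HOSTS = [f"prod-host-{i:02d}" for i in range(1, 21)]

STAGES = [
    ("test environment, single host", TEST_HOSTS),
    ("production canary", CANARY_HOSTS),
    ("production, 25% sample", random.sample(PROD_HOSTS, k=len(PROD_HOSTS) // 4)),
]

def run_experiment(hosts):
    """Placeholder: inject the fault on the given hosts and return True
    if the system stayed within its defined steady state."""
    print(f"  injecting fault on: {hosts}")
    return True

for name, hosts in STAGES:
    print(f"Stage: {name}")
    if not run_experiment(hosts):
        print("  steady state violated -- stop and fix before widening the scope")
        break
```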
Typically, chaos experiments are run for a limited period of time and are not an everyday activity. That’s because they require a significant amount of upfront planning and team participation among developers, quality assurance engineers, and cloud architects.
Challenges to Implementing Chaos Engineering
One of the challenges is controllability. Faults can be injected in a matter of seconds, but doing so in a way that maximizes the learning while minimizing the risks and pain takes experience.
Getting buy-in from stakeholders is another hurdle: purposefully breaking working things in a live production environment runs counter to operations’ mission of keeping them running.
Testing can also be very political. Finding the points of failure in a system might force deep conversations about a particular software architecture and its robustness in the face of tough situations. A particular company might be deeply invested in a specific technical roadmap (e.g. microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.
Future Trends
In the future, Chaos Engineering will become increasingly autonomous and intelligent. Netflix engineers have created automation tools that randomly inject different types of faults and latency into their systems to continually test resiliency. Since the release of Chaos Monkey in 2011, Netflix has developed an entire Simian Army to build confidence in its ability to recover from failure. Nora Jones, a senior software engineer at Netflix and Chaos Day speaker, explained that chaos experiments are automatically enabled for new services.
Artificial intelligence and machine learning tools will help automate much of the analysis and recovery work. These smart systems will be better at scanning networks and cloud environments to monitor performance and stability during tests. They may even be able to identify threats or dependencies that were previously unknown.