Chaos Engineering As A Service with AWS Fault Injection Service
In 2011, Netflix started migrating its infrastructure from a private cloud to the AWS cloud.
While beginning this journey to the cloud, Netflix rethought the way it designed infrastructure. The purpose of this new approach was to move from a development model that assumed no outages to a model designed to be outage-proof.
This change of mindset requires testing the infrastructure, to be sure that its design really is outage-proof. To do so, Netflix created a tool called “Chaos Monkey” and popularized “chaos engineering” along the way.
Chaos engineering concepts
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Multiple application layers can be tested with chaos engineering tools:
- Infrastructure (hypervisor)
- Network (latency, outage)
- The application itself (missing modules, unsupported runtime)
- Parameters (missing parameter, wrong format)
All of these are common concepts that can be applied in the AWS world.
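To make the network layer from the list above more concrete, here is a minimal sketch in plain Python. It wraps a call with artificial latency and random outages, then checks a steady-state hypothesis: the caller still gets the expected answer thanks to retries. The names `inject_faults`, `call_with_retry`, `fetch_price` and `steady_state_holds` are hypothetical and exist only for this example. Real chaos tooling usually injects faults from outside the application code.

```python
import random
import time
from functools import wraps


def inject_faults(latency_s=0.0, failure_rate=0.0, rng=None):
    """Decorator that delays a call and makes it fail at random (hypothetical helper)."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(latency_s)  # simulated network latency
            if rng.random() < failure_rate:
                raise ConnectionError("injected network outage")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def call_with_retry(func, attempts=5):
    """Call func until it succeeds or the attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as error:
            last_error = error
    raise last_error


# Half of the calls fail; the seeded generator keeps the run reproducible.
@inject_faults(latency_s=0.001, failure_rate=0.5, rng=random.Random(42))
def fetch_price():
    return 100


def steady_state_holds(runs=20):
    """Hypothesis: every request still returns the expected value despite the faults."""
    return all(call_with_retry(fetch_price, attempts=50) == 100 for _ in range(runs))
```

Calling `steady_state_holds()` runs the experiment. If the retry logic were removed, the hypothesis would no longer hold, which is exactly the kind of weakness a chaos experiment is meant to reveal.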
Chaos Engineering in AWS
In the previous section, we saw why chaos engineering emerged and which problems it addresses. In this section, we will see how to apply these concepts and tools to an infrastructure hosted on AWS.
AWS Well-Architected Framework
In 2018, AWS published its own framework for building performant, secure, resilient and cost-effective infrastructures. This framework is based on 5 pillars:
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimization
The Reliability pillar is the one we will focus on in a chaos engineering approach. Here is the introduction to this pillar according to the official documentation:
The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
To comply with the framework and its best practices, our infrastructure must be tested throughout “its total lifecycle”, from development to production. We assume your production environment is hosted on AWS and, by extension, your pre-production environment too.
In the next section, we will see how to run performance testing in the AWS world.
Why choose FIS
- Without this service, testing infrastructure reliability is a little bit tricky (the MacGyver way).
- It is a managed service. Use it and voilà.
- The AWS DDoS Simulation Testing Policy is hard to comply with.
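To give an idea of what “use it and voilà” can look like, here is a minimal sketch of an FIS experiment template built with Python and sent through boto3. It stops one EC2 instance selected by tag and restarts it after a delay. The role ARN, the tag, and the helper names `build_stop_instances_template` and `create_template` are assumptions made for this example. The stop condition is set to `none` to keep the sketch short; a CloudWatch alarm is a safer choice for a real experiment.

```python
def build_stop_instances_template(role_arn, tag_key, tag_value, restart_after="PT5M"):
    """Build the request body of an FIS experiment template (hypothetical helper)."""
    return {
        "description": "Stop one tagged EC2 instance, then restart it",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the actions
        "stopConditions": [{"source": "none"}],  # no automatic stop condition
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration after which FIS starts the instance again
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


def create_template(template):
    """Send the template to FIS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```

A possible usage, with a placeholder role ARN and tag: `create_template(build_stop_instances_template("arn:aws:iam::123456789012:role/fis-demo", "chaos", "ready"))`.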
Limits
- CloudFormation is obviously supported as an infrastructure as code provider, but Terraform is not at the time of writing. You can vote to make this feature a reality.
- It can’t be part of a CodePipeline or CodeDeploy pipeline. I’m a little bit disappointed: I think it’s a feature everybody wants, especially in order to be “DevOps” compliant and to run advanced tests just after releasing new features in production.
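As a possible workaround for the missing pipeline integration, a small script can start an experiment and wait for its result from any step able to run Python with AWS credentials. The sketch below is only an assumption about how this could be done. The function `run_experiment`, the polling interval, and the set of statuses treated as final are choices made for this example. The `fis` argument is expected to be a boto3 FIS client, created with `boto3.client("fis")`.

```python
import time
import uuid

# Statuses treated as final in this sketch.
FINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis, template_id, poll_seconds=10, timeout_seconds=900):
    """Start an FIS experiment from a template and wait until it finishes."""
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        experiment = fis.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in FINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```

A calling script could then fail the step when the returned status is not `completed`, for example by exiting with a non-zero code.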