Chaos Engineering As A Service with AWS Fault Injection Service

Jérémy Chauvet
3 min read · Jun 24, 2021

In 2011, Netflix started migrating its infrastructure from a private cloud to the AWS cloud.

At the start of this journey to the cloud, Netflix rethought the way it designed infrastructure. The purpose of this new approach was to move from a development model that assumed no outages to a model designed to be outage-proof.

This change of mindset requires testing the infrastructure to make sure its design really is outage-proof. To do so, Netflix created a tool called “Chaos Monkey” and popularized “chaos engineering” along the way.

Chaos engineering concepts

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Multiple application layers can be tested with chaos engineering tools:

  • Infrastructure (hypervisor)
  • Network (latency, outage)
  • Application itself (missing modules, unsupported runtime)
  • Parameter (missing parameter, wrong format)

All of these are common concepts that can be applied in the AWS world.
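To make the idea of injecting faults at a given layer more concrete, here is a minimal sketch in Python. It is not how Chaos Monkey or FIS work internally: it only illustrates the network case (latency, outage) by wrapping a call so that it is delayed and sometimes fails. All names (`chaos`, `InjectedFault`, `fetch_price`, `fetch_price_with_fallback`) are hypothetical and exist only for this illustration.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised in place of a real outage (hypothetical name)."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None):
    """Wrap a function so that calls are delayed and sometimes fail."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # simulated network latency
            if rng.random() < failure_rate:
                # simulated outage of the dependency
                raise InjectedFault(f"injected outage in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def fetch_price(item):
    """Stand-in for a call to a downstream service (hypothetical)."""
    return {"item": item, "price": 42}


def fetch_price_with_fallback(fetch, item):
    """The behaviour under test: degrade gracefully instead of crashing."""
    try:
        return fetch(item)
    except InjectedFault:
        return {"item": item, "price": None, "degraded": True}
```

Wrapping `fetch_price` with `chaos(failure_rate=1.0)` simulates a complete outage of the dependency. The experiment then consists of checking that `fetch_price_with_fallback` still returns a usable, degraded answer.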

Chaos Engineering in AWS

In the previous section, we saw why chaos engineering emerged and which problems it addresses. In this section, we will see how to apply these concepts and tools to an infrastructure hosted on AWS.

AWS Well-Architected Framework

In 2018, AWS published its own framework for building performant, secure, resilient and cost-effective infrastructures. This framework is based on five pillars:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization

The Reliability pillar is the one we will focus on for a chaos engineering approach. Here is how the official documentation introduces this pillar:

The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.

To comply with the framework and its best practices, our infrastructure must be tested throughout “its total lifecycle”, from development to production. We assume your production environment is hosted on AWS and, by extension, your pre-production environment too.

In the next section, we will see how to run this kind of resilience testing in the AWS world.

Why choose FIS

Limits

  • While CloudFormation is of course supported as an infrastructure-as-code provider, Terraform is not at the time of writing. You can vote to make this feature a reality.
  • FIS can’t be part of a CodePipeline or CodeDeploy pipeline. I’m a little disappointed: I think it’s a feature everybody wants, especially in order to be “DevOps” compliant and run advanced tests just after releasing new features in production.
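As a possible workaround for the second limit, a script can start an experiment through the FIS API right after a release. The sketch below relies on the `start_experiment` and `get_experiment` calls of the boto3 `fis` client. The function name `run_experiment`, the polling interval, the timeout and the set of final states are assumptions made for this illustration, and the experiment template is assumed to exist already.

```python
import time
import uuid

# Statuses treated here as final (an assumption; adjust if the API reports others).
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id, poll_seconds=10, timeout_seconds=900):
    """Start an FIS experiment from an existing template and wait for its final status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return experiment_id, state["status"]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"experiment {experiment_id} is still {state['status']}"
            )
        time.sleep(poll_seconds)


# Typical use, assuming AWS credentials and a real template ID:
#   import boto3
#   _, status = run_experiment(boto3.client("fis"), "<your-template-id>")
#   # exit with a non-zero code when status != "completed"
```

Passing the client as an argument keeps the function easy to try with a fake client before pointing it at a real account.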
