Chaos Engineering

Software testing does not stand still. As new technologies develop, a plethora of tools and approaches has appeared for improving the quality of software for the end user. Some are popular and on everyone’s lips, for example, the application of AI language models in testing or behavior-driven development. An approach is often popular because it is simple to implement and use. Other valuable approaches cover important aspects of testing but are harder to implement. In this article, I want to talk about one such approach that surfaced relatively recently, in 2012: chaos engineering.

Imagine you run a global web application that handles millions of operations per hour. A highly loaded system like this must stay available to its users, and various types of performance testing help achieve that. Testing lets you understand the capabilities of your system, but it does not protect you from sudden failures in the product. When a mass failure occurs, locating the source of the problem can be an arduous process. For example, if you have network and data transfer problems, the issue goes to the first support line, then to the second; you spend time reconfiguring how traffic reaches the server nodes, and then the development team may even have to change the code. Time spent uncovering the root cause costs you user time and loyalty, which can significantly damage both your company’s image and its revenue. In today’s world, a user is more likely to leave for another platform than to wait for you to fix the problem.

Apple Inc. addressed the problem of provoking random system failures as early as 1983. Software engineer Steve Capps developed “The Monkey”, a program that randomly generated user interface events at high speed, simulating a monkey frantically banging on the keyboard and moving and clicking the mouse.

Tech giants like Amazon and Google handled the challenge of testing fault tolerance against unexpected failures internally until Netflix released Chaos Monkey, an open-source solution for automated fault tolerance testing. Antonio García Martínez explains the idea behind the tool in his book Chaos Monkeys:

“Imagine a monkey entering a ‘data center’, these ‘farms’ of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”

Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases. 

Now we are ready to explore chaos engineering. According to the definition on the official community website, chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Your task is to recreate various out-of-the-ordinary situations and prepare your system to handle them automatically so that it keeps performing. Even where you cannot automate the response, such as redistributing traffic to other nodes when the network has problems, you can prepare instructions that let your support team respond quickly to unexpected failures. This approach to system uptime is very similar to what pilots do in an emergency: they follow checklists prepared in advance for the situations that may arise.
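To make the idea of an automatic response concrete, here is a minimal sketch of a router that redistributes traffic away from a failed node. It is a toy model under stated assumptions, not a production load balancer: the `Node` and `Router` classes and the node names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Router:
    """Round-robin router that skips nodes marked unhealthy (hypothetical)."""
    nodes: list
    _cursor: int = field(default=0, repr=False)

    def route(self) -> str:
        # Try each node at most once, starting from the cursor position.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._cursor % len(self.nodes)]
            self._cursor += 1
            if node.healthy:
                return node.name
        # No automatic remedy is left: this is where written instructions
        # for the support team take over.
        raise RuntimeError("no healthy nodes; escalate to the support runbook")


router = Router([Node("node-a"), Node("node-b"), Node("node-c")])
router.nodes[1].healthy = False  # simulate a network failure on node-b
served_by = [router.route() for _ in range(4)]
print(served_by)  # ['node-a', 'node-c', 'node-a', 'node-c']
```

The final `raise` marks the boundary between what the system handles on its own and what has to be covered by prepared instructions.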

To implement this practice in your development process, you need to understand the main principles of chaos engineering. To obtain meaningful results from an experiment, you need to:

  1. Define the steady state of your system in quantitative metrics, such as server utilization, traffic throughput, and response times. Clear metrics should tell you when the system is operating in a steady state.
  2. Formulate a hypothesis to test. In chaos engineering you will encounter the term “experiment” more often than “testing.” The fewer hypotheses you take on at once, the simpler and clearer the results will be.
    1. An example of a hypothesis for an experiment: the system automatically switches traffic to another node when one node becomes unavailable. The hypothesis should be based on real events that can happen in the system.
  3. Perform chaos emulation in a separate environment similar to production, or in low-traffic windows in production itself (the approach recommends the latter, as it is almost impossible to recreate the product’s full hardware and software stack elsewhere).
  4. Have support and development monitor the situation in real time. If a problem appears, localize it and make the necessary changes to the pipeline, instructions, hardware configuration, or system code.
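The four steps above can be sketched as a small experiment harness. This is a simulation under stated assumptions rather than a real setup: `FakeCluster`, its latency model, and the 100 ms threshold are all hypothetical stand-ins for your own system, metric, and tolerance.

```python
import random
import statistics


class FakeCluster:
    """Hypothetical stand-in for a real system: two nodes serving requests."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.up = {"node-a": True, "node-b": True}

    def request(self) -> float:
        """Return a simulated response time in ms, or raise if nothing is up."""
        alive = [name for name, ok in self.up.items() if ok]
        if not alive:
            raise ConnectionError("all nodes down")
        # Fewer live nodes means each one is busier, so latency grows.
        base = 40.0 * len(self.up) / len(alive)
        return base + self.rng.uniform(0.0, 10.0)


def measure(cluster, samples: int = 200) -> float:
    """Step 1: the steady-state metric, here the median response time in ms."""
    return statistics.median(cluster.request() for _ in range(samples))


def run_experiment(cluster, victim: str, threshold_ms: float) -> dict:
    baseline = measure(cluster)
    cluster.up[victim] = False        # Step 3: inject the failure
    try:
        during = measure(cluster)     # Step 4: observe while the fault is active
    finally:
        cluster.up[victim] = True     # always roll the fault back
    return {
        "baseline_ms": baseline,
        "during_ms": during,
        # Step 2: the hypothesis is that the metric stays under the threshold
        "hypothesis_holds": during <= threshold_ms,
    }


result = run_experiment(FakeCluster(), victim="node-b", threshold_ms=100.0)
print(result)
```

Rolling the fault back in a `finally` block is a deliberate choice: an experiment that fails halfway should never leave the system in its broken state.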

Although few chaos emulation tools are on the market, the existing ones cover the most common system vulnerabilities and needs.

Chaos Monkey is an open-source tool from Netflix that randomly shuts down virtual machines and containers in AWS-based environments on a configurable schedule to test the fault tolerance of a distributed system.
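The core idea, random termination inside a schedule, can be illustrated in a few lines. This sketch is not Chaos Monkey’s code or API: the function names, the weekday working-hours window, and the instance IDs are assumptions chosen for illustration, and the function only selects a victim rather than terminating anything.

```python
import random
from datetime import datetime


def in_chaos_window(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """Only act on weekdays during working hours, when engineers can respond."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, now, probability=1.0, rng=None):
    """Return one instance to terminate, or None if nothing should happen."""
    rng = rng or random.Random()
    if not instances or not in_chaos_window(now):
        return None
    if rng.random() >= probability:
        return None
    return rng.choice(list(instances))


group = ["i-001", "i-002", "i-003"]            # hypothetical instance IDs
monday_10am = datetime(2024, 1, 8, 10, 0)
saturday_10am = datetime(2024, 1, 13, 10, 0)

print(pick_victim(group, monday_10am, rng=random.Random(42)))
print(pick_victim(group, saturday_10am))       # None: outside the window
```

Restricting the window to hours when the team is at work means that any weakness the experiment exposes is found while people are available to fix it.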

Chaos Mesh is an open-source cloud-native tool for emulating different types of failures in Kubernetes. Its experiments can inject network latency, manipulate system time, stress resources, and more. It also has a user-friendly management interface.

Gremlin is a SaaS tool that lets you test a large number of situations, such as memory leaks, response latency, and full disks. It helps you anticipate failures, create custom scenarios for different situations, and change the direction of an attack while the test is running.

LitmusChaos is an open-source tool used mainly for cloud infrastructure deployed on Kubernetes. It lets you emulate failures in different layers of the system, including storage, networking, and compute.
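The kinds of faults these tools inject, such as added latency or dropped connections, can be illustrated at the application level. The sketch below does not use the API of any of the tools above. `inject_faults`, `call_with_retry`, and `fetch_profile` are hypothetical names, and real tools typically inject such faults at the infrastructure level rather than inside application code.

```python
import functools
import random
import time


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator that delays a call and sometimes makes it fail (hypothetical)."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)                  # simulated network delay
            if rng.random() < error_rate:
                raise ConnectionError("injected fault")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retry(func, attempts=3):
    """The resilience mechanism under test: retry on connection errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts:
                raise                             # out of attempts: give up


@inject_faults(latency_s=0.01, error_rate=0.3, rng=random.Random(1))
def fetch_profile():
    return {"user": "demo"}


print(call_with_retry(fetch_profile, attempts=5))
```

The point of such an experiment is not the fault itself but the mechanism around it: here, whether the retry logic keeps the caller working while roughly a third of calls fail.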

While the process of organizing chaos may seem complex and intimidating, it is not. What matters is following the basic principles of chaos engineering and preparing the right environment for experimenting with your application. Chaos engineering gives you much greater confidence that your product is fault-tolerant. In addition, experiments can reveal inefficiencies in infrastructure utilization, such as capacity that sits idle, which can yield savings on your company’s IT costs.

Chaos, therefore, is not about breaking the system but about identifying its shortcomings while you can still fix them, before they affect real users. If you’re considering building a chaos engineering practice for your existing solutions, our QA services team is always ready to assist.

Alexander Meshkov

Delivery QA Director at First Line Software

Alexander Meshkov is QA Delivery Director at FLS. He has over 10 years of experience in software testing, organization of the testing process, and test management. A frequent attendee and speaker at testing conferences, he actively engages in discussions and keeps up to date with the latest trends and advancements in the field.


