Chaos engineering

Testing
Leon Cordero

Tech companies like Netflix are leading the charge in applying chaos engineering to make sure their systems are stable and resilient. But is it necessary for every company, regardless of its size or the complexity of its systems?

The short answer is: it depends. Chaos engineering can be a wise investment if you run a sizable, critical system with millions of users. For smaller businesses with simpler systems, the extra expense and complexity might not be worthwhile.

Simple system

In 2015/2017, I worked at a mid-sized company with a website that served 2000 users per minute. We faced performance issues and broken links, which caused a drop in our Google ranking. I noticed that the feedback loop for our changes was extremely long: between 1 and 2 weeks, the time it took Google to process changes to the sitemap. To shorten this feedback loop, I built a crawler that downloaded the sitemap every night and checked all pages and their metadata.
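A nightly check of this kind could look roughly like the sketch below. It is not the original crawler: the function names (`parse_sitemap`, `fetch_status`, `find_broken_pages`) are made up for illustration, and it assumes a standard sitemaps.org XML file and treats any answer other than HTTP 200 as a problem worth reporting.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout and similar network errors.
        return None


def find_broken_pages(urls, fetch=fetch_status):
    """Return (url, status) pairs for every page that did not answer 200."""
    broken = []
    for url in urls:
        status = fetch(url)
        if status != 200:
            broken.append((url, status))
    return broken
```

Scheduled every night (with cron, for example), the output of `find_broken_pages` gives feedback within a day rather than after the one or two weeks it takes for a ranking change to show up. The `fetch` parameter is there so the check can be exercised without touching the network.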

Over time, this approach provided valuable insights into the system’s strengths and weaknesses, and the crawler evolved to replicate production traffic into QA. After one year, the sitemap had shrunk by 75%, with better ranking and more organic traffic coming into the website. The website became three times faster, and no new issues were introduced.

This crawler did a lot for the system. On the one hand, it was an automation that validated the correctness of the system. On the other hand, it brought chaos into it by simulating traffic spikes and putting the system under high pressure, from which the entire team learned a lot. Because the infrastructure was simple, the company did not need a complicated way of testing it.
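To make the traffic-spike idea concrete, here is a minimal sketch of how a list of URLs could be replayed as a concurrent burst. It is an illustration under assumptions rather than the tool described above: `replay_burst`, `summarize`, and the `concurrency` parameter are hypothetical names, and `fetch` stands for any callable that takes a URL and returns an HTTP status code.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def timed_fetch(url, fetch):
    """Call fetch(url) and return (url, status, elapsed_seconds)."""
    start = time.perf_counter()
    status = fetch(url)
    return url, status, time.perf_counter() - start


def replay_burst(urls, fetch, concurrency=50):
    """Request all urls through a thread pool to imitate a traffic spike."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map keeps the results in the same order as the input urls.
        return list(pool.map(lambda url: timed_fetch(url, fetch), urls))


def summarize(results):
    """Count non-200 answers and compute a nearest-rank 95th percentile latency."""
    if not results:
        return {"requests": 0, "errors": 0, "p95_seconds": None}
    latencies = sorted(elapsed for _, _, elapsed in results)
    errors = sum(1 for _, status, _ in results if status != 200)
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
    return {"requests": len(results), "errors": errors, "p95_seconds": p95}
```

Running the burst against a QA environment first, and raising `concurrency` step by step, keeps the experiment controlled while still showing where error rates and latencies start to degrade.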

The system’s simplicity made it easier to introduce chaos engineering. This might have been impossible had the company chosen a complicated distributed design. When I left the company, the system still had plenty of room to grow.

Complex system

Scaling up from a small website to a high-traffic platform is an entirely different scenario. In my current job at a FinTech company, every application we design must be both highly available and extremely performant. Slow systems have no place in the finance business.

We invest heavily in infrastructure and testing to make our systems as robust as possible. This expenditure is necessary to meet our customers’ demands and ensure the stability of our platform. In this case, a more sophisticated way of validating the system’s reliability is required.

Conclusion

Before investing money and time in chaos engineering, it’s critical to understand your system and its context, particularly your line of business and the SLAs for your applications. This will help you focus your efforts and get better outcomes. Remember that keeping complicated systems under control can be difficult.