Reliability

All the other ISE Engineering Fundamentals work towards a more reliable infrastructure. Automated integration and deployment ensure code is properly tested and help remove human error, while slow releases build confidence in the code. Observability helps pinpoint errors more quickly when they arise, so the system can get back to a stable state, and so on.

However, there are some additional steps we can take, that don’t neatly fit into the previous categories, to help ensure a more reliable solution. We’ll explore these below.

Remove “Foot-Guns”

Prevent your dev team from shooting themselves in the foot. People make mistakes; any mistake made in production is not the fault of that person, it is the collective fault of the system for not preventing that mistake from happening.

Check out the list below for some common tooling to remove these foot-guns:

If a user ever makes a mistake, don’t ask: “how could somebody possibly do that?”, do ask: “how can we prevent this from happening in the future?”

Autoscaling

Whenever possible, leverage autoscaling for your deployments. Vertical autoscaling can scale your VMs by tuning parameters like CPU, disk, and RAM, while horizontal autoscaling can tune the number of running images backing your deployments. Autoscaling can help your system respond to inorganic growth in traffic, and prevent failing requests due to resource starvation.

Note: In environments like K8s, both horizontal and vertical autoscaling are offered natively. The VMs backing each Pod, however, may also need autoscaling to handle an increase in the number of Pods.

It should also be noted that the parameters that affect autoscaling can be difficult to tune. Typical metrics like CPU or RAM utilization, or request rate may not be enough. Sometimes you might want to consider custom metrics, like cache eviction rate.
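As a concrete illustration, the proportional rule used by horizontal autoscalers (including the Kubernetes HPA) can be sketched in a few lines. The metric values, targets, and replica bounds here are hypothetical; the metric could just as well be a custom one like cache eviction rate:

```python
import math

def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Proportional scaling rule: the replica count grows with the ratio of
    the observed metric to its target value, clamped to a configured range."""
    if observed <= 0:
        return current  # no signal; hold steady
    desired = math.ceil(current * observed / target)
    return max(min_replicas, min(max_replicas, desired))

# e.g. 4 replicas observing 90% utilization against a 60% target scale out to 6
```

The clamp matters in practice: an unbounded scale-out can turn a metrics glitch into a surprise bill, and a floor above one keeps the service from scaling to a single point of failure.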

Load Shedding & DoS Protection

Often we think of Denial of Service (DoS) attacks as the act of a malicious actor, so we place some load shedding at the gates to our system and call it a day. In reality, many DoS incidents are unintentional and self-inflicted. A bad deployment takes down a cache, and downstream services get hammered. Polling from a distributed system synchronizes and produces a thundering herd. A misconfiguration causes an error that triggers clients to retry uncontrollably. Requests append to a stored object until it is so big that future reads crash the server. The list goes on.
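One common client-side defense against retry storms and thundering herds is exponential backoff with jitter. A minimal sketch of the "full jitter" variant; the base delay, cap, and attempt limit are placeholder values:

```python
import random
import time

def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' backoff: a random delay in [0, min(cap, base * 2^attempt)].
    Randomizing the delay keeps a fleet of retrying clients from synchronizing."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(operation, max_attempts: int = 5):
    """Retry a failing operation with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(backoff_with_jitter(attempt))
```

Capping both the delay and the attempt count is what prevents a transient error from turning into an unbounded, self-inflicted load spike.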

Follow these steps to protect yourself:

These types of errors can result in Cascading Failures, where a non-critical portion of your system takes down the entire service. Plan accordingly, and make sure to put extra thought into how your system might degrade during failures.
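Load shedding at the edge can be as simple as a token bucket: admit requests while tokens remain and reject the rest quickly, so overload degrades a non-critical path instead of cascading. A minimal sketch; the rate and burst values are illustrative:

```python
import time

class TokenBucket:
    """Token-bucket load shedder: each admitted request spends one token;
    tokens refill at a steady rate up to a burst capacity."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate           # tokens added per second
        self.capacity = burst      # maximum bucket size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request (e.g. respond with HTTP 429)
```

Rejecting early and cheaply is the point: a fast 429 costs far less than a slow timeout, and it gives well-behaved clients a clear signal to back off.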

Backup Data

Data gets lost, corrupted, or accidentally deleted. It happens. It can happen in the application stack, with code deleting or corrupting data, or at the storage layer, by losing volumes or encryption keys. Take data backups to help get your system back online as soon as possible.

Consider things like:

Look into the difference between snapshot and incremental backups. A good policy might be to take incremental backups on a period of N, and a snapshot backup on a period of M (where N < M).
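To make the N/M idea concrete, here is a sketch of such a policy with hypothetical periods (incremental daily, snapshot weekly), along with the chain of backups a restore would need:

```python
def backup_kind(day: int, incremental_every: int = 1, snapshot_every: int = 7) -> str:
    """Which backup to take on a given day; the full snapshot wins
    whenever both kinds are due (here N = 1 day, M = 7 days)."""
    if day % snapshot_every == 0:
        return "snapshot"
    if day % incremental_every == 0:
        return "incremental"
    return "none"

def restore_chain(day: int, snapshot_every: int = 7) -> list:
    """A restore replays the most recent snapshot plus every
    incremental taken after it, in order."""
    last_snapshot = (day // snapshot_every) * snapshot_every
    return list(range(last_snapshot, day + 1))
```

The length of the restore chain is the trade-off made explicit: a smaller M means faster restores at the cost of more frequent (and larger) full backups.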

Target Uptime & Failing Gracefully

Systems cannot target 100% uptime. There are too many factors in today’s software systems to achieve this, many outside of our control. Even a service that never gets updated and is 100% bug-free will fail. Upstream DNS servers have issues all the time. Hardware breaks. Power goes out; backup generators fail. The world is chaotic. Good services target some number of “nines” of uptime. For example, 99.99% uptime means the system has a “budget” of about 4 minutes and 22 seconds of downtime each month. Some months might achieve 100% uptime, which means that budget gets rolled over to the next month. What uptime means is different for everybody, and it is up to each service to define.
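The arithmetic behind such a budget is simple enough to sketch. This uses the average Gregorian month as the period; exact figures vary slightly with the convention chosen:

```python
def downtime_budget_seconds(availability_pct: float,
                            period_days: float = 30.4375) -> float:
    """Allowed downtime per period for a given availability target.
    30.4375 days is the average month (365.25 / 12)."""
    period_seconds = period_days * 24 * 3600
    return period_seconds * (1 - availability_pct / 100)

# downtime_budget_seconds(99.99) is about 263 seconds per average month,
# i.e. the roughly four-and-a-half-minute budget quoted above
```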

A good practice is to use any leftover budget at the end of the period (e.g. year, quarter) to intentionally take the service down and ensure that the rest of your systems fail as expected. Oftentimes other engineers and services come to rely on the additional achieved availability, and it can be healthy to verify that dependent systems fail gracefully.

We can build graceful failure (or graceful degradation) into our software stack by anticipating failures. Some tactics include:
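One widely used tactic is a circuit breaker with a fallback: after repeated failures, stop calling the broken dependency for a while and serve a degraded response instead. A minimal sketch; the threshold, cooldown, and fallback are placeholders:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens, and calls use the fallback for `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()   # degrade gracefully; don't hammer the dependency
            self.opened_at = None   # cooldown over: try the real call again
        try:
            result = operation()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
```

The fallback might be a cached value, a default, or a reduced feature set; the key property is that a failing dependency costs callers a fast degraded answer rather than a slow error.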

Practice

None of the above recommendations will work if they are not tested. Your backups are meaningless if you don’t know how to mount them. Your cluster failover and other mitigations will regress over time if they are not tested. Here are some tips to test the above:

Maintain Playbooks

No software service is complete without playbooks to navigate the developers through unfamiliar territory. Playbooks should be thorough and cover all known failure scenarios and mitigations.

Run Maintenance Exercises

Take the time to fabricate scenarios and run a D&D-style campaign to solve your issues. This can be as elaborate as spinning up a new environment and injecting errors, or as simple as asking the “players” to navigate to a dashboard and describe what they would see in the fabricated scenario (small amounts of imagination required). The playbooks should easily guide the user to the correct solution/mitigation. If not, update your playbooks.

Chaos Testing

Leverage automated chaos testing to see how things break. You can read this playbook’s article on fault injection testing for more information on developing a hypothesis-driven suite of automated chaos tests. The following list of chaos testing tools, as well as this section in the article linked above, have more details on available platforms and tooling for this purpose:

Analyze All Failures

Writing up a post-mortem is a great way to document the root causes of, and action items for, your failures. Post-mortems are also a great way to track recurring issues and build a strong case for prioritizing fixes.

This can even be tied into your regular Agile retrospectives.