Containing Impact of Failure

Let's look at some techniques to contain the impact of failure in distributed systems.

Most of the techniques described in this chapter are used to identify failure and recover from it. It’s also useful to contain the impact of a failure, so we will now discuss some techniques for this purpose. This can be done via technical means, such as fault isolation.

Fault isolation

One common way to contain the impact of a failure is to deploy an application redundantly in multiple facilities that are physically isolated and have independent failure modes. When an incident affects one of these facilities, the others are not impacted and continue functioning as normal.

Note: Fault isolation introduces a trade-off between availability and latency, since greater physical separation increases network distance and therefore latency.

Balancing the trade-off between availability and latency

Most cloud providers offer multiple physically isolated datacenters that are all located close to each other within a single region to strike a good balance in this trade-off. These are commonly known as availability zones.
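
As a rough illustration of how redundancy across isolated facilities contains a failure, the sketch below tries a replica in each zone in turn and returns the first successful reply. The zone names (zone-a, zone-b), the function call_with_failover, and the replica callables are all hypothetical stand-ins for real network calls, not part of any cloud provider's API.

```python
from typing import Callable, Dict


def call_with_failover(replicas: Dict[str, Callable[[], str]]) -> str:
    """Try each zone's replica in order and return the first successful reply."""
    failed_zones = []
    for zone, call_replica in replicas.items():
        try:
            return call_replica()
        except ConnectionError:
            # This zone is unreachable; the failure stays contained to it.
            failed_zones.append(zone)
    # Only when every zone has failed does the caller see an error.
    raise RuntimeError(f"all zones failed: {failed_zones}")


def unreachable_replica() -> str:
    # Hypothetical replica in a zone that is affected by an incident.
    raise ConnectionError("zone-a is unreachable")


def healthy_replica() -> str:
    # Hypothetical replica in a zone that is unaffected.
    return "served from zone-b"


reply = call_with_failover({"zone-a": unreachable_replica, "zone-b": healthy_replica})
print(reply)
```

Because the zones are assumed to fail independently, an incident in one of them only costs the caller one extra attempt rather than a failed request.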

Graceful degradation

Graceful degradation is another technique to contain failure, where an application reduces the quality of its service to avoid failing completely.

For example, suppose a service provides the capabilities of a search engine by calling downstream services, and one of these services, which supplies the advertisements for each query, is having issues. The top-level service can render the search results without any advertisements instead of returning an error message or a completely empty response.
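
The search example above could look roughly like the following sketch. The function search and the two downstream callables (fetch_results, fetch_ads) are hypothetical names chosen for illustration; a real service would make network calls with timeouts and would likely catch a narrower set of errors.

```python
from typing import Callable, Dict, List


def search(
    query: str,
    fetch_results: Callable[[str], List[str]],
    fetch_ads: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Return search results, dropping advertisements if the ad service fails."""
    # The results are essential: if this call fails, the error propagates.
    results = fetch_results(query)

    try:
        ads = fetch_ads(query)
        degraded = False
    except Exception:
        # The ad service is optional: degrade instead of failing the request.
        ads = []
        degraded = True

    return {"results": results, "ads": ads, "degraded": degraded}


def healthy_results(query: str) -> List[str]:
    # Hypothetical downstream service that is working normally.
    return [f"result for {query}"]


def failing_ads(query: str) -> List[str]:
    # Hypothetical downstream service that is having issues.
    raise TimeoutError("ad service unavailable")


response = search("distributed systems", healthy_results, failing_ads)
print(response)
```

The degraded flag is an assumption of this sketch: it lets the caller or a monitoring system notice that the response is of reduced quality even though the request succeeded.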

Techniques to contain failure are broadly categorized into two main groups:

  • Those performed at the client-side
  • Those performed at the server-side
