Clients of an application should be able to react properly to the backpressure that the application emits.

Retries

The most typical way to react to failures in a distributed system is to retry. Retries are performed on the assumption that a failure is temporary, so repeating the request is expected to have a better outcome.

Issue with retries

Retries can also have adverse effects. One of them is described below:

Overloading a service

Think about the whole architecture of the systems and the various applications involved to determine where retries will be performed. Performing retries at multiple levels can significantly amplify the traffic coming from customers, which can overload services and cause issues.

For example, let’s assume we have four services A, B, C, and D that call each other in order (A calls B, B calls C, and C calls D), as shown in the following illustration:

If each of the calling services A, B, and C makes up to three attempts for every failed request, then a temporary issue at service D can cause a single customer request to reach service D up to 3 × 3 × 3 = 27 times. This creates a lot of additional load on service D during a period when it’s already experiencing issues.
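The multiplication behind this figure can be sketched in a few lines of Python. The function name and parameters are hypothetical, and the model assumes the worst case, in which every attempt at every level fails and is repeated. Note that the result depends on whether "three" counts total attempts or retries on top of the first attempt.

```python
def worst_case_calls(retrying_services: int, attempts_per_service: int) -> int:
    """Upper bound on calls reaching the last service for one customer request.

    Each retrying service may issue `attempts_per_service` calls for every
    call it receives, so the effect multiplies along the chain.
    """
    return attempts_per_service ** retrying_services


# A, B, and C retry; D is the failing service at the end of the chain.
print(worst_case_calls(retrying_services=3, attempts_per_service=3))  # 27
# One initial attempt plus three retries is four attempts per service.
print(worst_case_calls(retrying_services=3, attempts_per_service=4))  # 64
```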

Mitigating the service overloading issue

The following approaches help overcome the service overloading issue:

Retry failed requests at the highest level possible

A common rule of thumb is to retry failed requests at the highest level possible, because that level usually has additional context about the business function of the request and whether it’s worth retrying.

Use exponential backoff when retrying a request

In this approach, the system waits progressively longer before each retry, typically multiplying the delay by a constant factor such as two. This gives the downstream system a better opportunity to recover from any temporary issues.

Note: Ideally, exponential backoff is also combined with some jitter, which is a small random variation added to each delay. This way, retries from the various servers of a service are spread out over time instead of arriving together, so they do not produce sudden spikes of traffic that can also cause overload issues.
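A minimal sketch of exponential backoff with jitter is shown below. The names `backoff_delay` and `call_with_retries`, the base delay, the cap, and the choice of drawing the delay uniformly between zero and the exponential ceiling are all assumptions made for illustration, not a prescribed implementation. A real client would also retry only on errors it knows to be transient, rather than on every exception.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based), with jitter."""
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return random.uniform(0, ceiling)          # jitter spreads clients apart


def call_with_retries(operation, max_attempts: int = 4, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable only so that the waiting can be replaced in tests; by default the function really sleeps between attempts.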

Use circuit breaker

By using a circuit breaker, clients of an application can also perform a form of load shedding that helps downstream applications recover.

A circuit breaker essentially monitors the percentage of failed requests. When a specific threshold is crossed, this is interpreted as a permanent failure of the downstream application. As a result, the circuit breaker rejects all the requests locally without sending them to the downstream application.

While it is rejecting requests, the circuit breaker still lets a few requests through periodically. If a good percentage of them succeed, it starts sending the full load again.
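The behavior described above can be sketched as a small class. Everything here is hypothetical: the class and parameter names, the default thresholds, and the simplification that a single successful trial request (rather than a percentage of several) is enough to resume sending load. The sketch is also not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call locally."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, min_calls=10,
                 reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failure ratio that trips the breaker
        self.min_calls = min_calls                  # calls needed before judging the ratio
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.calls = 0
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("rejected locally")
            # Timeout elapsed: let this one call through as a trial.
        try:
            result = operation()
        except Exception:
            self._record(success=False)
            raise
        self._record(success=True)
        return result

    def _record(self, success):
        if self.opened_at is not None:
            if success:
                # Trial succeeded: close the breaker and reset the counters.
                self.opened_at = None
                self.calls = self.failures = 0
            else:
                # Trial failed: stay open and restart the timer.
                self.opened_at = self.clock()
            return
        self.calls += 1
        if not success:
            self.failures += 1
        if (self.calls >= self.min_calls
                and self.failures / self.calls >= self.failure_threshold):
            self.opened_at = self.clock()
```

A client would wrap each downstream call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands for any hypothetical remote call, and treat `CircuitOpenError` as a signal to fall back or fail fast.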

Benefits

Circuit breakers are beneficial in two ways:

  • They give the downstream service a chance to recover from overload situations or other kinds of permanent failures.
  • They improve the customer experience by reducing request latency when the response from the downstream service is not necessary.

Note: An example of this technique was described previously when explaining the concept of graceful degradation.

Embed timeout hints

Clients can embed timeout hints in their requests to help downstream applications. These hints tell downstream applications when a response to a request is no longer useful. Downstream applications can then discard requests that have been waiting too long in message queues or in memory buffers due to resource exhaustion, which speeds up the processing of accumulated backlogs.
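As a rough sketch of the idea, a request can carry a deadline, and the downstream worker can skip anything whose deadline has already passed. The `Request` type, the `deadline` field, and the `drain` function are hypothetical names. The sketch also assumes the client and the worker share one clock; across machines, the hint would more likely be a remaining time budget or a wall-clock timestamp.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    payload: str
    deadline: float  # hint from the client: time after which a reply is useless


def drain(queue: deque, handle, clock=time.monotonic) -> int:
    """Process queued requests, skipping expired ones.

    Returns the number of requests that were discarded.
    """
    discarded = 0
    while queue:
        request = queue.popleft()
        if clock() >= request.deadline:
            discarded += 1  # the client has given up; do not spend work on it
            continue
        handle(request)
    return discarded
```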
