Failure Handling Techniques

Let’s look into some commonly used hardware failure handling techniques in distributed systems.

Failure is the norm in a distributed system, so building a system that can cope with failures is crucial. This chapter covers principles for dealing with failures and basic patterns for building systems that are resilient to them.

In distributed systems, dealing with a failure consists of three main parts:

  • identifying the failure
  • recovering from the failure
  • in some cases, containing the failure to reduce its impact

Hardware failures

Hardware failures can be the most damaging ones since they can lead to data loss or corruption. Also, the probability of a hardware failure is significantly higher in a distributed system because of the larger number of hardware components involved.

Silent hardware failures

Silent hardware failures are the ones with the biggest impact since they can potentially affect the behavior of a system without anyone noticing.

Example

A node in the network corrupts part of a message, so the recipient receives data that differs from what the sender originally sent and has no way to detect the difference. Similarly, data written to a disk can be corrupted during the write operation or even long after the write has completed.

Techniques to handle failures

Following are some techniques that are commonly used to handle hardware failures, including silent ones.

Retransmitting data

One way to detect these failures when sending a message to another node is to introduce redundancy in the message using a checksum derived from the actual payload. If the message is corrupted in transit, the checksum that the recipient recomputes from the payload will almost certainly not match the one carried in the message. The recipient can then ask the sender to send the message again.
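The checksum-and-retransmit idea can be sketched in a few lines of Python. This is a minimal illustration, not a real network protocol: it assumes CRC-32 (from the standard zlib module) as the checksum, and the names frame, unframe, send_with_retransmission, and flaky_channel_factory are hypothetical helpers introduced only for this example. The "channel" is a plain function that stands in for the network.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum derived from the payload."""
    checksum = zlib.crc32(payload)
    return payload + checksum.to_bytes(4, "big")


def unframe(message: bytes):
    """Return the payload if its checksum matches, otherwise None."""
    if len(message) < 4:
        return None
    payload = message[:-4]
    received_checksum = int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != received_checksum:
        return None  # corruption detected
    return payload


def send_with_retransmission(payload: bytes, channel, max_attempts: int = 3):
    """Resend the framed payload until the recipient can validate it."""
    for attempt in range(1, max_attempts + 1):
        received = unframe(channel(frame(payload)))
        if received is not None:
            return received, attempt
    raise IOError("message still corrupted after all attempts")


def flaky_channel_factory(corrupt_first_n: int):
    """Hypothetical channel that flips one bit in the first N messages."""
    state = {"sent": 0}

    def channel(message: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            corrupted = bytearray(message)
            corrupted[0] ^= 0x01  # simulate a single-bit corruption
            return bytes(corrupted)
        return message

    return channel
```

For instance, send_with_retransmission(b"hello", flaky_channel_factory(2)) sends the message three times: the first two copies fail the checksum and the third is accepted. A stronger checksum such as a cryptographic hash could be swapped in if accidental collisions are a concern.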

Storing data to multiple disks

When writing data to disk, retransmission might not be useful because the corruption may be detected only long after the client performed the write operation, at which point it might not be feasible to rewrite the data. Instead, the application can make sure that data is written to multiple disks, so that a corrupted copy can be discarded later on and the right data can be read from another disk that holds a copy with a valid checksum.
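The following sketch illustrates reading from replicated storage with checksum validation. It is a simplified model under stated assumptions: each "disk" is just a Python dictionary, SHA-256 is assumed as the checksum, and write_replicated and read_replicated are hypothetical names chosen for this example.

```python
import hashlib


def write_replicated(disks, key, data: bytes) -> None:
    """Store the data together with its SHA-256 digest on every disk."""
    digest = hashlib.sha256(data).digest()
    for disk in disks:
        disk[key] = (data, digest)


def read_replicated(disks, key) -> bytes:
    """Return the first copy whose stored digest still matches its data."""
    for disk in disks:
        entry = disk.get(key)
        if entry is None:
            continue  # this disk has no copy of the key
        data, digest = entry
        if hashlib.sha256(data).digest() == digest:
            return data
        # Otherwise the copy is corrupted: skip it and try the next disk.
    raise IOError("no valid replica found")
```

If one disk later returns corrupted bytes for a key, read_replicated skips that copy and serves the data from another disk. A real system would typically also repair or replace the corrupted copy, which this sketch leaves out.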

Error correcting codes (ECC)

Error correcting codes (ECC) are used in cases where retransmitting the data or storing it multiple times is impossible or costly.

These are similar to checksums and are stored alongside the actual payload, but they have the additional property that they can be used to correct corruption errors by recalculating the original payload.

The downside of error correcting codes is that they are larger than checksums, thus having a higher overhead in terms of data stored or transmitted across the network.
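As one concrete illustration, the sketch below uses the classic Hamming(7,4) code, which is only one of many error correcting codes and is chosen here purely as an example. It encodes 4 data bits into 7 bits and can correct any single flipped bit. The function names hamming_encode and hamming_decode are hypothetical. Note the overhead: 3 parity bits for every 4 data bits, which illustrates the cost described above.

```python
def hamming_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p4 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(codeword):
    """Correct up to one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # syndrome points at the bad bit
    if error_position:
        c[error_position - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], error_position
```

For example, encoding [1, 0, 1, 1], flipping any single bit of the resulting codeword, and passing it to hamming_decode recovers [1, 0, 1, 1] without any retransmission. This particular code cannot correct two or more flipped bits in the same codeword.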

Note: As a distributed system consists of many different parts, these kinds of failures can happen in any of them. This raises the question of where and how we can use these failure handling techniques.
