Failure in the World of Distributed Systems

Let's see why failures occur in distributed systems, and how we can detect them.

We'll cover the following

One reason for failure
One mechanism to detect failure
- Trade-off for the small timeout value
- Trade-off for the large timeout value
Failure detector
- Properties that categorize failure detectors
- A perfect failure detector

We should understand that it is challenging to identify failure because of all the characteristics of a distributed system that the Difficulties Designing Distributed Systems lesson described. One of them is the asynchronous nature of the network.

One reason for failure

The asynchronous nature of the network in a distributed system can make it very hard for us to differentiate between a crashed node and a node that is just really slow to respond to requests.

One mechanism to detect failure

Timeouts is the main mechanism we can use to detect failures in distributed systems. Since an asynchronous network can infinitely delay messages, timeouts impose an artificial upper bound on these delays. As a result, we can assume that a node fails when it is slower than this bound. This is useful because otherwise, the assumption that the nodes are extremely slow would block the system that is waiting for the nodes that crashed.

However, a timeout does not represent an actual limit. Thus, it creates the following trade-off.

Trade-off for the small timeout value

If we select a smaller value for the timeout, our system will waste less time waiting for the nodes that have crashed.

At the same time, the system might declare some nodes that have not crashed dead, while they are actually just a bit slower than expected.

The following illustration shows this trade-off phenomenon.

Get hands-on with 1400+ tech skills courses.

Before Getting Started

Introduction to Distributed Systems

Basic Concepts and Theorems

Distributed Transactions

Achieving Isolation

Achieving Atomicity

Concluding Distributed Transactions

Consensus

Time

Order

Networking

Security

Security Protocols

From Theory to Practice

Case Study 1: Distributed File Systems

Case Study 2: Distributed Coordination Service

Case Study 3: Distributed Data Stores

Case Study 4: Distributed Messaging System

Case Study 5: Distributed Cluster Management

Case Study 6: Distributed Ledger

Case Study 7: Distributed Data Processing Systems

Practices & Patterns

Communication Patterns

Coordination Patterns

Data Synchronization

Shared-nothing Architectures

Distributed Locking

Compatibility Patterns

Dealing with Failure

Distributed Tracing

Concluding this Course

Failure in the World of Distributed Systems

One reason for failure

One mechanism to detect failure

Trade-off for the small timeout value