Applying Failure Handling Techniques

The end-to-end argument principle

The end-to-end argument is a design principle, which suggests that some functions such as the fault tolerance techniques described previously can be implemented completely and correctly only with the knowledge and help of the application standing at the endpoints of the communication system.

A canonical example to illustrate this point is the careful file transfer application, where a file needs to be moved from computer A’s storage to computer B’s storage without damage.

As shown in the following illustration, hardware failures can happen in many places during this process, such as:

  • The disks of computers
  • The software of the file system
  • The hardware processors, their local memory
  • The communication system

Implementing error recovery functionality at the application level

Even if the various subsystems embed error recovery functionality, this can only cover lower levels of the system, and it cannot protect from errors that happen at a higher level of the system.

For example, error detection and recovery implemented at the disk level or in the operating system won’t help if the application has a defect that leads to writing the wrong data in the first place. It means that we can only achieve complete correctness by implementing this function at the application level.

This function can be implemented redundantly at lower levels, but this is mostly done as a performance optimization.

Note: An example of this was already presented in the chapter about networking, where we saw that the link layer provides reliable data transfer for wireless links (e.g., Wi-fi) even though this can also be provided at the transport layer (e.g., TCP).

It’s also important to note that this redundant implementation at lower levels is not always beneficial, but it depends on the use case.The literature by Saltzer et al. and Moors et al. covers this trade-off extensively.

Providing exactly-once guarantees at the application level

The end-to-end argument principle manifests in many different ways when dealing with a distributed system. The most relevant problem we have encountered repeatedly throughout this course is providing exactly-once guarantees.

Let’s consider an extremely simplified version of the problem, where application A wants to trigger an operation on application B exactly once, and each application consists of a single server.

Using TCP

The communication subsystem, specifically TCP, can provide reliable data delivery via retries, acknowledgments, and deduplication, which are the techniques already described in this course.

However, TCP is still not sufficient to provide exactly-once guarantees at the application level.

Problems

The application can encounter the following problems when using TCP to achieve exactly-once guarantees:

  • The TCP layer on the side of application B might receive a packet and acknowledge it back to the sender side while buffering it locally to be delivered to the application. However, the application crashes before receiving this packet from the TCP layer and processing it. In this case, the application on the sender side will think the packet has been successfully processed while it wasn’t.

  • The TCP layer on the side of application B might receive a packet and deliver it successfully to the application, which processes it successfully. However, a failure happens at this point, and the applications on both sides are forced to establish a new TCP connection. Application A had not received an application acknowledgment for the last message, so it attempts to resend it on the new connection.

  • Unfortunately, TCP provides reliable transfer only in the scope of a single connection, so it will not be able to detect if this packet has been received and processed in a previous connection. As a result, a packet will be processed by the application more than once.

Solution

The main takeaway from the above problems is that any functionality needed for exactly-once semantics (e.g., retries, acknowledgments, and deduplication), needs to be implemented at the application level in order to be correct and safe against all kinds of failures.

Achieving mutual exclusion

The end-to-end principle manifests in a slightly different shade to achieve mutual exclusion in a distributed system.

The fencing technique discussed previously extends mutual exclusion function to all the involved ends of the application.

Note: The goal of this chapter is not to explore all the problems where the end-to-end argument is applicable. Instead, the goal is to raise awareness and make the reader appreciate its value on system design to consider if and when need be

Get hands-on with 1400+ tech skills courses.