Leases in Distributed Systems

Learn about the issues with locks and leases in distributed systems.

Recalling concurrency

In the introductory chapters of the course we learned that concurrency is one of the factors that contribute significantly to the complexity of distributed systems.

A mechanism is needed to ensure that all the various components of a distributed system running concurrently do so in a safe way and does not bring the overall system to an inconsistent state.

Leader election problem

We have already discussed the leader election problem, where the system needs to ensure only one node in the system can perform the leader duties at any point in time.

Press + to interact

Solution

Amongst the available techniques, locking is the simplest solution and one that is used commonly. However, locking techniques are subject to different failure modes when applied in a distributed system.

This chapter will cover common pitfalls in locks and address them to use locking safely in a distributed system.

Locks

The main property derived from the use of locks is mutual exclusion. Multiple concurrent actors can be sure that only one will perform a critical operation at a time.

All actors follow the same sequence of operations, first they acquire the lock, perform that critical operation, and then release the lock so that other workers can proceed.

Press + to interact

This sequence of operation is usually simple to implement when all actors are running inside the same application sharing a single memory address space and the same lifecycle. However, doing the same in a distributed system is much more complicated, mostly due to the potential of partial failures.

Complication of locking in distributed systems

The main complication of locking in a distributed system is that the various system nodes can fail independently. As a result, a node that is currently holding a lock might fail before it releases the lock. This would bring the system to a halt until that lock is released via some other means (i.e., via an operator), thus reducing availability significantly. A timeout mechanism can be used to cope with this issue.

Using leases instead of locks

A leaseC. G. Gray and D. R. Cheriton, “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency,” Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, 1989. is essentially a lock with an expiry timeout, after which the lock is automatically released by the system responsible for managing the locks. By using leases instead of locks, the system can automatically recover from failures of nodes that have acquired locks by releasing them and allowing other nodes to acquire them to make progress.

Safety risks in lease

Leases introduce new safety risks.There are now two different nodes in the system that have different views about the state of the system, specifically the nodes that hold a lock. This is because these nodes have different clocks, so the time of expiry can differ between them. And because a failure detector cannot be perfect, as explained earlier in the course.

The fact that part of the system considers a node that can fail does not necessarily mean this node will fail. It could be a network partition that prevents some messages from being exchanged between some nodes, or that node might be busy processing something unrelated. As a result, that node might think it still holds the lock even though it has expired and is automatically released by the system.

The following illustration shows an example, where nodes A and B are trying to acquire a lease in order to perform some operations in a separate system.

Node A manages to acquire a lease first successfully. However, there is a significant delay between acquiring the lease and performing the associated operation. This delay could be due to various reasons, such as:

  • Long garbage collection pause
  • Scheduling delays
  • Network delays

In the meanwhile, the lease has expired. It was released by the system and acquired by node B, which has also managed to perform the operation that’s protected by the lock. After a while, the operation from node A also reaches the system, and it’s executed even though the lease is not held anymore by that node violating the basic invariant that was supposed to be protected by the lease mechanism.

Ensuring that the lease is still held by simply performing another check before initiating the operation in node A would not solve the problem since the same delays can occur between this check and the initiation of the operation or even the delivery of the operation to the system.

Note: One simple technique called fencing can solve the issues with distributed leases. We will discuss it in the next lesson.

Get hands-on with 1400+ tech skills courses.