Leases in Distributed Systems
Learn about the issues with locks and leases in distributed systems.
Recalling concurrency
In the introductory chapters of the course we learned that concurrency is one of the factors that contribute significantly to the complexity of distributed systems.
A mechanism is needed to ensure that all the various components of a distributed system running concurrently do so in a safe way and does not bring the overall system to an inconsistent state.
Leader election problem
We have already discussed the leader election problem, where the system needs to ensure only one node in the system can perform the leader duties at any point in time.
Solution
Amongst the available techniques, locking is the simplest solution and one that is used commonly. However, locking techniques are subject to different failure modes when applied in a distributed system.
This chapter will cover common pitfalls in locks and address them to use locking safely in a distributed system.
Locks
The main property derived from the use of locks is mutual exclusion. Multiple concurrent actors can be sure that only one will perform a critical operation at a time.
All actors follow the same sequence of operations, first they acquire the lock, perform that critical operation, and then release the lock so that other workers can proceed.
This sequence of operation is usually simple to implement when all actors are running inside the same application sharing a single memory address space and the same lifecycle. However, doing the same in a distributed system is much more complicated, mostly due to the potential of partial failures.
Complication of locking in distributed systems
The main complication of locking in a distributed system is that the various system nodes can fail independently. As a result, a node that is currently holding a lock might fail before it releases the lock. This would bring the system to a halt until that lock is released via some other means (i.e., via an operator), thus reducing availability significantly. A timeout mechanism can be used to cope with this issue.
Using leases instead of locks
A
Safety risks in lease
Leases introduce new safety risks.There are now two different nodes in the system that have different views about the state of the system, specifically the nodes that hold a lock. This is because these nodes have different clocks, so the time of expiry can differ between them. And because a failure detector cannot be perfect, as explained earlier in the course.
The fact that part of the system considers a node that can fail does not necessarily mean this node will fail. It could be a network partition that prevents some messages from being exchanged between some nodes, or that node might be busy processing something unrelated. As a result, that node might think it still holds the lock even though it has expired and is automatically released by the system.
The following illustration shows an example, where nodes A and B are trying to acquire a lease in order to perform some operations in a separate system.
Node A manages to acquire a lease first successfully. However, there is a significant delay between acquiring the lease and performing the associated operation. This delay could be due to various reasons, such as:
- Long garbage collection pause
- Scheduling delays
- Network delays
In the meanwhile, the lease has expired. It was released by the system and acquired by node B, which has also managed to perform the operation that’s protected by the lock. After a while, the operation from node A also reaches the system, and it’s executed even though the lease is not held anymore by that node violating the basic invariant that was supposed to be protected by the lease mechanism.
Ensuring that the lease is still held by simply performing another check before initiating the operation in node A would not solve the problem since the same delays can occur between this check and the initiation of the operation or even the delivery of the operation to the system.
Note: One simple technique called fencing can solve the issues with distributed leases. We will discuss it in the next lesson.
Get hands-on with 1400+ tech skills courses.