Solving the Distributed Snapshot Problem
Let's explore a seminal algorithm used for capturing distributed snapshots.
We'll cover the following
Chandy-Lamport algorithm
The Chandy Lamport algorithm solves the consistent snapshot problem in a distributed system.
Idea
The algorithm is based on the following main idea: a marker message is sent between nodes using the available communication channels that represent an instruction to a node to record a snapshot of the current state.
Working
The algorithm works as follows:
- The node that initiates the protocol records its state and then sends a marker message to all the outbound channels.
Importantly, the marker is sent after the node records its state and before any further messages are sent to the channels.
- When a node receives a marker message, its behaviour depends on whether the node has already recorded its state (while emitting the mark previously) or not.
- If the node has not recorded its state, it records its state, and then it records the state of the channel the marker was received from as an empty sequence. It then sends the marker to all the outbound channels.
- If the node has recorded its state, it records the state of the channel the marker was received from as the sequence of messages received from after the node’s state was recorded and before the node received the marker from .
The following illustration contains a sample execution of this algorithm on the simple system presented previously.
The node sends the token and, right after, initiates the execution of the protocol.
As a result, it records its state and sends the marker in channel . The node receives the token, transitions to state . It then sends the token to channel and transitions to state .
Afterward, it receives the marker message, records its state and the state of channel as an empty sequence, and sends the marker message to channel .
Note that this is just one of the possible executions. In an alternative execution, node could have processed both the token and the marker, recording its state as and potentially sending the marker across channel without sending the token yet. This would have led to a different but still consistent snapshot.
Meanwhile, node receives the token, transitions to state , and buffers the token in the sequence of messages received while the snapshot protocol was executing. The node then receives the marker and records the state of the channel as the sequence [token].
At this point, the protocol concludes, since the state of all nodes and channels has been recorded and the global snapshot state is the following:
snapshot():
snapshot():
snapshot(): []
snapshot(): [token]
Get hands-on with 1400+ tech skills courses.