Solving the Distributed Snapshot Problem

Let's explore a seminal algorithm used for capturing distributed snapshots.

We'll cover the following

Chandy-Lamport algorithm

The Chandy Lamport algorithm solves the consistent snapshot problem in a distributed system.

Idea

The algorithm is based on the following main idea: a marker message is sent between nodes using the available communication channels that represent an instruction to a node to record a snapshot of the current state.

Working

The algorithm works as follows:

  • The node that initiates the protocol records its state and then sends a marker message to all the outbound channels.

Importantly, the marker is sent after the node records its state and before any further messages are sent to the channels.

  • When a node receives a marker message, its behaviour depends on whether the node has already recorded its state (while emitting the mark previously) or not.
    • If the node has not recorded its state, it records its state, and then it records the state of the channel cc the marker was received from as an empty sequence. It then sends the marker to all the outbound channels.
    • If the node has recorded its state, it records the state of the channel the marker was received from as the sequence of messages received from cc after the node’s state was recorded and before the node received the marker from cc.

The following illustration contains a sample execution of this algorithm on the simple system presented previously.

The node pp sends the token and, right after, initiates the execution of the protocol.

As a result, it records its state s0s_0 and sends the marker in channel cc. The node qq receives the token, transitions to state s1s_1. It then sends the token to channel c′c^{\prime} and transitions to state s0s_0.

Afterward, it receives the marker message, records its state s0s_0 and the state of channel cc as an empty sequence, and sends the marker message to channel c′c^{\prime}.

Note that this is just one of the possible executions. In an alternative execution, node qq could have processed both the token and the marker, recording its state as s1s1 and potentially sending the marker across channel c′c^{\prime} without sending the token yet. This would have led to a different but still consistent snapshot.

Meanwhile, node pp receives the token, transitions to state s1s_1, and buffers the token in the sequence of messages received while the snapshot protocol was executing. The node pp then receives the marker and records the state of the channel c′c^{\prime} as the sequence [token].

At this point, the protocol concludes, since the state of all nodes and channels has been recorded and the global snapshot state is the following:

snapshot(pp): s0s_0
snapshot(qq): s0s_0
snapshot(cc): []
snapshot(c′c^{\prime}): [token]

Get hands-on with 1400+ tech skills courses.