Communication among Raft Nodes

Let's explore how Raft nodes communicate, what problem occurs, and how Raft solves this problem.

Communication mechanism

Nodes communicate via remote procedure calls (RPCs) and Raft has two basic RPC types:

  • RequestVote: Sent by candidates during an election
  • AppendEntries: Sent by leaders to replicate log entries and provide a heartbeat form

The commands are stored in a log replicated to all the nodes of the cluster.

The log entries are numbered sequentially, and they contain the term in which they were created and the associated command for the state machine, as shown in the following illustration.

An entry is considered committed if it can be applied to the state machine of the nodes. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines, while also guaranteeing that no other entry will be committed for the same index. It also guarantees that all the preceding entries of a committed entry are also committed. This status essentially signals that consensus has been reached on this entry.

As mentioned previously, leaders are responsible for receiving commands from clients and replicating them across the clusters. This happens in the following order:

  1. When a leader receives a new command, it appends the entry to its own log and then sends an AppendEntries request in parallel to the other nodes, retrying when it does not receive a timely response.
  2. When the leader receives a response from a majority of followers, the entry can be considered committed.
  3. The leader applies the command to its state machine and informs the followers they can do the same by piggybacking the necessary information about committed entries in subsequent AppendEntries messages.

Of course, this is mostly the happy path.

Divergence among nodes

During leader and follower failures, divergence might be observed between the various nodes. The following illustration contains some examples of this phenomenon.

For example, a follower might crash and thus miss some (committed) entries (rows a and b in the above illustration). It might receive some more (non committed) entries (rows c and d). Or, both things might happen (rows e and f). Specifically, the scenario in row f could happen if a node was elected leader in both terms 2 and 3, replicated some entries, but crashed before any of these entries were committed.

Resolving divergence

Raft contains some additional elements to resolve these temporary divergences.

The main overarching principle is that any elected leader should contain any entries that have been committed up to the term it becomes leader. The leader is then responsible for helping any followers with conflicting logs adjust them accordingly to converge again.

It’s important to note that a leader only appends entries to its log and never updates it. Only followers are allowed to update their log.

These two aspects are satisfied in the following way.

  • During an election, every RequestVote RPC contains some information about the candidate’s log. A voter is allowed to vote for a candidate only if its log is not more up-to-date than the candidate’s log. Raft determines which of the two logs is more up-to-date by comparing the index and term of the last entries in the logs. A candidate must receive votes from a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority, then it’s guaranteed to hold all the committed entries.

  • When sending an AppendEntries RPC, a leader includes the index and term of the entry that immediately precedes the new entries in its log. The followers check against their own logs and reject the request if their log differs. If that happens, the leader discovers the first index where their logs disagree and starts sending all the entries after that point from its log. The follower discards its own entries and adds the leader’s entries to its log. As a result, their logs eventually converge again.

What happens when a leader crashes before committing an entry?

We mentioned previously that a leader knows that an entry from its term can be considered committed when it has been successfully replicated to a majority of nodes. It can then be safely applied to the state machine.

Let’s see what happens when a leader crashes before committing an entry.

If subsequent leaders have received this entry, they will attempt to finish replicating the entry.

However, a subsequent leader cannot safely conclude that an entry from a previous term is committed once stored on a majority of nodes.

The reason is there is an edge case where future leaders can still replace this entry even if it’s stored on a majority of nodes.

Feel free to refer to the paperD. Ongaro and J. Ousterhout, “In Search of an Understandable Consensus Algorithm,” Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, 2014. for a full description of how this can happen.

As a result, leaders can safely conclude an entry from a previous term is committed by replicating it and then replicating a new entry from its term on top of it. If the new entry from its own term is replicated to a majority, the leader can safely consider it as committed. Thus, it can also consider all the previous entries as committed at this point.

So, a leader is guaranteed to have all the committed entries at the start of its term, but it doesn’t know which ones. To find out, it needs to commit an entry from its own term. To expedite this in periods of idleness, the leader can just commit a no-op command at the beginning of its term.

Get hands-on with 1400+ tech skills courses.