Replication

Learn what replication is and why it is used in distributed systems.

Partitioning can improve the scalability and performance of a system by distributing data and request load to multiple nodes.

Another dimension that benefits from using a distributed system is known as availability.

Availability

Availability refers to the ability of the system to remain functional despite failures in parts of it.

Mechanism to achieve availability

The technique we use to achieve availability is replication.

Replication

Replication is the main technique used in distributed systems to increase availability. It consists of storing the same piece of data in multiple nodes (called replicas) so that if one of them crashes, data is not lost, and requests can be served from the other nodes in the meanwhile.

However, the benefit of increased availability from replication comes with a set of new complications.

Replication implies that the system now has multiple copies of every piece of data. These copies must be maintained and kept in sync with each other on every update.

Ideally, replication should function transparently to the end user, or engineer. This is to create the illusion that there’s only one copy of every piece of data. This makes a distributed system look like a simple, centralized system of a single node that is much easier to reason about and develop software around.

Of course, this is not always possible. We may require significant hardware resources or need to give up other desirable properties to achieve this ideal. For instance, engineers sometimes willingly accept a system that provides much higher performance, but occasionally gives an inconsistent view of the data. Hence, they only do this under specific conditions—and in a specific way—they can account for when they design the application.

Therefore, there are two main strategies for replication:

  1. Pessimistic replication
  2. Optimistic replication

Pessimistic replication

Pessimistic replication tries to guarantee from the beginning that all the replicas are identical to each other—as if there was only one copy of the data all along.

Optimistic replication

Optimistic replication, or lazy replication, allows the different replicas to diverge. This guarantees that they will converge again if the system does not receive any updates, or enters a quiesced state, for a period of time.

Replication is a very active field in research, so there are many different algorithms for it.

What’s the main difference between replication and partitioning in distributed systems, and how do they impact system functionality?

What's the main difference between replication and partitioning in distributed systems?

Get hands-on with 1400+ tech skills courses.