Problems with sharing

At this point, it must have become evident that sharing leads to coordination, which is one of the main factors that inhibit high availability, performance, and scalability.

For example, we have already explained how distributed databases can scale to larger datasets more cost-efficiently than centralized, single-node databases. At the same time, some form of sharing is sometimes necessary and even beneficial for the same characteristics. For instance, a system can increase its overall availability by reducing sharing through partitioning since the various partitions can have independent failure modes. However, when looking at a single data item, availability can be increased by increasing sharing via replication.

Solution

A key takeaway from all of the above problems is reducing sharing.

Reducing sharing

Reducing sharing can be very beneficial when applied properly. Some system architectures follow this principle to the extreme to reduce coordination and contention so that every request can be processed independently by a single node or a single group of nodes in the system. These are usually called shared-nothing architectures.

This chapter will explain how we can use this principle to build such architectures and what are some of its trade-offs.

Techniques for reducing sharing

A basic and widely used technique for reducing sharing is the decomposing stateful and stateless parts of a system.

Decomposing stateful and stateless parts of a system

The main benefit from this is that stateless parts of a system tend to be fully symmetric and homogeneous, which means that every instance of a stateless application is indistinguishable from the rest. Separating them makes scaling a lot easier.

Since all the instances of an application are identical, one can balance the load across all of them easily way since all of them should be capable of processing any incoming request.

Note: In practice, the load balancer might need to have some state that reflects the load of the various instances.

The system can be scaled out to handle more load by simply introducing more instances of the applications behind a load balancer. The instances could also send heartbeats to the load balancer so that the load balancer can identify the ones that have potentially failed and stop sending requests to them. The same could also be achieved by the instances exposing an API, where requests can be sent periodically by the load balancer to identify healthy instances.

Of course, to achieve high availability and scale incrementally, the load balancer also needs to be composed of multiple redundant nodes. There are different ways to achieve this, but a typical implementation uses a single domain name (DNS) for the application that resolves multiple IPs belonging to the various servers of the load balancer. The clients, such as web browsers or other applications, can rotate between these IPs. The DNS entry needs to specify a relatively small time-to-live (TTL) so that clients can identify new servers in the load balancer fleet relatively quickly.

Managing stateful systems

The stateful systems are a bit harder to manage since the various nodes of the system are not identical.

Each node contains different pieces of data, to perform appropriate routing to direct requests to the proper part of the system. As implied before, the presence of this state also creates tension that prevents us from completely eliminating sharing if there is a need for high availability.

A combination of partitioning and replication is used to strike a balance. Data are partitioned to reduce sharing and create some independence and fault isolation, but every partition is replicated across multiple nodes to make every partition fault-tolerant. We have already examined several systems that follow this pattern, such as Cassandra and Kafka.

Sharing as a spectrum

Sharing is not a binary property of a system but rather a spectrum.

  • On the one end of the spectrum, there might be systems with as little sharing as possible, i.e., not allowing any transactions across partitions to reduce the coordination needed.

  • Some systems might fall in the middle of the spectrum, where transactions are supported but performed in a way that introduces coordination only across the involved partitions instead of the whole system.

  • Lastly, systems that store all the data in a single node fall somewhere on the other end of the spectrum.

A typical shared-nothing architecture

The following illustration shows an example of a shared-nothing architecture:

Get hands-on with 1400+ tech skills courses.