Data Synchronization

Learn why we sometimes need to store data in multiple places and the approaches available for keeping it in sync.

The need to store data in multiple places

There are cases where we need to store the same data in multiple places and potentially in different forms. These derived copies are also referred to as materialized views. Below are some examples of such cases:

  • Data that resides in a persistent datastore also needs to be cached in a separate in-memory store so that read operations can be processed from the cache with lower latency. Write operations need to update both the persistent datastore and the in-memory store.

  • Data stored in a distributed key-value store must also be stored in a separate datastore that provides efficient full-text search, such as Elasticsearch or Solr. Depending on the form of a read operation, the appropriate datastore can be used for optimal performance.

  • Data stored in a relational database also needs to be stored in a graph database so that graph queries can be performed in a more efficient way.
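To make the first example above more concrete, here is a minimal sketch of a cache-aside read path, where reads are served from an in-memory cache when possible and fall back to the persistent store on a miss. Both stores are modeled as plain dictionaries purely for illustration; real systems would use an actual database and a cache such as Redis or Memcached.

```python
# Minimal sketch of a cache-aside read path: serve reads from an
# in-memory cache when possible, falling back to the persistent store.
# Both stores are modeled as plain dictionaries for illustration.

persistent_store = {"user:1": {"name": "Alice"}}
cache = {}

def read(key):
    # Fast path: the item is already cached.
    if key in cache:
        return cache[key]
    # Slow path: load from the persistent store and populate the cache.
    value = persistent_store.get(key)
    if value is not None:
        cache[key] = value
    return value

first = read("user:1")   # cache miss: loaded from the persistent store
second = read("user:1")  # cache hit: served from memory
```

Note that this only covers the read path; as the first bullet points out, every write operation must now update both stores, which is exactly where the synchronization problem discussed below arises.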

Note: Given that the data resides in multiple places, we need a mechanism that keeps them in sync. This chapter will examine some of the approaches available for this purpose and the associated trade-offs.

Synchronizing data using dual writes

One approach is to perform writes to all the associated datastores from a single application that receives update operations. This approach is sometimes referred to as dual writes. Typically, writes to the associated datastores are performed synchronously, updating the data in all locations before responding to the client with a confirmation that the update operation was successful.
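In code, a dual-write update path can be sketched as follows. The two stores and the `handle_update` function are illustrative stand-ins, modeled as dictionaries; a real application would issue writes to two separate datastore clients.

```python
# Sketch of a dual-write update path: the application writes the same
# item to both datastores synchronously before acknowledging the client.
# Both stores are modeled as dictionaries; all names are illustrative.

persistent_store = {}
cache = {}

def handle_update(key, value):
    # Write to every associated datastore before confirming success.
    persistent_store[key] = value
    cache[key] = value
    return "OK"  # acknowledgement is sent only after both writes succeed

response = handle_update("user:1", {"name": "Alice"})
```

The sketch assumes both writes succeed; the next section examines what happens when they do not.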

Problems

One drawback of this approach is how the system handles partial failures and their impact on atomicity. If the application updates the first datastore successfully, but the request to update the second datastore fails, then atomicity is violated. As a result, the overall system becomes inconsistent. It’s also unclear what the response to the client should be in this case, since the data has been updated, but only in some places.

However, even if we assume that there are no partial failures, there is another pitfall: race conditions between concurrent writers and their impact on isolation. Let’s assume two writers submit an update operation for the same item. The application receives both and attempts to update both datastores, but the associated requests are re-ordered, as shown in the following illustration:

In the above illustration, the first datastore contains data from the first request, while the second datastore contains data from the second request. This also leaves the overall system in an inconsistent state.
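This race can be reproduced in a toy simulation: two writers update the same key, but their requests reach the two datastores in opposite orders, so each store ends up with a different "last" write even though every individual request succeeded. The stores are plain dictionaries for illustration.

```python
# Toy reproduction of the dual-write race: two concurrent update
# operations for the same key reach the two datastores in opposite
# orders, so each store keeps a different final value.

store_a = {}
store_b = {}

# Requests to store_a arrive in order: writer 1, then writer 2.
store_a["user:1"] = "from writer 1"
store_a["user:1"] = "from writer 2"

# Requests to store_b are re-ordered: writer 2, then writer 1.
store_b["user:1"] = "from writer 2"
store_b["user:1"] = "from writer 1"

# The stores now disagree, even though no request failed.
inconsistent = store_a["user:1"] != store_b["user:1"]
```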

Solution

An obvious solution to mitigate these issues is to introduce a distributed transaction protocol that provides the necessary atomicity and isolation, such as a combination of two-phase commit and two-phase locking. For this to work, the underlying datastores need to support such a protocol. Even then, the protocol has performance and availability implications, as explained in the previous chapters.
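The shape of the two-phase commit part of this solution can be sketched as below. The `Participant` class and its methods are hypothetical stand-ins for datastores that support the protocol; a real implementation would also need durable logging, locking for isolation, and crash recovery.

```python
# Minimal sketch of two-phase commit over two datastore participants.
# All names are illustrative; a real protocol also requires durable
# logs, locking, and recovery after coordinator or participant crashes.

class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = {}
        self.committed = {}

    def prepare(self, key, value):
        # Phase 1: stage the write durably and vote yes/no.
        self.staged[key] = value
        return True

    def commit(self, key):
        # Phase 2: make the staged write visible.
        self.committed[key] = self.staged.pop(key)

    def abort(self, key):
        # Discard the staged write if the transaction is rolled back.
        self.staged.pop(key, None)

def two_phase_commit(participants, key, value):
    # Phase 1: collect a vote from every participant.
    if all(p.prepare(key, value) for p in participants):
        # Phase 2: all voted yes, so commit everywhere.
        for p in participants:
            p.commit(key)
        return True
    # Any "no" vote aborts the transaction on all participants.
    for p in participants:
        p.abort(key)
    return False

db = Participant("relational-db")
search = Participant("search-index")
ok = two_phase_commit([db, search], "user:1", {"name": "Alice"})
```

Either both participants commit or neither does, which restores atomicity; the cost is the extra round trip and the blocking behavior discussed in the previous chapters.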
