Long-Lived Transactions and Sagas

In this lesson, we will explore long-lived transactions and saga transactions. We will also look into the benefits that sagas provide over distributed transactions.

As explained previously, achieving complete isolation between transactions is relatively expensive.

The system either has to maintain locks for each transaction and potentially block other concurrent transactions from making progress, or abort some transactions to maintain safety, which leads to some wasted effort.

Furthermore, the longer a transaction runs, the bigger the impact of these mechanisms on the overall throughput is expected to be.

There is also a positive feedback cycle: using these mechanisms can cause transactions to take longer, which can increase the impact of these mechanisms.

Long-lived transactions

There is a specific class of transactions, called long-lived transactions (LLT).

These are transactions that by their nature have a longer duration, on the order of hours or even days instead of milliseconds. This can happen because the transaction processes a large amount of data, requires human input to proceed, or needs to communicate with slow third-party systems.

Examples of LLTs

  • Batch jobs that calculate reports over big datasets
  • Claims at an insurance company, containing various stages that require human input
  • An online order of a product that spans several days from order to delivery

As a result, running these transactions using the common concurrency control mechanisms degrades performance significantly, since they need to hold resources for long periods of time while not operating on them.

Sometimes, long-lived transactions do not really require full isolation between each other, but they still need to be atomic, so that consistency is maintained under partial failures. Thus, researchers came up with a new concept: the saga (H. Garcia-Molina and K. Salem, “Sagas,” Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data, 1987).

Saga

The saga is a sequence of transactions T_1, T_2, …, T_N that can be interleaved with other transactions.

However, it’s guaranteed that either all of the transactions succeed or the effects of any that completed are undone, which maintains the atomicity guarantee.

Each transaction T_i is associated with a so-called compensating transaction C_i, which is executed in case a rollback is needed.

Benefits of the saga

The concept of saga transactions can be really useful in distributed systems. As demonstrated in the previous sections, distributed transactions are generally hard and can only be achieved by making compromises on performance and availability.

There are cases where we can use a saga transaction instead of a distributed transaction. This will satisfy all of our business requirements while keeping our systems loosely coupled and achieving good availability and performance.

Example scenario

Let’s imagine we are building an e-commerce application, where every order of a customer requires several discrete steps: credit card authorization, checking warehouse inventory, item shipping, invoice creation and delivery, etc.

  • One approach could be to perform a distributed transaction across all these systems for every order. However, in this case, the failure of a single component (e.g., the payment system) could potentially bring the whole system to a halt.

  • An alternative, leveraging the saga pattern, would be to model the order operation as a saga operation, consisting of all these sub-transactions, where each of them is associated with a compensating transaction.

For example, debiting a customer’s bank account could have a compensating transaction that would give a refund. Then, we can build the order operation as a sequential execution of these transactions. If any of these transactions fails, we can roll back the ones that have already been executed by running their corresponding compensating transactions.
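The sequential execution and rollback described above can be sketched as follows. This is a minimal illustration, not a production design: the step names, the in-memory `state` dictionary, and the `run_saga` helper are all hypothetical, and the sketch assumes that compensating transactions themselves never fail.

```python
from typing import Callable, Dict, List, Tuple

# Each step pairs a transaction T_i with its compensating transaction C_i.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return False
        log.append(f"{name} done")
        completed.append(step)
    return True


# Hypothetical in-memory state standing in for the payment and warehouse systems.
state: Dict[str, int] = {"balance": 100, "stock": 1}

def debit() -> None: state["balance"] -= 30
def refund() -> None: state["balance"] += 30
def reserve() -> None: state["stock"] -= 1
def release() -> None: state["stock"] += 1
def cancel_shipment() -> None: pass

def ship() -> None:
    # Simulated failure of the shipping step.
    raise RuntimeError("carrier unavailable")


log: List[str] = []
ok = run_saga(
    [
        ("debit_account", debit, refund),
        ("reserve_item", reserve, release),
        ("ship_item", ship, cancel_shipment),
    ],
    log,
)
```

Here the third step fails, so the item is released and the customer is refunded, in that order, and `state` ends up back at its initial values. Note that between the debit and the refund, other transactions could observe the intermediate state, since a saga gives up isolation.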

There might still be cases where some form of isolation is needed.

In the example above, orders from different customers for the same product might share some data, which can lead to interference between them.

Cases where isolation is required

Think about the scenario of two concurrent orders A and B, where A has reserved the last item from the warehouse. As a result of this, order B fails at the first step and is rejected because of zero inventory. Later on, order A also fails at the second step because the customer’s card does not have enough money. Then, the associated compensating transaction runs, returning the reserved item to the warehouse.

This would mean that an order was rejected while it could have been processed normally. Of course, this violation of isolation does not have severe consequences. However, in some cases the consequences might be more serious, e.g. customers being charged without receiving a product.
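The interleaving of orders A and B can be replayed step by step. The sketch below is a simplified, single-threaded simulation under assumed names (`inventory`, `reserve`, `release`, and the outcome strings are all hypothetical); it only makes the order of events explicit.

```python
# Hypothetical warehouse with a single item left.
inventory = {"widget": 1}
outcomes = {}


def reserve() -> bool:
    """First saga step: reserve one item if any is available."""
    if inventory["widget"] == 0:
        return False
    inventory["widget"] -= 1
    return True


def release() -> None:
    """Compensating transaction for reserve()."""
    inventory["widget"] += 1


# Order A, step 1: reserves the last item.
a_reserved = reserve()

# Order B, step 1: sees zero inventory and is rejected.
b_reserved = reserve()
if not b_reserved:
    outcomes["B"] = "rejected: out of stock"

# Order A, step 2: the card is declined, so A compensates step 1.
card_has_funds = False
if a_reserved and not card_has_funds:
    release()
    outcomes["A"] = "rejected: payment declined"
```

At the end the item is back in stock, yet both orders were rejected: B observed an intermediate state of A that was later undone.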

To prevent these scenarios, some form of isolation can be introduced at the application layer.

Providing isolation at the application layer

Previous research (L. Frank and T. U. Zahle, “Semantic ACID Properties in Multidatabases Using Remote Procedure Calls and Update Propagations,” Software—Practice & Experience, Volume 28, Issue 1, Jan. 1998) on this topic proposed some concrete techniques that are countermeasures to isolation anomalies.

Some of these techniques are as follows:

Semantic lock

The use of a semantic lock essentially signals that some data items are currently in process and should be treated differently or not accessed at all. The final transaction of a saga takes care of releasing this lock and resetting the data to its normal state.
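One possible shape of a semantic lock is a flag stored on the data item itself. The sketch below is an assumption-laden illustration: the `locked_by` field, the `retry_later` response, and the `finish` function are hypothetical names, and locking the whole item record is just one of several possible granularities.

```python
# Hypothetical item record; "locked_by" plays the role of the semantic lock.
item = {"stock": 1, "locked_by": None}


def reserve(order_id: str) -> str:
    """First saga transaction: reserve the item and mark it as in process."""
    if item["locked_by"] is not None:
        # Another saga is in flight, so the stock value is not final yet.
        return "retry_later"
    if item["stock"] == 0:
        return "rejected"
    item["stock"] -= 1
    item["locked_by"] = order_id
    return "reserved"


def finish(order_id: str, succeeded: bool) -> None:
    """Last transaction of the saga (or its rollback): release the lock."""
    if item["locked_by"] != order_id:
        raise ValueError("lock is not held by this order")
    if not succeeded:
        item["stock"] += 1  # compensate the reservation
    item["locked_by"] = None


first_a = reserve("A")          # A takes the last item and sets the lock
first_b = reserve("B")          # B is told to retry instead of being rejected
finish("A", succeeded=False)    # A's payment fails; item and lock are released
second_b = reserve("B")         # B's retry now succeeds
```

With the flag in place, order B is asked to retry rather than being rejected outright, so it can still succeed once order A is rolled back. The cost is that the application now has to decide how to treat locked items and how long to keep retrying.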

Commutative updates

Commutative updates have the same effect regardless of their order of execution. Designing updates this way can help mitigate cases that are otherwise susceptible to lost update phenomena.
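As a small illustration, relative updates such as "add 1" or "subtract 2" commute, while absolute writes such as "set the stock to 8" do not. The helper names and values below are hypothetical.

```python
from itertools import permutations
from typing import Iterable


def apply_deltas(start: int, deltas: Iterable[int]) -> int:
    """Apply relative updates (increments and decrements) in the given order."""
    value = start
    for delta in deltas:
        value += delta
    return value


def apply_writes(start: int, values: Iterable[int]) -> int:
    """Apply absolute writes in the given order; the last write wins."""
    value = start
    for new_value in values:
        value = new_value
    return value


# Reserve one item, release one item (a compensation), reserve two items.
deltas = [-1, +1, -2]
delta_results = {apply_deltas(10, order) for order in permutations(deltas)}

# The same sagas expressed as absolute writes computed from stale reads.
writes = [9, 10, 8]
write_results = {apply_writes(10, order) for order in permutations(writes)}
```

Every ordering of the relative updates produces the same final stock, so `delta_results` contains a single value. The absolute writes produce a different final value depending on which one happens to run last, which is exactly the lost update problem.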

Re-ordering the structure of the saga

Re-order the saga structure so that a transaction called a pivot transaction delineates a boundary between transactions that can fail and those that can’t.

In this way, transactions that can’t fail, but could lead to serious problems if rolled back due to failures of other transactions, can be moved after the pivot transaction.

An example of this is a transaction that increases the balance of an account. This transaction could have serious consequences if another concurrent saga reads the increased balance, but the increase is then rolled back. Moving this transaction after the pivot transaction means that it will never be rolled back, since all the transactions after the pivot transaction can only succeed.
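The following sketch shows one way an executor could treat the two sides of the pivot differently: steps before the pivot are compensated on failure, while steps after it are retried until they succeed. The function `run_with_pivot`, the account names, and the retry limit are hypothetical, and the sketch assumes that steps after the pivot are safe to retry (here the simulated failure happens before any state is changed).

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]


def run_with_pivot(
    before: List[Tuple[Action, Action]],  # (transaction, compensation) pairs
    pivot: Action,
    after: List[Action],                  # steps that must eventually succeed
    max_attempts: int = 5,
) -> bool:
    compensations: List[Action] = []
    try:
        for action, compensate in before:
            action()
            compensations.append(compensate)
        pivot()
    except Exception:
        # Failure at or before the pivot: undo completed steps in reverse.
        for compensate in reversed(compensations):
            compensate()
        return False
    for action in after:
        for _ in range(max_attempts):
            try:
                action()
                break
            except Exception:
                continue  # transient failure: retry, never roll back
        else:
            raise RuntimeError("step after the pivot exhausted its attempts")
    return True


accounts = {"buyer": 100, "seller": 0}
calls = {"credit_seller": 0}

def debit_buyer() -> None: accounts["buyer"] -= 40
def refund_buyer() -> None: accounts["buyer"] += 40
def approve_payment() -> None: pass  # the pivot: the last step that may fail

def credit_seller() -> None:
    calls["credit_seller"] += 1
    if calls["credit_seller"] == 1:
        raise ConnectionError("transient failure")  # fails before any change
    accounts["seller"] += 40


committed = run_with_pivot(
    [(debit_buyer, refund_buyer)], approve_payment, [credit_seller]
)
```

Because the balance increase for the seller runs only after the pivot, no other saga can ever read an increase that is later undone. The debit, which can be compensated safely, stays before the pivot.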

We can apply these techniques selectively in cases where they are needed. However, they introduce significant complexity and move some of the burdens back to the application developers; the developers have to think again about all the possible failures and design accordingly. We need to consider trade-offs when choosing between using saga transactions or leveraging the transaction capabilities of the underlying datastore.
