Spanner supports the following types of operations:

  • Read-write transactions
  • Read-only transactions
  • Standalone (strong or stale) reads

Read-write transaction

A read-write transaction can contain read and/or write operations. It provides full ACID properties for the operations of the transaction. More specifically, read-write transactions are not simply serializable; they are strictly serializable.

Note: The Spanner documentation also refers to strict serializability as “external consistency,” but the two terms describe essentially the same guarantee.

A read-write transaction executes a set of read and write operations atomically at a single logical point in time.

Note: As explained earlier, Spanner achieves these properties using two-phase locking for isolation and two-phase commit for atomicity across multiple splits.
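
Before looking at the internal workflow, here is a minimal sketch of what a read-write transaction looks like from the client’s perspective, assuming the google-cloud-spanner Python client; the instance ID, database ID, table, and column names are hypothetical placeholders.

```python
# Minimal sketch of a client-side read-write transaction, assuming the
# google-cloud-spanner Python client. Instance, database, table, and column
# names are hypothetical.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

def transfer(transaction):
    # Reads go to the leaders of the splits that hold these rows, which
    # acquire read locks before serving them.
    rows = transaction.read(
        table="Accounts",
        columns=("AccountId", "Balance"),
        keyset=spanner.KeySet(keys=[(1,), (2,)]),
    )
    balances = {account_id: balance for account_id, balance in rows}

    # Writes are buffered on the client and sent only when the transaction
    # commits.
    transaction.update(
        table="Accounts",
        columns=("AccountId", "Balance"),
        values=[(1, balances[1] - 100), (2, balances[2] + 100)],
    )

# run_in_transaction commits the buffered writes atomically and retries the
# callback if the transaction is aborted.
database.run_in_transaction(transfer)
```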

Workflow

A read-write transaction follows this sequence:

  • After opening a transaction, a client directs all the read operations to the leader of the replica group that manages the split with the required rows. This leader acquires read locks for the rows and columns involved before serving the read request. Every read also returns the timestamp of any data read.

  • Any write operations are buffered locally in the client until the transaction is committed. While the transaction is open, the client sends keepalive messages to prevent participant leaders from timing out the transaction.

  • When the client has completed all reads and buffered all writes, it starts the two-phase commit protocol. (The two-phase commit is required only if the transaction accesses data from multiple replica groups; otherwise, the leader of the single replica group can commit the transaction through Paxos alone.) The client chooses one of the participant leaders as the coordinator leader and sends a prepare request to all the participant leaders along with the identity of the coordinator leader. The participant leaders involved in write operations also receive the buffered writes at this stage.

  • Every participant leader acquires the necessary write locks, chooses a prepare timestamp $s_i$ that is larger than the timestamps of any previous transactions, and logs a prepare record in its replica group through Paxos. The leader also replicates the lock acquisitions to the replicas to ensure the locks are held even if the leader fails. It then responds to the coordinator leader with the prepare timestamp, as sketched below.
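
Conceptually, a participant leader’s prepare step could look like the following sketch. This is illustrative Python, not Spanner’s actual implementation; the lock_table and paxos_group collaborators are assumed stand-ins for the split’s lock table and its Paxos replication machinery.

```python
# Illustrative sketch of a participant leader's prepare step (not actual
# Spanner code). lock_table and paxos_group are assumed stand-ins.

class ParticipantLeader:
    def __init__(self, lock_table, paxos_group):
        self.lock_table = lock_table
        self.paxos = paxos_group
        self.last_assigned_ts = 0  # highest timestamp assigned so far

    def prepare(self, txn_id, buffered_writes, now_ts):
        # 1. Acquire write locks on the keys this transaction will modify.
        self.lock_table.acquire_write_locks(txn_id, buffered_writes.keys())

        # 2. Choose a prepare timestamp larger than any previously assigned one.
        prepare_ts = max(now_ts, self.last_assigned_ts + 1)
        self.last_assigned_ts = prepare_ts

        # 3. Log a prepare record (including the lock acquisitions) through
        #    Paxos so they survive a leader failure.
        self.paxos.replicate({
            "type": "prepare",
            "txn": txn_id,
            "timestamp": prepare_ts,
            "writes": buffered_writes,
        })

        # 4. Respond to the coordinator leader with the prepare timestamp.
        return prepare_ts
```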

The following illustration contains a visualization of this sequence:

Spanner mitigating availability problems

It is worth noting that the availability problems of the two-phase commit are partially mitigated in this scheme, because the participants and the coordinator are each a Paxos group. So, if one of the leader nodes crashes, another replica from that replica group will eventually detect it, take over, and help the protocol make progress.

Spanner handling deadlocks

The two-phase locking protocol can result in deadlocks. Spanner resolves these situations via a wound-wait scheme (D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II, “System Level Concurrency Control for Distributed Database Systems,” ACM Transactions on Database Systems (TODS), vol. 3, issue 2, June 1978), where a transaction $TX_1$ is allowed to abort a transaction $TX_2$ that holds the desired lock only if $TX_1$ is older than $TX_2$.
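
The rule itself is simple enough to express in a few lines. The sketch below is only an illustration of the wound-wait decision, not Spanner’s implementation; it assumes transactions are ordered by their start timestamps (smaller timestamp means older).

```python
# Illustrative wound-wait decision (not actual Spanner code). A transaction
# is "older" than another if it has a smaller start timestamp.

def resolve_lock_conflict(requester_start_ts: int, holder_start_ts: int) -> str:
    """Decide what the requesting transaction does when the lock is taken."""
    if requester_start_ts < holder_start_ts:
        # The requester is older: it is allowed to abort ("wound") the
        # younger transaction that holds the lock.
        return "abort_holder"
    # The requester is younger: it simply waits for the older holder.
    return "wait"

# TX_1 (started at 10) may abort TX_2 (started at 20), which holds the lock.
assert resolve_lock_conflict(10, 20) == "abort_holder"
# A younger requester (started at 30) waits for the older holder (20).
assert resolve_lock_conflict(30, 20) == "wait"
```

Because waits only ever go from younger transactions to older ones, no cycle of waiting transactions can form, which is what rules out deadlocks.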

Checking whether a replica is up-to-date or not

Spanner needs a way to know if a replica is up-to-date enough to satisfy a read operation. For this reason, each replica tracks a value called safe time ($t_{safe}$).

Safe time ($t_{safe}$)

$t_{safe}$ is the maximum timestamp at which the replica is up-to-date. Thus, a replica can satisfy a read at a timestamp $t$ if $t \leq t_{safe}$.

This value is calculated as:

$t_{safe} = \min(t_{safe}^{Paxos}, t_{safe}^{TM})$

$t_{safe}^{Paxos}$ is the timestamp of the highest-applied Paxos write at a replica group. It represents the high-water mark below which writes will no longer occur with respect to Paxos.

$t_{safe}^{TM}$ is calculated as $\min_i(s_{i,g}^{prepare})$ over all transactions $T_i$ prepared (but not committed yet) at replica group $g$. If there are no such transactions, then $t_{safe}^{TM} = +\infty$.
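
The following sketch simply restates these two formulas in code; it is illustrative only, and the inputs (the highest applied Paxos timestamp and the prepare timestamps of pending transactions) are assumed to be available.

```python
# Illustrative computation of a replica's safe time (not actual Spanner code).
import math

def safe_time(highest_applied_paxos_ts: float,
              prepare_timestamps_of_pending_txns: list[float]) -> float:
    # t_safe^Paxos: timestamp of the highest-applied Paxos write.
    t_safe_paxos = highest_applied_paxos_ts

    # t_safe^TM: minimum prepare timestamp over transactions prepared but not
    # yet committed at this replica group; +infinity if there are none.
    t_safe_tm = (min(prepare_timestamps_of_pending_txns)
                 if prepare_timestamps_of_pending_txns else math.inf)

    return min(t_safe_paxos, t_safe_tm)

def can_serve_read(t_read: float, t_safe: float) -> bool:
    # A replica can satisfy a read at t_read only if t_read <= t_safe.
    return t_read <= t_safe

# No pending prepared transactions: t_safe is the Paxos high-water mark.
assert safe_time(100.0, []) == 100.0
# A transaction prepared at 90 (not yet committed) pulls t_safe down to 90.
assert safe_time(100.0, [90.0, 120.0]) == 90.0
assert can_serve_read(85.0, 90.0) and not can_serve_read(95.0, 90.0)
```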

Read-only transactions

Read-only transactions allow a client to perform multiple reads at the same timestamp, and these operations are also guaranteed to be strictly serializable.

An interesting property

An interesting property of read-only transactions is that they do not need to hold any locks, and so they do not block other transactions. The reason is that these transactions perform reads at a specific timestamp, which is selected in such a way as to guarantee that any concurrent or future write operations will update data at a later timestamp.

The timestamp is selected at the beginning of the transaction as TT.now().latest, and it’s used for all the read operations that are executed as part of this transaction.
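
On the client side, a read-only transaction might look like the sketch below, assuming the google-cloud-spanner Python client; all reads inside the snapshot share the same timestamp and take no locks. The table and column names are hypothetical.

```python
# Sketch of a read-only transaction, assuming the google-cloud-spanner
# Python client. Table and column names are hypothetical.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

# multi_use=True allows several reads at the same timestamp.
with database.snapshot(multi_use=True) as snapshot:
    accounts = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))
    transfers = list(snapshot.execute_sql("SELECT TransferId, Amount FROM Transfers"))
    # Both result sets reflect the database at the same logical timestamp,
    # without blocking any concurrent read-write transactions.
    print(len(accounts), "accounts,", len(transfers), "transfers")
```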

In general, the read operations at timestamp $t_{read}$ can be served by any replica $g$ that is up to date, which means $t_{read} \leq t_{safe,g}$.

More specifically, one of three cases applies (see the sketch after this list):

  • In some cases, a replica can be certain via its internal state and TrueTime that it is up to date enough to serve the read and does so.
  • In some other cases, a replica might not be sure if it has seen the latest data. It can then ask the leader of its group for the timestamp of the last transaction it needs to apply in order to serve the read.
  • If the replica is the leader itself, it can proceed directly since it is always up to date.
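
These three cases amount to a small decision procedure, sketched below for illustration only (not Spanner’s actual code).

```python
# Illustrative decision procedure for serving a read at t_read (not actual
# Spanner code).

def read_serving_decision(t_read: float, t_safe: float, is_leader: bool) -> str:
    if is_leader:
        # The leader is always up to date and can serve directly.
        return "serve"
    if t_read <= t_safe:
        # The replica can prove it is up to date enough for this timestamp.
        return "serve"
    # Otherwise the replica asks its leader for the timestamp of the last
    # write it needs to apply, catches up, and only then serves the read.
    return "catch_up_with_leader_then_serve"

assert read_serving_decision(95.0, t_safe=100.0, is_leader=False) == "serve"
assert read_serving_decision(105.0, t_safe=100.0, is_leader=False) == "catch_up_with_leader_then_serve"
assert read_serving_decision(105.0, t_safe=100.0, is_leader=True) == "serve"
```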

Standalone reads

Spanner also supports standalone reads outside the context of transactions.

Standalone reads do not differ much from the read operations performed as part of read-only transactions; their execution follows the same logic, using a specific timestamp.

Standalone reads can be either strong or stale.

Strong read

A strong read is a read at a current timestamp and is guaranteed to see all the data committed up until the start of the read.

Stale read

A stale read is a read at a timestamp in the past. That timestamp can be provided by the application or calculated by Spanner based on a specified upper bound on staleness. A stale read is expected to have lower latency at the cost of potentially stale data, since it is less likely that the replica will need to wait before serving the request.
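
With the Python client, the difference between the two comes down to how the snapshot is configured, as in the sketch below; the staleness values and table names are hypothetical.

```python
# Sketch of standalone strong and stale reads, assuming the
# google-cloud-spanner Python client. Table names are hypothetical.
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

# Strong read: served at a current timestamp, sees everything committed
# before the read started.
with database.snapshot() as snapshot:
    rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))

# Stale read with an application-provided staleness (exactly 15 seconds old).
with database.snapshot(exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
    rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))

# Stale read with an upper bound on staleness: Spanner chooses the newest
# timestamp within the bound that the replica can serve without waiting.
with database.snapshot(max_staleness=datetime.timedelta(seconds=15)) as snapshot:
    rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))
```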

Partitioned Data Manipulation Language

There is one more type of operation, called partitioned Data Manipulation Language (DML).

Partitioned DML allows a client to specify an update or delete operation in a declarative form, which Spanner then executes in parallel at each replica group (see the example after the list below). This parallelism and the associated data locality make these operations very efficient. However, this comes with some tradeoffs:

  • These operations need to be fully partitionable. This means they must be expressible as the union of a set of statements, where each statement accesses a single row of the table, and each statement accesses no other tables. Doing so ensures each replica group will execute the operation locally without any coordination with other replica groups.
  • Furthermore, these operations must be idempotent because Spanner might execute a statement multiple times against some groups due to network-level retries.
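
For illustration, a partitioned DML statement could be issued as in the sketch below, assuming the google-cloud-spanner Python client; the statement touches a single table, one row per match, and is idempotent, so it satisfies both constraints above. The table and column names are hypothetical.

```python
# Sketch of a partitioned DML operation, assuming the google-cloud-spanner
# Python client. The statement is fully partitionable (single table, one row
# per matching key) and idempotent, so re-execution on some groups is safe.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

# Executed independently, and in parallel, by each replica group.
row_count = database.execute_partitioned_dml(
    "UPDATE Accounts SET Inactive = TRUE WHERE Balance = 0"
)
print("Rows updated (lower bound):", row_count)
```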

Atomicity guarantee

Spanner does not provide atomicity guarantees for each statement across the entire table, but it does provide atomicity guarantees per group. This means that a statement might run against only some rows of the table, e.g., if the user cancels the operation midway or the execution fails in some splits due to constraint violations.
