GFS Consistency Model
Learn about the GFS consistency model for write operations.
GFS provides a custom consistency model for write operations.
The state of a file region after a mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
Note: A file region is consistent if all clients will always see the same data, regardless of the replica they read from.
A region within a file
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.
A region can be:
- Defined and consistent: When a mutation succeeds without interference from concurrent writes, the affected region is defined. All clients will always see what the mutation has written.
- Undefined but consistent: Concurrent successful mutations leave the region undefined but consistent. All the clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations.
- Both inconsistent and undefined: A failed mutation makes the region inconsistent (and hence also undefined). Different clients may see different data at different times.
The following illustration shows these differences:
GFS’s extra mutation operation
Besides regular writes, GFS also provides an extra mutation operation called record appends.
Record appends
A record append causes data to be appended atomically at least once, even in the presence of concurrent mutations, but at an offset of GFS's choosing, which is returned to the client.
Clients are supposed to retry failed record appends, and GFS guarantees that each replica will contain the data of the operation as an atomic unit at least once, at the same offset.
However, GFS may insert padding or record duplicates in between. As a result, successful record appends create defined regions interspersed with inconsistent regions.
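The retry behavior described above can be sketched as a client-side loop. Note that `client.record_append` is a hypothetical stand-in for a GFS client-library call, not an actual API; on success it returns the offset GFS chose, and each failed attempt may leave padding or a duplicate record at some replicas, which readers must tolerate.

```python
import time

def append_with_retry(client, path, data, max_retries=5):
    """Retry a record append until GFS reports success.

    `client.record_append` is a hypothetical stand-in for the GFS
    client call; GFS picks the offset and returns it to the client.
    """
    for attempt in range(max_retries):
        try:
            return client.record_append(path, data)
        except IOError:
            # A failed attempt may have left padding or a partial
            # record at some replicas; back off briefly and retry.
            time.sleep(0.1 * (2 ** attempt))
    raise IOError("record append failed after %d retries" % max_retries)
```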
The following table summarizes the GFS consistency model, showing the state of a file region after a write or a record append:
Summary of GFS consistency model
| | Write | Record Append |
| --- | --- | --- |
| Serial success | defined | defined interspersed with inconsistent |
| Concurrent success | consistent but undefined | defined interspersed with inconsistent |
| Failure | inconsistent | inconsistent |
How applications accommodate the GFS consistency model
Applications can accommodate this relaxed consistency model of GFS by applying a few simple techniques at the application layer:
- Using appends rather than overwrites
- Checkpointing
- Writing self-validating and self-identifying records
Appending is far more efficient and more resilient to application failures than random writes.
Each record prepared by a writer can contain extra information like checksums so that its validity can be verified. A reader can then identify and discard extra padding and record fragments using these checksums.
Note: If occasional duplicates are not acceptable, e.g., if they could trigger non-idempotent operations, the reader can filter them out using unique record identifiers selected and persisted by the writer.
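The last two techniques can be sketched as follows. The record layout here (magic marker, unique record ID, length, CRC-32 checksum) is illustrative, not GFS's actual format: the writer frames each payload so it is self-validating and self-identifying, and the reader skips padding and corrupt fragments, then filters duplicates by record ID.

```python
import struct
import uuid
import zlib

MAGIC = 0xC0FFEE42  # hypothetical marker distinguishing records from padding
_HEADER = ">I16sII"  # magic | record id | payload length | crc32

def pack_record(payload: bytes) -> bytes:
    """Frame a payload as a self-validating, self-identifying record."""
    rid = uuid.uuid4().bytes  # unique id lets readers drop duplicates
    return struct.pack(_HEADER, MAGIC, rid, len(payload),
                       zlib.crc32(payload)) + payload

def read_records(region: bytes):
    """Yield valid, de-duplicated payloads from a region that may contain
    padding, corrupt fragments, and duplicated records."""
    seen = set()
    i = 0
    hdr = struct.calcsize(_HEADER)
    while i + hdr <= len(region):
        magic, rid, length, crc = struct.unpack_from(_HEADER, region, i)
        if magic != MAGIC:
            i += 1  # padding or a torn fragment: resynchronize byte by byte
            continue
        payload = region[i + hdr:i + hdr + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if rid not in seen:  # drop duplicates left by retried appends
                seen.add(rid)
                yield payload
            i += hdr + length
        else:
            i += 1  # checksum mismatch: treat as an inconsistent fragment
```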
Mutation operation in HDFS
HDFS takes a slightly different path to simplify the semantics of mutating operations.
- HDFS supports only a single writer at a time.
- It supports only append (and not overwrite) operations.
- It also does not provide a record append operation, since there are no concurrent writes.
- It handles partial failures in the replication pipeline a bit differently, removing failed nodes from the replica set completely to ensure file content is the same in all replicas.
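The last point can be illustrated with a small sketch. This is a deliberate simplification (the real HDFS pipeline chains writes through the datanodes rather than fanning them out from the client, and the node objects are hypothetical), but it shows how dropping failed nodes keeps the surviving replicas identical.

```python
def replicate(pipeline, data):
    """Write `data` to every node, removing failed nodes from the
    replica set entirely so all surviving replicas hold the same
    content. Fan-out simplification of HDFS's chained pipeline;
    `node.write` is a hypothetical datanode call.
    """
    surviving = []
    for node in pipeline:
        try:
            node.write(data)
            surviving.append(node)
        except IOError:
            pass  # drop the failed node rather than leave a partial replica
    return surviving
```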
GFS and HDFS summarised
Both GFS and HDFS provide applications with information about where a region of a file is stored. This enables applications to schedule processing jobs on the nodes that store the associated data, minimizing network congestion and improving the system's overall throughput. This principle is also known as moving computation to the data.