Zookeeper
Let’s explore Zookeeper’s API and how it can be used for coordination purposes.
Zookeeper’s API is essentially a hierarchical namespace similar to a filesystem.
Chubby also provides a hierarchical namespace, while etcd provides a key-value interface.
In the illustration below, we can see:
- Every name is a sequence of path elements separated by a slash (/).
- Every name represents a data node (called znode), which can contain a piece of metadata and children nodes.
For example, the node “/a/b” is considered a child of the node “/a”.
Zookeeper’s API operations
Zookeeper’s API provides basic operations to create and delete nodes, check whether a specific node exists, list the children of a node, and read or set the data of a node.
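As an illustration, these operations roughly correspond to the following calls in Zookeeper’s official Java client. This is only a minimal sketch: the connection string, session timeout, paths, and data below are hypothetical placeholders.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class BasicOperations {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Create nodes with some data.
        zk.create("/a", "metadata".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/a/b", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Check whether a specific node exists (returns null if it does not).
        Stat stat = zk.exists("/a/b", false);

        // List the children of a node.
        List<String> children = zk.getChildren("/a", false);

        // Read and set the data of a node (-1 means "any version").
        byte[] data = zk.getData("/a", false, null);
        zk.setData("/a", "new metadata".getBytes(), -1);

        // Delete a node (-1 means "any version").
        zk.delete("/a/b", -1);

        zk.close();
    }
}
```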
Types of nodes
There are two types of znodes: regular nodes and ephemeral nodes.
- Regular nodes are created and deleted explicitly by the clients.
- Ephemeral nodes are removed by the system when the session that created them expires (e.g., due to a failure).
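For illustration, a client could create the two kinds of znodes as follows, reusing the zk handle from the sketch above; the paths are hypothetical and their parent nodes are assumed to already exist.

```java
// A regular (persistent) node: lives until a client explicitly deletes it.
zk.create("/app/config", "v1".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

// An ephemeral node: removed automatically when the creating session expires.
zk.create("/app/workers/worker-1", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
```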
Zookeeper’s API benefits
When a client creates a new node, it can set a sequential flag. Nodes created with this flag have the value of a monotonically increasing counter appended to a provided prefix.
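A minimal sketch of the sequential flag with the Java client, again reusing the zk handle and hypothetical paths whose parents are assumed to exist. The server appends the counter to the provided prefix and returns the actual name it created.

```java
// Creating "/tasks/task-" with a sequential mode yields names such as
// "/tasks/task-0000000001", "/tasks/task-0000000002", and so on.
String first = zk.create("/tasks/task-", new byte[0],
        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

// The sequential flag can also be combined with the ephemeral type, a common
// building block for leader election and distributed locks.
String member = zk.create("/election/member-", new byte[0],
        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
```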
Zookeeper also provides an API that allows clients to receive notifications for changes without polling, called watches.
On read operations, clients can set a watch flag, so that they are notified by the system when the information returned has changed.
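A sketch of registering a watch during a read, reusing the zk handle and a hypothetical path. Watches fire only once, so a client that wants continuous notifications typically re-registers the watch when it is triggered.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Read the data of "/app/config" and register a watch in the same call.
Watcher watcher = new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        // Fired once when the node changes (e.g., NodeDataChanged or NodeDeleted).
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            System.out.println("Data changed for " + event.getPath());
            // Watches are one-shot: read again with a new watch to keep observing.
        }
    }
};
byte[] config = zk.getData("/app/config", watcher, null);
```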
A client connects to Zookeeper by initiating a session, which must be kept alive by sending heartbeats to the associated server. If a Zookeeper server does not receive anything from a client for more than a specified timeout, it considers the client faulty and terminates the session. This deletes the ephemeral nodes associated with that session and unregisters any watches registered via it.
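The following sketch shows how a session might be established and how its expiry can be observed; the connection string and the 10-second session timeout are arbitrary example values. The client library sends the heartbeats automatically while the handle is open.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

public class SessionExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The session timeout bounds how long the client can stay silent
        // before the server declares it faulty and terminates the session.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, (WatchedEvent event) -> {
            switch (event.getState()) {
                case SyncConnected:
                    connected.countDown();
                    break;
                case Expired:
                    // Session expired: ephemeral nodes and watches from this
                    // session are gone; a new handle must be created.
                    System.out.println("Session expired");
                    break;
                default:
                    break;
            }
        });

        connected.await();
        System.out.println("Session id: " + zk.getSessionId());
        zk.close();
    }
}
```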
Update operations can take an expected version number, which enables conditional updates that can be used to resolve conflicts arising from concurrent update requests.
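A sketch of a conditional update based on the expected version, reusing the zk handle and a hypothetical path: the write succeeds only if the node’s version has not changed since it was read.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Stat;

Stat stat = new Stat();
byte[] current = zk.getData("/app/config", false, stat);

try {
    // Succeeds only if nobody has modified the node since we read it at
    // stat.getVersion(); otherwise a BadVersionException is thrown.
    zk.setData("/app/config", "v2".getBytes(), stat.getVersion());
} catch (KeeperException.BadVersionException e) {
    // A concurrent update won; re-read and retry if appropriate.
}
```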
Zookeeper ensemble
Zookeeper nodes form a cluster, called a Zookeeper ensemble. One of the nodes is designated as the leader and the rest of the nodes are the followers.
Zookeeper protocol
Zookeeper uses a custom atomic broadcast protocol, called the Zookeeper atomic broadcast (Zab) protocol.
This protocol is used to elect the leader and replicate write operations to the followers.
Note: Chubby uses Paxos protocol for this purpose, while etcd uses Raft protocol.
Each of the nodes has a copy of the Zookeeper state in memory. Changes are also recorded in a durable, write-ahead log, which can be used for recovery.
All the nodes can serve read requests using their local database.
Followers have to forward any write requests to the leader node, wait until the request has been broadcast and successfully replicated, and then respond to the client.
Reads can be served locally without any communication between nodes, so they are extremely fast.
Zookeeper’s sync operation
A follower node might be lagging behind the leader node, so client reads might not necessarily reflect the latest performed write, thus not providing linearizability. For this reason, Zookeeper provides an additional operation called sync that clients can use to reduce the possibility of stale data.
This operation will be directed to the current leader, who will put it at the tail of the queue that contains pending requests to be sent to the corresponding follower.
The follower will wait for the completion of any pending sync operations before replying to subsequent read operations.
It is important to note that the sync operation does not need to go through the broadcast protocol and thus reach a majority quorum; it is just placed at the end of the leader’s queue and forwarded only to the associated follower.
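A sketch of the resulting sync-then-read pattern with the Java client, reusing the zk handle and hypothetical paths. The sync call is asynchronous, so the read is issued only after its callback has fired.

```java
import java.util.concurrent.CountDownLatch;

CountDownLatch synced = new CountDownLatch(1);

// Ask the follower we are connected to to catch up with the leader before we
// read, reducing (but not eliminating) the chance of reading stale data.
zk.sync("/app", (rc, path, ctx) -> synced.countDown(), null);
synced.await();

byte[] data = zk.getData("/app/config", false, null);
```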
Note: In contrast, in Chubby, both read and write requests are directed to the leader. This has the benefit of increased consistency but the downside of decreased throughput. To mitigate this, Chubby clients cache extensively and the leader is responsible for invalidating the caches before completing writes. This makes the system a bit more sensitive to client failures.
Because the sync operation does not go through the broadcast protocol, read operations are not linearizable with respect to write operations, even if a sync operation is performed first.