Distributed mutual exclusion gave us a way to coordinate access to shared resources in a system without shared memory. The centralized algorithm, where one coordinator process grants and revokes locks, is direct and efficient. It requires only three messages per lock acquisition, and the coordinator has a complete view of who holds what. The problem is that the coordinator is a single point of failure. If it crashes, the entire system stalls waiting for locks that will never be granted.
The obvious solution is to replicate the coordinator. That introduces a new problem: if multiple coordinator replicas are running, they must all agree on who holds which locks. That agreement problem is consensus.
A coordination service is a replicated coordinator made fault-tolerant through consensus. It is a small, highly available, strongly consistent store that distributed applications use to share information and coordinate operations. The use cases are narrow but critical: who is the current leader, what is the latest configuration, which lock is held, which servers are alive.
Getting the wrong answer to any of these questions can be catastrophic. If two nodes both think they are the leader, they will independently accept writes and the system state will diverge. If a node reads a stale configuration, it may route requests to a server that no longer exists. Strong consistency is not optional here, and strong consistency across failures requires consensus.
Chubby, ZooKeeper, and etcd are the three coordination services covered in this topic. All three use consensus under the hood for exactly this reason.
Google Chubby
Chubby was designed at Google and first described publicly in a 2006 paper by Mike Burrows. The Google File System needed a way to elect a primary master, Bigtable needed a way to coordinate the assignment of tablets (contiguous pieces of a large table), and MapReduce needed a way to elect a master job tracker. Each of these systems could have implemented its own ad-hoc coordination mechanism, but that would be fragile and hard to reason about. Chubby was built to give all of them a shared, well-engineered foundation.
The design goal was a highly available and persistent lock service and configuration store for large-scale distributed systems. Chubby was expected to be a dependency of nearly every major system at Google, which meant its failure would cascade broadly. High availability was therefore the top priority.
Architecture
A Chubby deployment is called a Chubby cell. By default, a cell consists of five servers called replicas. One replica is elected the master and serves all client requests. The other four are for fault tolerance: they participate in consensus to keep the replicated log consistent, but they do not serve the authoritative read/write workload. If a client contacts a non-master replica, it replies with the identity of the current master.
Paxos is the consensus algorithm used to replicate state across the five servers and to elect a new master when the current one fails. A majority of replicas (three out of five) must be alive for the cell to function. The ability to tolerate two simultaneous failures is a practical choice: in a large data center environment, two concurrent failures are not unusual, but three simultaneous failures are rare.
Chubby typically deploys one cell per data center. Clients within that data center contact their local cell. Every few hours, the entire cell database is also backed up to GFS (the Google File System) to protect against catastrophic loss of all replicas in the cell.
The File System Interface
Although Chubby is described as a lock service, it exposes a file system interface. Everything in Chubby is a named node in a hierarchical namespace of files and directories.
A lock is a file, and its name is its hierarchical path. Any node may hold data, may have children (think of a directory that can also hold data), and may have a lock associated with it.
Using a file namespace avoided the need to build a separate naming scheme on top of the lock service and gave applications a convenient place to store small amounts of associated data, such as the address of the current master.
The interface is not a standard POSIX file system. There is no kernel module; client software talks to the Chubby master via Google’s internal RPC system (Stubby, which later inspired the design of gRPC). File operations are intentionally limited. Files can only be read or written in their entirety. There are no byte-range reads or writes, and no seek operation.
When a client opens a file, it downloads the current contents and establishes a lease for that file. The server tracks which clients have cached copies and uses a write-through model: when a client writes a file, it sends the update to the master, which then sends cache invalidations to all other clients that have a cached copy. Combined with lease validity, this ensures that a client’s cached data is never stale as long as its lease is current. Either the server has told it the data changed, or the lease has expired and the client must revalidate.
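The write-through invalidation protocol above can be sketched with a small in-memory model. This is an illustration of the idea only, not Chubby's actual API; the `Master` and `Client` classes and their method names are hypothetical.

```python
# Sketch of Chubby-style write-through caching with invalidation.
# The class and method names here are illustrative, not Chubby's API.

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # path -> cached contents

    def invalidate(self, path):
        # The master tells us our copy is stale; drop it.
        self.cache.pop(path, None)

class Master:
    def __init__(self):
        self.files = {}          # path -> current contents
        self.cachers = {}        # path -> set of clients holding cached copies

    def read(self, client, path):
        data = self.files.get(path)
        client.cache[path] = data                      # client now caches the file
        self.cachers.setdefault(path, set()).add(client)
        return data

    def write(self, writer, path, data):
        self.files[path] = data
        # Invalidate every other client's cached copy before acknowledging.
        for c in self.cachers.get(path, set()):
            if c is not writer:
                c.invalidate(path)

m = Master()
a, b = Client("a"), Client("b")
m.write(a, "/ls/cell/config", "v1")
assert m.read(b, "/ls/cell/config") == "v1"   # b caches v1
m.write(a, "/ls/cell/config", "v2")
assert "/ls/cell/config" not in b.cache       # b's cached copy was invalidated
```

The key invariant is that a client's cache entry exists only if no newer write has been acknowledged, which is what lets Chubby clients trust their caches while their lease is valid.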
Locks
Locks in Chubby are advisory, not mandatory. A process can hold a lock and other processes can still access the underlying data, but well-behaved processes check for lock ownership before proceeding. Locks can be held in two modes: exclusive (one writer) and shared (multiple readers).
Chubby is designed for coarse-grained locking. A coarse-grained lock controls a large resource, such as an entire Bigtable table or a GFS master, and may be held for hours or days. This differs from a fine-grained lock that might be held for milliseconds to protect a single row in a database. The architectural consequence is that a service optimized for coarse-grained locks can serve many more clients, because lock operations are infrequent relative to the work being protected.
Events and Watches
Clients can subscribe to events for any open file or directory. Event types include: a file’s contents were modified, a new file or subdirectory was created, and a lock was acquired. This lets services avoid polling. Instead of checking every few seconds whether they are still the leader, a client waits for a callback from Chubby telling it that the lock state has changed.
Leases
Chubby uses leases to manage the relationship between the master and clients. When a client acquires a lock or opens a file, it receives a time-bounded lease. The client must renew this lease periodically. If the client fails to renew before the lease expires, the server considers the client dead and revokes the lease.
Leases create a problem when the master itself fails. A new master is elected via Paxos and has access to the replicated state from the previous master, so it knows what sessions and locks existed. It goes through a recovery protocol: it broadcasts a new master epoch to clients, gives them a grace period to reconnect and re-establish their sessions, and only after that grace period begins serving new requests. Clients that fail to reconnect within the grace period have their sessions and locks released. This ensures a clean handoff without ambiguity about which lock grants are still valid.
Chubby as a Building Block
The most common pattern for using Chubby is leader election. If a group of processes wants to elect a leader, each process opens the same Chubby file and attempts to acquire an exclusive lock on it. Exactly one will succeed, and that process becomes the leader. The leader can store its address in the file so that other processes know how to contact it. When the leader fails, its lease expires, and the lock is released. Other processes can then compete for it again.
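The election pattern can be modeled with a toy exclusive lock: whichever process acquires it first becomes leader and stores its address, and when the leader's lease expires the lock frees up for the others. All names here are hypothetical illustrations, not Chubby's interface.

```python
# Toy model of Chubby-style leader election via an exclusive lock file.
# Class and method names are illustrative only.

class LockFile:
    def __init__(self):
        self.holder = None
        self.data = None         # the leader advertises its address here

    def try_acquire(self, who, address):
        if self.holder is None:
            self.holder = who
            self.data = address
            return True
        return False

    def release(self):
        # Models the lease expiring after the leader fails.
        self.holder = None

f = LockFile()
assert f.try_acquire("p1", "10.0.0.1:80") is True    # exactly one winner
assert f.try_acquire("p2", "10.0.0.2:80") is False   # p2 reads f.data instead
assert f.data == "10.0.0.1:80"
f.release()                                          # p1 dies; lease expires
assert f.try_acquire("p2", "10.0.0.2:80") is True    # p2 takes over
```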
Because Chubby cells are small and serve thousands of clients, all data is stored in memory at the master. For durability, all writes are committed to disk and replicated across the replicas in the cell via Paxos before acknowledging success.
Apache ZooKeeper
ZooKeeper was developed at Yahoo! and contributed to the Apache Software Foundation as an open-source project in 2008. The Chubby paper had been published in 2006 and had significant influence, but Chubby was Google’s internal system and not available to anyone outside Google. Yahoo! was running large-scale systems of its own, including Hadoop and HBase, that needed the same kind of fault-tolerant coordination. ZooKeeper was built as an open-source coordination kernel to fill that role.
ZooKeeper does not simply clone Chubby’s interface. The most important architectural difference is that ZooKeeper does not provide locks as a primitive. Instead, it provides a minimal set of primitives from which locks, leader election, barriers, and other coordination patterns can be built. The philosophy is that a coordination kernel should be as small and general as possible, and that providing locks directly would force design choices onto applications that might not need them.
Data Model
ZooKeeper organizes data as a hierarchical tree of nodes called znodes. Each znode has a path (like a file system path), can hold a small amount of data (a few kilobytes), and can have children. Two types of znodes are most relevant for coordination:
Persistent znodes survive client disconnections. They remain until explicitly deleted.
Ephemeral znodes are automatically deleted when the client session that created them ends. This is the key mechanism for detecting failures: if a process creates an ephemeral znode to signal its presence and then crashes, the znode disappears. Other processes watching that znode are notified.
Either type of znode may be created as a sequential znode. When creating a sequential znode, ZooKeeper automatically appends a monotonically increasing integer to the name. This is essential for implementing distributed locks without thundering-herd problems. Sequential is not a separate node type but a creation-mode flag: PERSISTENT_SEQUENTIAL or EPHEMERAL_SEQUENTIAL.
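The naming behavior is easy to model: ZooKeeper keeps a per-parent counter and appends it as a zero-padded ten-digit suffix. A minimal sketch (the `SequentialNamer` class is an illustration, not ZooKeeper's implementation):

```python
# Sketch of ZooKeeper's sequential naming: a per-parent counter is
# appended to the requested name as a 10-digit zero-padded suffix.

class SequentialNamer:
    def __init__(self):
        self.counter = 0     # one counter per parent znode

    def create(self, prefix):
        name = f"{prefix}{self.counter:010d}"
        self.counter += 1
        return name

n = SequentialNamer()
assert n.create("/locks/my-lock/lock-") == "/locks/my-lock/lock-0000000000"
assert n.create("/locks/my-lock/lock-") == "/locks/my-lock/lock-0000000001"
```

Because the suffixes are zero-padded, plain lexicographic comparison of names orders znodes by arrival, which is what the lock recipe later in this section relies on.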
Watches
A ZooKeeper watch is a one-shot notification mechanism. A client sets a watch when it asks ZooKeeper about a znode. Later, if the relevant state changes, ZooKeeper sends the client an event. After the watch triggers, it is removed and must be set again if the client wants continued monitoring.
You set watches via:

- exists(path, watch=true): fires if the node is created, deleted, or its data changes (useful to watch for creation of a node that does not yet exist).
- getData(path, watch=true): fires if the node's data changes or the node is deleted (the node must exist when the watch is set).
- getChildren(path, watch=true): fires if the node's immediate children list changes (a child is added or removed) or the node is deleted (the node must exist when the watch is set).
Watches are intentionally one-shot. ZooKeeper avoids keeping long-lived subscriptions and pushes complexity to the client: when a watch fires, the client typically re-reads the znode state and re-registers the watch. This pattern helps keep the client’s view consistent even if multiple changes occur quickly or while the client is temporarily disconnected, because the client treats the event as “something changed” and then refreshes state from ZooKeeper.
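The re-read-and-re-register loop can be sketched with a minimal in-memory znode. The `Znode` class below is an assumption for illustration; the one-shot semantics (fire once, then clear) are the part that mirrors ZooKeeper.

```python
# Sketch of one-shot watches: an event only means "something changed",
# so the client re-reads the state and re-registers the watch each time.

class Znode:
    def __init__(self, data):
        self.data = data
        self.watchers = []                 # one-shot callbacks

    def get_data(self, watch=None):        # read, optionally setting a watch
        if watch is not None:
            self.watchers.append(watch)
        return self.data

    def set_data(self, data):
        self.data = data
        pending, self.watchers = self.watchers, []   # fire once, then clear
        for w in pending:
            w()

seen = []
node = Znode("v1")

def on_change():
    # Re-read current state and re-register the watch in one step.
    seen.append(node.get_data(watch=on_change))

node.get_data(watch=on_change)   # initial read sets the first watch
node.set_data("v2")
node.set_data("v3")
assert seen == ["v2", "v3"]      # every change was observed
```

Note that if the client skipped re-registering inside the callback, it would have seen "v2" and then missed "v3" entirely; this is why the loop is idiomatic.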
Consistency Model
ZooKeeper replicates state with a consensus protocol called Zab (ZooKeeper Atomic Broadcast). Like Raft, Zab elects a leader and replicates writes through the leader. All writes go through the leader and are applied in order across all replicas. This gives ZooKeeper linearizable writes: every write completes in a globally consistent order.
Reads are different. By default, a client can read from any ZooKeeper replica, not just the leader. A read might therefore return slightly stale data if the replica has not yet applied the latest writes. Reads are sequentially consistent: each client sees writes in order, but a follower may not have applied the leader’s latest writes.
Clients that need fresher data can issue a sync operation, which forces the server handling that client session to catch up to the leader’s committed state (as of when the sync is processed) before the read proceeds.
This is a deliberate tradeoff. Most coordination reads, such as checking a configuration value or watching for leader changes, can tolerate brief staleness. The rarer cases that need strict freshness pay the extra cost of a sync.
Building Locks with ZooKeeper
Because ZooKeeper provides no lock primitive, locks are constructed from the building blocks above. The standard recipe is:

- To acquire the lock, create an ephemeral sequential znode under a lock directory (for example, /locks/my-lock/lock-0000000042).
- List all children of /locks/my-lock. If your znode has the lowest sequence number, you hold the lock.
- If not, watch the znode with the next-lowest sequence number below yours. When it is deleted (because that client released the lock or crashed), re-evaluate.
The sequential znode ensures that locks are granted in arrival order. Watching the predecessor rather than all children prevents the thundering herd problem: when a lock is released, only one waiter is notified rather than all of them.
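The recipe above can be simulated with an in-memory lock directory. This is a sketch of the logic, not the real ZooKeeper (or kazoo) API; the `LockDir` class and its methods are assumptions for illustration.

```python
# In-memory sketch of the ZooKeeper lock recipe: ephemeral sequential
# children under a lock directory; the lowest sequence number holds
# the lock, and each waiter watches only its predecessor.

class LockDir:
    def __init__(self):
        self.counter = 0
        self.children = {}                # name -> owning session

    def create_sequential_ephemeral(self, owner):
        name = f"lock-{self.counter:010d}"
        self.counter += 1
        self.children[name] = owner
        return name

    def holder(self):
        # Zero-padded names sort by arrival order, so min() is the holder.
        return min(self.children) if self.children else None

    def predecessor_of(self, name):
        # The single znode a waiter should watch: next-lowest below its own.
        lower = [c for c in self.children if c < name]
        return max(lower) if lower else None

    def delete(self, name):               # explicit release, or session death
        self.children.pop(name, None)

d = LockDir()
a = d.create_sequential_ephemeral("client-a")
b = d.create_sequential_ephemeral("client-b")
c = d.create_sequential_ephemeral("client-c")
assert d.holder() == a                # a arrived first, so a holds the lock
assert d.predecessor_of(c) == b       # c watches b, not a: no thundering herd
d.delete(a)                           # a releases (or crashes)
assert d.holder() == b                # only b was watching a, and b now holds it
```

When `a` disappears, only `b` is notified; `c` keeps watching `b` undisturbed, which is exactly the herd-avoidance property the text describes.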
ZooKeeper and Chubby Compared
The conceptual difference is design philosophy. Chubby is higher-level: it provides locks, events, and a file store, all integrated. ZooKeeper is a toolkit that provides the minimal primitives needed to build those things.
Both use consensus to replicate state. Both have lease-like session semantics. Both support watches and events. The coordination patterns you build on top of either system look much the same. ZooKeeper’s advantages over Chubby were that it was open-source, available outside Google, and designed from the start as a general coordination primitive rather than a lock service with a configuration store attached.
etcd
etcd was created in 2013 by CoreOS [1] as part of the infrastructure for their container-centric operating system. The immediate need was storing cluster configuration for CoreOS machines. ZooKeeper was available, but it requires a JVM, has a complex operational model, and its API dates from an era before RESTful services were ubiquitous. CoreOS wanted something simpler to deploy and operate, with an HTTP/JSON API that any language could talk to without a special client library.
etcd quickly became the authoritative store for Kubernetes cluster state. Every Kubernetes object, including pods, services, secrets, and configuration maps, is stored in etcd. If etcd fails, the Kubernetes control plane cannot function.
Architecture and Consistency
etcd uses the Raft consensus algorithm. Raft’s log-based replication maps cleanly onto etcd’s key-value model, and Raft’s emphasis on understandability made it easier to reason about correctness during development.
Like ZooKeeper, etcd provides strong consistency. Unlike ZooKeeper, which routes reads to any replica by default, etcd routes reads through the leader (or performs a quorum read) so that they are linearizable without requiring a separate sync call. Stale reads from followers are available as an opt-in for workloads that can tolerate them in exchange for lower latency.
Data Model
etcd stores a flat key-value map rather than an explicit directory tree. This differs from ZooKeeper and Chubby, where the namespace is hierarchical. In ZooKeeper, a parent znode must exist before a child can be created, and clients can list a node’s children. In Chubby, paths are explicitly modeled as files and directories.
In etcd, a key is an arbitrary byte string. Applications often choose path-like key names such as /config/... or /services/..., but that naming convention is not enforced by etcd. There is no parent object to create, and there are no “child” objects in the data model. Instead, etcd provides two building blocks that let you treat a prefix as if it were a directory: range queries over a key interval (typically all keys with a given prefix) and watches over that same range.
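Treating a prefix as a directory is just a range scan over the flat key space. A minimal sketch over a plain dict (the `range_prefix` helper is an illustration, not etcd's API; real etcd computes a range end from the prefix bytes):

```python
# Sketch of etcd-style prefix queries over a flat key-value map.
# The helper name is illustrative; real etcd expresses this as a
# range query whose end key is the prefix with its last byte incremented.

def range_prefix(store, prefix):
    return sorted((k, v) for k, v in store.items() if k.startswith(prefix))

store = {
    "/services/payments/instance-1": "10.0.0.1:443",
    "/services/payments/instance-2": "10.0.0.2:443",
    "/config/flag": "on",
}
assert range_prefix(store, "/services/payments/") == [
    ("/services/payments/instance-1", "10.0.0.1:443"),
    ("/services/payments/instance-2", "10.0.0.2:443"),
]
```

There is no parent node anywhere in the store; the "directory" exists only as a naming convention that the range query exploits.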
etcd’s watch API is more capable than ZooKeeper’s. A watch in etcd can monitor a key or an entire key prefix, and it delivers a stream of change events rather than a single one-shot notification. This is more convenient for long-running watchers.
Leases
etcd supports leases with a mechanism very similar to ZooKeeper’s ephemeral znodes. A client creates a lease with a time-to-live (TTL), then associates keys with that lease. If the client stops renewing the lease (via heartbeats), all keys associated with the lease are automatically deleted. Services use this for presence detection: a healthy server maintains a lease and stores its address in etcd under that lease. If the server crashes, the lease expires and the key vanishes, and watchers are notified.
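The lease lifecycle can be modeled with simulated time: keys attached to a lease survive only as long as heartbeats keep renewing it. The `LeaseStore` class and its methods are assumptions for illustration, not etcd's API.

```python
# Sketch of etcd-style leases: keys attached to a lease are deleted
# automatically when the lease expires. Time is a simulated counter.

class LeaseStore:
    def __init__(self):
        self.now = 0
        self.keys = {}                 # key -> (value, lease_id)
        self.leases = {}               # lease_id -> expiry time

    def grant(self, lease_id, ttl):
        self.leases[lease_id] = self.now + ttl

    def put(self, key, value, lease_id):
        self.keys[key] = (value, lease_id)

    def keep_alive(self, lease_id, ttl):
        self.leases[lease_id] = self.now + ttl   # heartbeat pushes expiry out

    def tick(self, dt):
        self.now += dt
        expired = {l for l, t in self.leases.items() if t <= self.now}
        # Expiring a lease deletes every key attached to it.
        self.keys = {k: (v, l) for k, (v, l) in self.keys.items()
                     if l not in expired}
        for l in expired:
            del self.leases[l]

s = LeaseStore()
s.grant("lease-7", ttl=10)
s.put("/services/payments/instance-7", "10.0.0.7:443", "lease-7")
s.tick(8); s.keep_alive("lease-7", ttl=10)       # healthy server heartbeats
assert "/services/payments/instance-7" in s.keys
s.tick(11)                                       # server crashed: no renewal
assert "/services/payments/instance-7" not in s.keys
```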
Transactions
etcd supports multi-key transactions with a compare-and-swap structure. A transaction specifies a set of conditions (e.g., the version of a key is what I expect), a set of operations to apply if the conditions hold, and a fallback set of operations if they do not. This is used to implement distributed locks and leader election without race conditions.
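The compare/success/failure shape can be sketched with a single function. This models only the structure of an etcd transaction, under the simplifying assumption that "version 0" means the key is absent; the function and parameter names are illustrative, not etcd's API.

```python
# Sketch of etcd's transaction shape: a comparison guard, a success
# branch, and a failure branch, applied atomically. Names illustrative.

def txn(store, versions, key, expect_version, on_success, on_failure):
    # Apply on_success only if the key's version matches the guard,
    # otherwise apply on_failure; return which branch ran.
    ops = on_success if versions.get(key, 0) == expect_version else on_failure
    for k, v in ops:
        store[k] = v
        versions[k] = versions.get(k, 0) + 1
    return ops is on_success

store, versions = {}, {}
# Two clients race to create the leader key (version 0 == "absent").
won = txn(store, versions, "/leader", 0, [("/leader", "node-a")], [])
lost = txn(store, versions, "/leader", 0, [("/leader", "node-b")], [])
assert won is True and lost is False
assert store["/leader"] == "node-a"      # exactly one create succeeded
```

Because the guard and the update happen as one atomic operation at the server, there is no window for the check-then-act race that plagues naive lock implementations.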
etcd and ZooKeeper Compared
etcd replaced ZooKeeper in most new infrastructure projects for operational reasons, not correctness. Both are strongly consistent and use consensus internally. etcd uses Raft while ZooKeeper uses Zab, but both achieve equivalent safety and liveness guarantees.
The differences are about developer experience. etcd exposes a native HTTP/gRPC API that any language can talk to directly. ZooKeeper requires a dedicated client library and carries the operational overhead of the JVM. etcd’s watch API delivers a persistent stream of change events, while ZooKeeper’s watches are one-shot and must be re-registered after each notification. For teams building modern cloud infrastructure, etcd’s operational simplicity is the deciding factor.
Common Coordination Patterns
Whether you use Chubby, ZooKeeper, or etcd, the coordination patterns built on top of them are the same. The most important ones are below.
Leader Election
The scenario: you have N replicas of a service and exactly one must act as the primary at a time.
The shared idea across all three systems is that replicas contend for a well-known name in the coordination service. A replica becomes a leader only if it can acquire that name atomically, and its leadership remains valid only while it maintains a liveness condition (a session or a lease).
If the leader fails and its session or lease expires, the coordination service removes the leader’s claim, and the remaining replicas contend again. The mechanism varies: Chubby uses a lock in a file-system namespace, ZooKeeper typically uses ephemeral sequential znodes with predecessor watches, and etcd uses a key created under a TTL lease via an atomic transaction.
Distributed Locks
A distributed lock grants one process at a time exclusive access to a shared resource. The coordination service provides the serialization point: acquiring the lock is a write that goes through consensus, so it is globally ordered. Locks built on ephemeral nodes or leases are self-cleaning: a crashed lock holder’s lease expires and the lock is released automatically.
Configuration Management
Services store their configuration as values in the coordination service. When configuration changes, the update goes through consensus and is applied consistently across all replicas of the service. Clients watch the configuration keys and are notified when values change. This replaces the old model of modifying config files on each server individually.
Service Discovery
A running service instance registers its address by writing to the coordination service under a known prefix (e.g., /services/payments/instance-7), typically using an ephemeral key with a lease. Clients discover available instances by listing that prefix. Because ephemeral keys are deleted when the server fails, the list in the coordination service is always an accurate view of what is currently alive.
Fencing Tokens
Fencing is a subtle but important pattern. Consider a leader that acquires a lock and then experiences a stop-the-world garbage collection (GC) pause for thirty seconds. During the pause, its lease expires, a new leader acquires the lock, and then the old leader wakes up and tries to write to a shared resource, still believing it holds the lock.
The solution is a fencing token: a monotonically increasing number associated with each lock grant. Every time the lock is acquired (or re-acquired after a failure), the coordination service increments the token. The shared resource (a database, a storage server) is told to reject any request with a token lower than the highest it has seen. The old leader wakes up with a stale token and its writes are rejected. The new leader’s writes, with a higher token, succeed.
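The fencing check lives at the protected resource, not in the coordination service. A minimal sketch of the paused-leader scenario (both classes are hypothetical illustrations of the pattern):

```python
# Sketch of fencing tokens: the resource rejects any request whose
# token is lower than the highest token it has already seen.

class FencedResource:
    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def write(self, token, data):
        if token < self.highest_token:
            return False              # stale leader: reject the write
        self.highest_token = token
        self.writes.append(data)
        return True

class LockService:
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1               # token grows with every lock grant
        return self.token

locks, db = LockService(), FencedResource()
old = locks.acquire()                 # leader 1 gets token 1, then GC-pauses
new = locks.acquire()                 # lease expires; leader 2 gets token 2
assert db.write(new, "from-new") is True
assert db.write(old, "from-old") is False   # old leader wakes up, is fenced
assert db.writes == ["from-new"]
```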
Raft uses a closely related idea internally. Each election increments a term (an epoch number), and servers reject requests from leaders with older terms. This prevents an old leader from continuing to act as the leader within the Raft cluster. A fencing token applies the same monotonic-number idea to resources outside the consensus group: the database or storage system rejects requests from an old leader, even if that leader still believes it is in charge.
Fencing tokens are essential any time the leaseholder can be paused (by garbage collection, swap activity, or I/O delays) or partitioned from the network. A lock without a fencing mechanism provides only weak safety.
What Coordination Services Do Not Give You
The limits of coordination services deserve to be made explicit. They store small amounts of data and are not suitable for storing megabytes of application data.
They are built for small, coordination-oriented updates, not for heavy data ingestion. Coordination services are a good fit for writes such as “who is the leader,” “what configuration version is current,” or “which services are registered,” but they are a poor fit for logging, metrics, or any workload that involves a constant stream of large writes. For that reason, etcd guidance sizes a cluster based on expected request rate, database size, and latency goals, rather than giving one universal “writes per second” limit.
A useful rule of thumb is: if the data is on the critical path of every client request, it does not belong in a coordination service. They are not a replacement for a database or a message queue.
A coordination service also does not, on its own, make a system correct. A leader elected by ZooKeeper can still have bugs, can still crash mid-operation, and can still leave the application state in an inconsistent condition. The coordination service serializes leadership decisions; the application must handle the rest.
References

- Burrows, Mike. The Chubby lock service for loosely-coupled distributed systems. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
- Apache ZooKeeper: project page.
- etcd: project page.
- etcd on GitHub: source repository.
[1] CoreOS was a company and a family of container-focused Linux projects. CoreOS created etcd and Container Linux, and was acquired by Red Hat in 2018; its OS lineage continues as Fedora CoreOS and Red Hat Enterprise Linux CoreOS (used by OpenShift).