Consensus is the problem of getting a group of nodes to agree on a single value, even when some nodes crash or messages are delayed. It appears in leader election, transaction commit, log ordering, and configuration management. Every fault-tolerant distributed system is built on top of some form of consensus.
Without proper consensus, a network partition can cause both sides to elect their own leader and accept writes independently. This condition is called split-brain: the system ends up with two divergent versions of state that must be reconciled or rolled back when the partition heals.
A quorum is a decision threshold (most often a majority) chosen so that any two quorums share at least one member. That overlap is what prevents split-brain: two isolated groups cannot both form a majority. A cluster of n servers tolerates at most ⌊(n−1)/2⌋ failures and stops making progress rather than risk inconsistency when too many servers are unreachable.
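The arithmetic is worth making concrete. A minimal Go sketch (the function names are just illustrations, not from any protocol): a majority quorum of n servers has ⌊n/2⌋+1 members, which is why a 4-server cluster tolerates no more failures than a 3-server one.

```go
package main

import "fmt"

// quorumSize is the smallest majority of n servers: floor(n/2) + 1.
func quorumSize(n int) int { return n/2 + 1 }

// maxFailures is how many servers can be unreachable while a quorum still exists.
func maxFailures(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerates=%d\n", n, quorumSize(n), maxFailures(n))
	}
}
```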
Replicated State Machines
A state machine is deterministic: the same inputs in the same order always produce the same state. State machine replication runs identical copies of the state machine on multiple servers and ensures they apply the same commands in the same order. A replicated log is the data structure that imposes that order, and consensus ensures all servers agree on the contents of the log.
If every server starts from the same initial state and executes the same log entries in order, they all reach the same state. Log ordering is therefore the central challenge that consensus solves.
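A minimal Go sketch of the idea, with an assumed key-value state and a hypothetical Command type (neither comes from any particular protocol): two replicas fed the same log in the same order always end in the same state.

```go
package main

import "fmt"

// Command is a hypothetical log entry: set a key to a value.
type Command struct {
	Key, Value string
}

// StateMachine applies commands deterministically: same log, same state.
type StateMachine struct {
	state map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

// Apply executes one committed log entry.
func (sm *StateMachine) Apply(cmd Command) {
	sm.state[cmd.Key] = cmd.Value
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	// Every replica that applies this log in order ends with x=3, y=2.
	a, b := NewStateMachine(), NewStateMachine()
	for _, cmd := range log {
		a.Apply(cmd)
		b.Apply(cmd)
	}
	fmt.Println(a.state, b.state)
}
```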
Consensus Properties
A consensus protocol must satisfy three properties:
- Agreement: All non-faulty processes decide on the same value.
- Validity: The decided value must have been proposed by some process.
- Termination: All non-faulty processes eventually decide.
Agreement and validity are safety properties (nothing bad happens). Termination is a liveness property (something good eventually happens). The tension between them is what makes consensus difficult.
FLP Impossibility
The FLP Impossibility Result proves that in a purely asynchronous distributed system, no deterministic algorithm can guarantee consensus if even one process may crash.
The obstacle is that an asynchronous system cannot distinguish a crashed process from a very slow one. There is no timeout that can be safely used to declare a process dead, so any protocol that always preserves safety has executions in which it cannot guarantee that every process eventually decides.
FLP does not mean consensus is unachievable in real systems. Every real protocol guarantees safety unconditionally but sacrifices liveness under extreme instability (for example, when no leader can stay elected because the network keeps changing). Once the system stabilizes, progress resumes.
Paxos
Paxos is the foundational consensus algorithm. It has three roles: proposers (initiate proposals), acceptors (vote), and learners (learn the decided value). A single server typically plays all three roles.
The protocol rests on a structural property: any two majorities of acceptors in a group of n share at least one member. That overlap is what makes the protocol safe across rounds. A later majority must include at least one acceptor that participated in an earlier majority, and Paxos’s promise and value-selection rules use that fact to prevent two different values from being chosen for the same decision.
The algorithm runs in two phases:
- Phase 1 (Prepare/Promise): A proposer sends Prepare(n) to a majority of acceptors. Each acceptor promises to reject any proposal numbered below n and reports the highest-numbered proposal it has already accepted.
- Phase 2 (Accept/Accepted): The proposer sends Accept(n, v) to a majority, where v is the value from the highest-numbered previously accepted proposal it learned about, or its own value if none were reported. An acceptor accepts Accept(n, v) only if it has not since promised a higher proposal number. If it accepts, it records (n, v) and replies Accepted.
A value is decided once a majority of acceptors have accepted it.
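The acceptor side of both phases is small enough to sketch. The Go snippet below is a single-decree acceptor under simplifying assumptions: integer proposal numbers, string values, and no networking, learners, or durable storage.

```go
package main

import "fmt"

type Acceptor struct {
	promisedN int    // highest proposal number promised
	acceptedN int    // number of the last accepted proposal, 0 if none
	acceptedV string // value of the last accepted proposal
}

// Prepare handles Phase 1: promise to reject proposals below n and report
// any previously accepted proposal so the proposer can adopt its value.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promisedN {
		return false, 0, ""
	}
	a.promisedN = n
	return true, a.acceptedN, a.acceptedV
}

// Accept handles Phase 2: accept (n, v) unless a higher number was promised since.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.acceptedN = n
	a.acceptedV = v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println("prepare(1):", ok)                  // true
	fmt.Println("accept(1,a):", acc.Accept(1, "a")) // true
	ok, n, v := acc.Prepare(2)
	fmt.Println("prepare(2):", ok, n, v) // true 1 a: the proposer must reuse "a"
}
```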
Multi-Paxos extends single-decree Paxos to decide a sequence of values (a log). Treating each log slot as an independent Paxos instance is correct but costs two round trips per entry. Multi-Paxos addresses this by reusing a long-lived leader: the leader runs Phase 1 once with a proposal number that covers all future slots, then skips Phase 1 and goes directly to Phase 2 for each subsequent log entry. That brings the common case to a single round trip per entry. When the leader changes, the new leader must rerun Phase 1 and recover any partially progressed slots, which is why leader instability is expensive.
The Multi-Paxos optimization is what makes a replicated log practical. Plain single-decree Paxos applied to every log slot is too slow for production systems.
Paxos is notoriously difficult to implement correctly. It leaves several questions unspecified: conflict resolution between concurrent proposers, cluster membership changes, and recovery from partial failures. Real deployments (Google Chubby, Apache ZooKeeper’s Zab variant, Google Spanner) required substantial engineering beyond the base algorithm.
Raft
Raft provides the same safety guarantees as Multi-Paxos but is designed to be easier to understand and implement. It is used in etcd, CockroachDB, TiKV, Consul, YugabyteDB, and many other newer systems.
Terms
Raft divides time into terms, numbered with consecutive integers. A term begins with an election. If a candidate wins, it serves as leader for the rest of that term. If no candidate wins (a split vote), the term ends and a new one begins. Terms act as a logical clock: servers reject messages from older terms and update their own term when they see a higher one.
Server States
Every server is in exactly one of three states. A follower is passive: it responds to requests from leaders and candidates but does not initiate any. All servers start as followers. A candidate is a follower that has timed out waiting for a heartbeat and has initiated an election. A leader handles all client requests, replicates log entries to followers, and sends periodic heartbeats to prevent new elections.
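A rough Go sketch of the state and term bookkeeping (field and function names are illustrative, and real implementations persist currentTerm): the one rule that ties terms and states together is that a higher term always demotes the receiver to follower.

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

func (st State) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[st]
}

type Server struct {
	state       State
	currentTerm int
}

// observeTerm applies the logical-clock rule: ignore stale terms,
// step down and adopt any newer term.
func (s *Server) observeTerm(msgTerm int) (accept bool) {
	switch {
	case msgTerm < s.currentTerm:
		return false // stale message from an older term
	case msgTerm > s.currentTerm:
		s.currentTerm = msgTerm
		s.state = Follower // a higher term always demotes
	}
	return true
}

func main() {
	s := &Server{state: Leader, currentTerm: 5}
	fmt.Println(s.observeTerm(4), s.state) // false Leader: stale, ignored
	fmt.Println(s.observeTerm(7), s.state) // true Follower: stepped down
}
```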
Leader Election
Each follower maintains a randomized election timeout (typically 150 to 300 ms). If it expires without hearing from a leader, the follower starts an election:
- Increment the current term.
- Transition to candidate state and vote for itself.
- Send RequestVote RPCs to all other servers.
- A server grants its vote if it has not already voted this term and the candidate’s log is at least as up-to-date as its own.
- If the candidate receives votes from a majority, it becomes leader and immediately sends heartbeats to suppress new elections.
- If no candidate wins (a split vote), the term ends with no leader and a new election begins with a higher term.
“More up-to-date” is defined precisely: a log is more up-to-date if its last entry has a higher term; if the last terms are equal, the longer log is more up-to-date. This restriction ensures a candidate cannot win unless its log contains all committed entries. The randomized timeout makes it unlikely that multiple candidates start elections at the same time.
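The vote-granting check falls out of these rules directly. The Go sketch below assumes each server tracks currentTerm, votedFor, and its last log index and term; RPC transport and persistence to stable storage are omitted.

```go
package main

import "fmt"

type VoteRequest struct {
	Term         int
	CandidateID  string
	LastLogIndex int
	LastLogTerm  int
}

type Voter struct {
	currentTerm  int
	votedFor     string // "" if no vote granted this term
	lastLogIndex int
	lastLogTerm  int
}

// grantVote enforces the two conditions from the text: one vote per term,
// and the candidate's log must be at least as up-to-date as the voter's.
func (v *Voter) grantVote(req VoteRequest) bool {
	if req.Term < v.currentTerm {
		return false // stale term
	}
	if req.Term > v.currentTerm {
		v.currentTerm = req.Term
		v.votedFor = "" // new term: the vote becomes available again
	}
	if v.votedFor != "" && v.votedFor != req.CandidateID {
		return false // already voted for someone else this term
	}
	upToDate := req.LastLogTerm > v.lastLogTerm ||
		(req.LastLogTerm == v.lastLogTerm && req.LastLogIndex >= v.lastLogIndex)
	if !upToDate {
		return false
	}
	v.votedFor = req.CandidateID
	return true
}

func main() {
	v := &Voter{currentTerm: 3, lastLogIndex: 10, lastLogTerm: 3}
	// A candidate with a shorter log at the same last term is rejected.
	fmt.Println(v.grantVote(VoteRequest{Term: 4, CandidateID: "s2",
		LastLogIndex: 9, LastLogTerm: 3})) // false
	// A candidate whose last entry has a higher term wins the vote.
	fmt.Println(v.grantVote(VoteRequest{Term: 4, CandidateID: "s3",
		LastLogIndex: 5, LastLogTerm: 4})) // true
}
```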
Log Replication
Once a leader is elected, it handles all client requests. For each command, the sequence is:
- The leader appends the command to its own log, tagged with the current term and index.
- It sends AppendEntries RPCs to all followers in parallel.
- Once a majority of servers have acknowledged the entry, the leader commits it.
- The leader applies the entry to its state machine and returns the result to the client.
- Subsequent AppendEntries messages carry the commit index, and followers apply newly committed entries to their own state machines.
Each AppendEntries RPC includes the index and term of the entry immediately preceding the new one. A follower rejects the RPC if its own log does not match at that position. When this happens, the leader backs up and retries from an earlier entry until it finds a point of agreement, then overwrites any conflicting entries from that point forward.
The Log Matching Property guarantees that if two entries in different logs share the same index and term, the logs are identical through that index. This invariant is what lets the consistency check in AppendEntries detect and repair divergence.
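The follower side of the consistency check is compact. In the Go sketch below, entries carry only a term, indexes are 1-based as in the Raft paper, and the leader's back-up-and-retry loop is left out.

```go
package main

import "fmt"

type Entry struct{ Term int }

// appendEntries rejects the RPC unless the follower's log contains an entry
// at prevIndex with term prevTerm, then overwrites any conflicting suffix.
// prevIndex 0 means "append from the beginning of the log".
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	if prevIndex > len(log) {
		return log, false // gap: the follower is missing entries
	}
	if prevIndex > 0 && log[prevIndex-1].Term != prevTerm {
		return log, false // mismatch: the logs diverge at prevIndex
	}
	// Consistency check passed: truncate conflicts and append.
	return append(log[:prevIndex], entries...), true
}

func main() {
	follower := []Entry{{1}, {1}, {2}} // indexes 1..3
	// Leader says: the entry at index 2 has term 1; replace from index 3.
	updated, ok := appendEntries(follower, 2, 1, []Entry{{3}, {3}})
	fmt.Println(ok, updated) // true [{1} {1} {3} {3}]
	// A mismatched prev term is rejected; the leader backs up and retries.
	_, ok = appendEntries(updated, 4, 2, []Entry{{4}})
	fmt.Println(ok) // false
}
```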
Commit Rules
An entry is committed once stored on a majority. The leader does not commit entries from previous terms directly: it must first commit an entry from its own current term, after which all preceding entries are committed implicitly under the Log Matching Property. This rule prevents a safety bug in which an entry from a prior term could be transiently replicated, declared committed, then overwritten when a server with a more complete log wins a later election.
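One way a leader might apply this rule, sketched in Go under the assumption of a matchIndex slice holding the highest index known to be replicated on each server, the leader included:

```go
package main

import (
	"fmt"
	"sort"
)

type Entry struct{ Term int }

// advanceCommit returns the new commit index: the highest index replicated
// on a majority whose entry was written in the leader's current term.
func advanceCommit(log []Entry, matchIndex []int, commitIndex, currentTerm int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// The middle element is the highest index present on a majority.
	majorityIndex := sorted[(len(sorted)-1)/2]
	for n := majorityIndex; n > commitIndex; n-- {
		if log[n-1].Term == currentTerm { // current-term entry: safe to commit
			return n
		}
	}
	return commitIndex
}

func main() {
	log := []Entry{{1}, {2}, {3}, {3}} // indexes 1..4
	match := []int{4, 3, 3, 1, 1}      // five servers, including the leader
	// Index 3 is on a majority and has term 3 (the current term): commit it.
	fmt.Println(advanceCommit(log, match, 1, 3)) // 3
	// With current term 4, no new entry can be committed yet.
	fmt.Println(advanceCommit(log, match, 1, 4)) // 1
}
```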
Safety: The Leader Completeness Property
If an entry is committed in a given term, it appears in the log of every leader in all subsequent terms. This follows from the election restriction: a candidate needs votes from a majority, and a committed entry is stored on a majority; those two sets overlap, so at least one voter holds the entry, and the up-to-date check prevents that voter from electing a candidate whose log lacks it. Safety in Raft is unconditional: the protocol never allows two servers to apply different entries at the same index, regardless of message delays, partitions, or crashes. What requires a working majority is progress, not safety.
Liveness
Liveness is conditional. Raft requires a stable elected leader to make progress. If elections repeatedly fail due to network instability, the system stalls. Randomized timeouts make this rare.
Cluster Membership Changes
Adding or removing servers requires care. Raft uses joint consensus, a two-phase approach: the cluster transitions through a configuration that includes both old and new member sets, and decisions during the transition require majority agreement from both before switching to the new configuration alone. Even when adding or removing one server at a time, the reconfiguration must ensure majority intersection across the transition. Joint consensus is how Raft provides that property.
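The decision rule during the transition can be sketched in a few lines. The server IDs and vote set below are illustrative; the point is that a joint decision requires a majority of the old configuration and a majority of the new one.

```go
package main

import "fmt"

// hasMajority reports whether votes covers a majority of config.
func hasMajority(config []string, votes map[string]bool) bool {
	count := 0
	for _, id := range config {
		if votes[id] {
			count++
		}
	}
	return count > len(config)/2
}

// jointDecision requires both majorities while the joint configuration is in effect.
func jointDecision(oldCfg, newCfg []string, votes map[string]bool) bool {
	return hasMajority(oldCfg, votes) && hasMajority(newCfg, votes)
}

func main() {
	oldCfg := []string{"a", "b", "c"}
	newCfg := []string{"b", "c", "d", "e"}
	// b and c are a majority of old; b, c, d are a majority of new.
	fmt.Println(jointDecision(oldCfg, newCfg,
		map[string]bool{"b": true, "c": true, "d": true})) // true
	// Votes from the new set alone cannot decide during the transition.
	fmt.Println(jointDecision(oldCfg, newCfg,
		map[string]bool{"c": true, "d": true, "e": true})) // false
}
```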
Log Compaction
Logs grow without bound. Servers periodically take a snapshot of the state machine and discard all log entries before that point. If a follower falls too far behind, the leader sends it the snapshot directly via an InstallSnapshot RPC.
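A rough sketch of the bookkeeping, assuming a key-value state machine: a snapshot must keep the index and term of the last entry it covers, because later AppendEntries consistency checks refer to them.

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

type Snapshot struct {
	LastIndex int               // last log index included in the snapshot
	LastTerm  int               // term of that entry, kept for consistency checks
	State     map[string]string // the applied state machine state
}

// compact discards every entry at or below lastApplied, keeping a snapshot.
// Indexes are 1-based; the returned log starts just after LastIndex.
func compact(log []Entry, lastApplied int, state map[string]string) (Snapshot, []Entry) {
	snap := Snapshot{
		LastIndex: lastApplied,
		LastTerm:  log[lastApplied-1].Term,
		State:     state,
	}
	return snap, append([]Entry(nil), log[lastApplied:]...)
}

func main() {
	log := []Entry{{1, "x=1"}, {1, "y=2"}, {2, "x=3"}}
	snap, rest := compact(log, 2, map[string]string{"x": "1", "y": "2"})
	fmt.Println(snap.LastIndex, snap.LastTerm, snap.State) // 2 1 map[x:1 y:2]
	fmt.Println(rest)                                      // [{2 x=3}]
}
```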
What You Don’t Need to Study
- The millisecond ranges for election timeouts.
- The publication histories of Paxos or Raft.
- Details of specific Paxos variants like Zab or other Multi-Paxos optimizations.
- The internal architecture of specific systems that use Raft or Paxos.