pk.org: CS 417/Lecture Notes

Distributed Transactions

Atomicity, Consistency Models, and the Trade-offs of Distributed State

Paul Krzyzanowski – 2026-03-17

One of the most critical areas where the abstraction of a single system image is tested is in transaction management. If we want to perform an operation that updates data across multiple machines, we need to ensure that the system remains in a consistent state even if a network link goes down or a server crashes halfway through the process.

Transactions

A transaction is a sequence of operations that we treat as a single logical unit of work. In a distributed environment, data is partitioned or replicated across multiple nodes connected by unreliable networks, which introduces the need for coordination protocols to ensure all participants agree on the final outcome.

Transactions are defined by four correctness properties known as ACID: atomicity (all or nothing), consistency (valid state to valid state), isolation (no interference between concurrent transactions), and durability (committed changes survive crashes). We will examine each of these in depth after we have seen the mechanisms that enforce them.

In the early days of computing, we relied on centralized systems, and transactions were local operations. The shift to distributed computing improved price-to-performance ratios, scalability, and reliability: if one machine fails, the rest can often continue. But distributing data creates new problems. If you have a bank with a checking account database in New York and a savings account database in London, a simple transfer becomes a distributed transaction. You cannot just hope that both updates succeed. You need a protocol to guarantee that if the New York update commits, the London update commits too, and that if either fails, both are rolled back.

Large-scale systems like those at Amazon, Netflix, and Google must handle thousands of transactions per second across global distances, making correct and efficient transaction management a core engineering challenge.

We will explore the topics needed to make distributed transactions work. We start with concurrency control (the mechanisms that enforce isolation) and then look at how deadlock arises from locking. From there, we move to the commit problem: how to guarantee atomicity across multiple independent nodes. Once we have seen these mechanisms in action, ACID will formalize the properties of transactions. We then examine the consistency-availability trade-offs that distributed replication imposes, and the design philosophies that have emerged in response.

Commit and abort

Every transaction has an explicit outcome: it either commits or aborts.

When a transaction commits, all of its changes are made permanent and visible to other processes. The changes survive system crashes because they are written to stable storage before the commit is acknowledged.

When a transaction aborts, all of its changes are rolled back. The system is returned to its state before the transaction started. Abort is also called rollback. Transactions are specifically designed to be abortable: unlike killing an arbitrary process, aborting a transaction leaves no partial state behind.

The system uses a write-ahead log to make this work. Before any change is applied to the data, a record describing the change is written to a sequential log in stable storage (disk or flash). If the system crashes mid-transaction, recovery reads the log and either completes committed transactions (redo) or undoes the changes of incomplete ones (undo). The log is what allows the system to recover correctly even after a power failure in the middle of a complex operation.
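To make the redo/undo idea concrete, here is a toy write-ahead logging scheme. It is a minimal sketch: the in-memory `log` list stands in for stable storage, and all names are illustrative rather than any particular database's design.

```python
# Toy write-ahead logging: every change is logged (with old and new values)
# BEFORE the data store is modified. A "commit" record marks the transaction
# durable; recovery undoes any transaction without one.

def apply_transaction(log, store, txn_id, writes, crash_before_commit=False):
    for key, new in writes.items():
        log.append(("update", txn_id, key, store.get(key), new))  # log first
        store[key] = new                                          # then apply
    if crash_before_commit:
        return                      # simulate a crash: no commit record
    log.append(("commit", txn_id))  # the commit record makes the txn durable

def recover(log, store):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Undo, in reverse order, every update of a transaction that never committed.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old, _ = rec
            if old is None:
                store.pop(key, None)
            else:
                store[key] = old
```

Running one committed transaction and one that crashes mid-flight, then recovering, leaves only the committed changes in the store.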

The basic transaction lifecycle looks like this:

begin_transaction()
    a_bal = read(account_A)
    write(account_A, a_bal - 1000)
    b_bal = read(account_B)
    write(account_B, b_bal + 1000)
commit()          -- or abort() to undo everything

Concurrency Control

When we allow multiple transactions to run at the same time, we do it to increase throughput. If we ran them one after another, most of the system would be idle most of the time. The goal of concurrency control is to allow this parallelism while maintaining the isolation property: we want the final state of the system to be as if the transactions were executed in some serial order. We call this property serializability.

A schedule is a sequence of operations (reads and writes) from various concurrent transactions. A serializable schedule is one that is equivalent to some serial execution. To ensure serializability, a scheduler controls the order in which operations are allowed to proceed.

There are two main approaches to this:

  1. Pessimistic concurrency control assumes that conflicts are likely and prevents them proactively using locks.

  2. Optimistic concurrency control assumes that conflicts are rare and allows transactions to proceed freely, checking for conflicts only at commit time.

Pessimistic concurrency control: locking

Pessimistic concurrency control uses locks to prevent conflicting operations from interleaving. Before accessing a data item, a transaction must acquire a lock on it. If another transaction already holds a conflicting lock on that item, the requesting transaction must wait.

The standard protocol for managing locks in a way that guarantees serializability is two-phase locking (2PL). In 2PL, a transaction is divided into two phases. In the growing phase, it acquires locks but cannot release any. Once it releases its first lock, it enters the shrinking phase, where it can release locks but cannot acquire any new ones.

Phase            Permitted actions                                    Prohibited actions
Growing phase    Acquire read locks, acquire write locks,             Release any locks.
                 upgrade read locks to write locks.
Shrinking phase  Release read locks, release write locks,             Acquire any new locks.
                 downgrade write locks to read locks.

To see why the rule is important, suppose transactions T1 and T2 each update both a person's name and age, but neither follows 2PL: each releases the lock on one field before acquiring the lock on the other. A third transaction, T3, reads both fields. T3 might acquire the name lock after T1 releases it and before T2 acquires it, and the age lock after T2 releases it and before T1 acquires it.

The result is that T3 reads T1’s name and T2’s age, a combination that never existed in the database at any single point in time. The 2PL rule prevents this: once T1 releases any lock, it has entered its shrinking phase and cannot acquire new ones.
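The growing/shrinking rule is easy to enforce mechanically. The sketch below, with illustrative names, tracks a single flag per transaction and rejects any acquisition after the first release.

```python
# A toy two-phase-locking guard: once a transaction releases any lock, it
# enters its shrinking phase and may not acquire new ones.

class TwoPhaseTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after releasing")
        self.held.add(item)

    def release(self, item):
        self.shrinking = True       # first release ends the growing phase
        self.held.discard(item)
```

A transaction that releases the name lock and then tries to lock another item gets an error, which is exactly the behavior that rules out the T3 anomaly above.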

Read and write locks

Requiring an exclusive lock on every data access is unnecessarily restrictive. Two transactions that only read a data item cannot possibly conflict, so there is no harm in letting them proceed in parallel.

Most systems support two types of locks. A read lock (also called a shared lock) allows multiple transactions to hold it simultaneously, but prevents any transaction from acquiring a write lock. A write lock (also called an exclusive lock) grants exclusive access; no other transaction can hold any lock on the item while the write lock is held.

Multiple read locks can coexist on the same item; a write lock conflicts with all other locks.

This substantially improves concurrency for read-heavy workloads. Any number of readers can proceed concurrently, and they only block when a writer needs to modify the same data.
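The compatibility rule reduces to a small decision function. This sketch assumes just the two lock modes described above.

```python
# Shared/exclusive lock compatibility: any number of read locks may coexist,
# but a write lock conflicts with every other lock.

def compatible(requested, held_modes):
    if requested == "read":
        return "write" not in held_modes
    return len(held_modes) == 0     # a write lock requires exclusive access
```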

Strong strict two-phase locking

Plain 2PL has a problem called cascading aborts. During the shrinking phase, a transaction may have released some locks, but another transaction may have read the data that was just unlocked. If the first transaction then aborts, the second has read uncommitted data and must also abort, potentially cascading further.

To solve this problem, most systems use strict two-phase locking: write locks are held until the transaction commits or aborts. This prevents any transaction from reading uncommitted data, eliminating cascading aborts.

Strong strict two-phase locking (SS2PL) goes further and holds all locks, both read and write, until commit or abort. SS2PL is used in several commercial databases.

Deadlock

Locking introduces the risk of deadlock: a situation where a set of transactions are each waiting for a lock held by another in the set, forming a cycle that none can break.

For deadlock to occur, four conditions must all hold simultaneously.

  1. Mutual exclusion: A resource can be held by at most one transaction at a time.

  2. Hold and wait: A transaction holding locks can request additional locks and wait for them.

  3. Non-preemption: A lock, once granted, cannot be forcibly taken away.

  4. Circular wait: There is a cycle of transactions, each waiting for a resource held by the next.

The standard tool for reasoning about deadlock is the wait-for graph (WFG), where each node is a transaction and a directed edge from T1 to T2 means T1 is waiting for a resource held by T2. A cycle in the graph indicates a deadlock.

In a single-node system, the database engine maintains this graph in memory and can detect cycles immediately. In a distributed system, each node sees only the transactions and locks relevant to its own data. A deadlock might span multiple machines, with each machine seeing only some edges of the cycle, so no single node can detect it alone.
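On a single node, deadlock detection is just cycle search over the wait-for graph. A minimal sketch, representing the WFG as a dict from each transaction to the set of transactions it waits for:

```python
# Deadlock detection as cycle search in a wait-for graph.

def has_deadlock(wfg):
    def reaches(start, node, seen):
        for nxt in wfg.get(node, ()):
            if nxt == start:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(start, nxt, seen):
                    return True
        return False
    return any(reaches(t, t, set()) for t in wfg)
```

A three-transaction cycle (T1 waits for T2 waits for T3 waits for T1) is a deadlock; a simple chain is not.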

There are three practical approaches to handling deadlock.

1. Ignore. Rely on application-level timeouts to eventually abort stuck transactions. This is more common than you might expect in general-purpose systems, but it is not acceptable for transactional databases because a slow but live node can trigger the same timeout as a genuinely deadlocked one, causing an unnecessary abort.

2. Detect. Find cycles in the wait-for graph and break them by aborting at least one transaction.

One approach is centralized deadlock detection: a single coordinator node collects local WFGs from all nodes, merges them into a global wait-for graph, and searches for cycles.

This is straightforward but creates a single point of failure and suffers from phantom deadlocks, false positives that arise because local graphs are collected asynchronously. A lock may have been released by the time the coordinator sees the edge that depends on it, causing it to report a cycle that no longer exists.

The Chandy-Misra-Haas algorithm avoids a central coordinator and the construction of a global wait-for graph by using edge chasing. When a transaction T0 blocks waiting for a resource held by T1, T0 sends a probe message to T1. The probe carries three IDs:

  1. the originating transaction

  2. the sender, and

  3. the recipient.

If T1 is itself blocked, it forwards the probe (updating the sender field) to each node that holds the resource it needs. Each blocked node in turn forwards the probe further. If the probe ever returns to T0, a cycle exists and T0 is therefore deadlocked. No central coordinator is needed, and probe messages are small.
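The edge-chasing logic can be sketched as follows. For simplicity, the sketch uses a shared `waits_for` dict rather than real messages, so it shows the probe-forwarding rule, not the distributed mechanics.

```python
# Chandy-Misra-Haas edge chasing (single-process sketch). A probe is a tuple
# (origin, sender, destination); a blocked destination forwards it, updating
# the sender field. If a probe reaches its origin, a cycle exists.

def probe_detects_deadlock(waits_for, origin):
    frontier = [(origin, origin, h) for h in waits_for.get(origin, ())]
    seen = set()
    while frontier:
        orig, sender, dest = frontier.pop()
        if dest == orig:
            return True                      # probe returned: deadlock
        if dest in seen:
            continue
        seen.add(dest)
        for nxt in waits_for.get(dest, ()):  # blocked node forwards the probe
            frontier.append((orig, dest, nxt))
    return False
```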

Once a deadlock is detected, the system breaks it by aborting at least one transaction, typically the youngest or the one that has done the least work.

3. Prevent. Eliminate deadlock by making cycles structurally impossible, without needing detection. The standard approach assigns each transaction a unique timestamp when it begins and uses that timestamp to decide who waits and who aborts when two transactions conflict.

In wait-die, an older transaction may wait for a younger one, but a younger transaction requesting a resource held by an older one must abort and restart. In wound-wait, the rule is reversed: an older transaction preempts a younger one by aborting it, while a younger transaction waits for an older one. In both schemes, the edges in the wait-for graph always point in the same direction (either from old to young or from young to old), so cycles are impossible.
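The two rules fit in a few lines. The sketch assumes a lower timestamp means an older transaction, and reports what happens when a requester wants a lock the holder has.

```python
# Wait-die vs. wound-wait, using begin timestamps (lower = older).

def wait_die(requester_ts, holder_ts):
    # Older requesters wait; younger requesters abort ("die") and restart.
    return "wait" if requester_ts < holder_ts else "die"

def wound_wait(requester_ts, holder_ts):
    # Older requesters abort ("wound") the holder; younger requesters wait.
    return "wound holder" if requester_ts < holder_ts else "wait"
```

In both schemes only one wait direction is ever possible, so the wait-for graph can never contain a cycle.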

Optimistic concurrency control

Optimistic concurrency control (OCC) takes the opposite approach to locking: allow transactions to proceed without acquiring locks and check for conflicts only at commit time.

An optimistic transaction runs in three phases:

  1. In the working phase, the transaction reads data and writes to a private workspace. No locks are held.

  2. In the validation phase, before committing, the system checks whether any of the data the transaction read has been modified by a transaction that committed since the working phase began. If there is a conflict, the transaction is aborted and restarted.

  3. In the update phase, if validation passes, the private workspace is made permanent.

OCC is deadlock-free and allows maximum parallelism during the working phase. The cost is that transactions may be aborted and restarted after doing all their work. This makes OCC attractive for read-heavy workloads with low conflict rates, and unattractive when write contention is high.
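The validation check at the heart of OCC can be sketched simply. The `commit_history` map (item to last commit timestamp) is an illustrative stand-in for whatever bookkeeping a real system keeps.

```python
# OCC validation sketch: the transaction may commit only if nothing in its
# read set was committed by another transaction after its working phase began.

def validate(read_set, start_time, commit_history):
    return all(commit_history.get(item, 0) <= start_time for item in read_set)
```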

Multi-version concurrency control

Multi-version concurrency control (MVCC) is used by most modern databases, including PostgreSQL, Oracle, and MySQL (InnoDB). Rather than overwriting a data item on every write, MVCC keeps multiple versions of it, allowing readers and writers to proceed concurrently without blocking each other.

A common MVCC design assigns each transaction a snapshot timestamp when it begins. Every write creates a new version of the data item. A read returns the newest version that was committed before the transaction’s snapshot was taken. This is commonly called snapshot isolation: the transaction sees a consistent view of the data as it existed when it began, regardless of what other transactions do afterward. Different systems implement the exact visibility rules differently, but the core idea is the same: reads see a stable snapshot and do not block writers.

Reads never block, because they always read from a consistent snapshot rather than waiting for in-progress writers to finish. Write-write conflicts are detected at commit time using a first-committer-wins rule: a transaction’s write proceeds into a private version, and when it tries to commit, the system checks whether another transaction has already committed a write to the same item since this transaction’s snapshot was taken. If so, the later transaction aborts and restarts.
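Both rules can be sketched against a simple version store: a list of (commit timestamp, value) pairs per item. This is an illustrative model, not how any particular engine lays out versions.

```python
# Snapshot reads and first-committer-wins over a toy multi-version store.

def snapshot_read(versions, item, snapshot_ts):
    # Return the newest version committed at or before the snapshot.
    visible = [(ts, v) for ts, v in versions.get(item, []) if ts <= snapshot_ts]
    return max(visible)[1] if visible else None

def may_commit_write(versions, item, snapshot_ts):
    # First committer wins: fail if a newer version appeared since our snapshot.
    newest = max((ts for ts, _ in versions.get(item, [])), default=0)
    return newest <= snapshot_ts
```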

Old versions must eventually be garbage-collected, and write-write conflicts still require resolution. But for read-heavy workloads, MVCC dramatically reduces the contention compared to locking.

Leases

In distributed systems, locks have a fault-tolerance problem: if the transaction that holds a lock crashes, the lock is never released, and the resource is permanently unavailable. A lease is a lock with a time limit. The lock is automatically released when the lease expires, even if the holder crashes.

The trade-off is in choosing the lease duration. Short leases need to be renewed frequently and may expire on a slow-but-alive transaction, forcing it to abort unnecessarily. Long leases reduce renewal overhead but mean a longer wait when the holder actually fails.

Fencing tokens (which we saw in the context of ZooKeeper in Week 6) address the case where a lease expires and a new holder acquires the resource, but the old holder is merely slow rather than dead and tries to make changes after its lease has expired. They are similar in concept to terms in the Raft protocol: numbers that monotonically increase with each new lock grant, preventing a stale holder from making changes after its lease has expired.
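A minimal sketch of a lease service with fencing, with time passed in explicitly to keep the example deterministic; the class and method names are illustrative.

```python
# Leases with fencing tokens: the lock service grants a lease with an expiry
# and a monotonically increasing token; the resource rejects stale tokens.

class LeaseService:
    def __init__(self):
        self.token = 0
        self.expires = 0.0

    def acquire(self, now, duration):
        if now < self.expires:
            return None                 # lease still held by someone else
        self.token += 1
        self.expires = now + duration
        return self.token               # fencing token for this grant

class Resource:
    def __init__(self):
        self.highest = 0

    def write(self, token):
        if token < self.highest:
            return False                # stale holder is fenced off
        self.highest = token
        return True
```

If the first holder stalls past its expiry, a second client acquires a higher token, and the first holder's late write is rejected.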

The concurrency control mechanisms we have examined ensure isolation within a single transaction. The next challenge is to provide atomicity across nodes: guaranteeing that all participants in a distributed transaction either commit or abort together.


The Commit Problem

Why distributed commit is hard

In a single-node database, committing a transaction is simple: write the changes to a write-ahead log, flush the log to disk, and you are done. If the machine crashes mid-write, recovery replays the log. With only one machine, there’s only one decision.

Now, suppose a bank transfer moves money from an account in database A to an account in database B. Both databases must either apply the change or neither should. If A commits but B crashes before it commits, you have just destroyed money. If B commits but A rolls back, you have created money out of thin air. The correctness requirement is that the outcome must be atomic across both databases.

The problem is that no single node has full knowledge of the state of all the other nodes. Each node can only observe its own state and the messages it receives. A node that stops responding might be slow, might have crashed, or might have committed and then crashed. The coordinator cannot tell the difference.

Two-Phase Commit

The Two-Phase Commit protocol (2PC) is the standard solution to this problem. Jim Gray introduced it in 1978 in his paper “Notes on Data Base Operating Systems,” and it has been implemented in virtually every distributed database and transaction manager since.

The protocol assumes a coordinator-participant model. One node acts as the coordinator, typically the node that initiated the transaction. The other nodes that hold data touched by the transaction are participants.

Phase 1: Prepare (Voting Phase)
The coordinator sends a PREPARE message to every participant. Each participant must determine whether it can commit its portion of the transaction. This means validating integrity constraints, ensuring that necessary locks are held, and flushing a prepare record to stable storage. The key invariant is that once a participant votes YES, it must be able to commit even after a crash. It is making a durable promise.
If a participant is ready, it responds YES. If anything is wrong, it responds NO. A single NO vote forces the entire transaction to abort. If a participant fails to respond, the coordinator waits, retrying as needed until the participant recovers and replies. The protocol assumes a fail-recover model: nodes eventually come back.
Phase 2: Commit or Abort

If all participants voted yes, the coordinator writes a commit record to its own log and broadcasts COMMIT to all participants. Each participant commits and releases its locks. If any participant voted no, the coordinator broadcasts ABORT and each participant rolls back.

The writes to the log are what make this work across failures. Before sending any decision, the coordinator writes that decision to stable storage. Before responding to a PREPARE message with yes, a participant writes a prepare record. If either crashes, the protocol can be recovered by replaying the log.

Coordinator          Participants
    |---- PREPARE ------->|
    |<-- YES / NO --------|
    |
    | (writes commit or abort to log)
    |
    |--- COMMIT/ABORT --->|
    |<-- ACK -------------|
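The coordinator's side of this exchange can be sketched in a few lines. Participants here are plain callables that vote on PREPARE; the `log` list stands in for the coordinator's stable storage.

```python
# Two-phase commit, coordinator's view: collect votes, log the decision
# durably, then broadcast it. Unanimity is required: one NO aborts everyone.

def two_phase_commit(participants, log):
    votes = [p("PREPARE") for p in participants]    # phase 1: voting
    decision = "COMMIT" if all(votes) else "ABORT"
    log.append(decision)                            # decision hits the log first
    for p in participants:                          # phase 2: broadcast
        p(decision)
    return decision
```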

2PC requires unanimous agreement, not a majority. This is a fundamental distinction from the consensus protocols we studied earlier.

Raft and Paxos require only a majority (F+1 out of a total of 2F+1 nodes) to make progress and can tolerate minority failures. In 2PC, every participant has veto power: one NO vote aborts the entire transaction, and one unresponsive participant blocks the protocol indefinitely.

Unlike consensus, where the goal is to agree on a value and a majority provides sufficient evidence of that agreement, distributed commit requires the decision to be honored by every node that touched the transaction. The transaction must complete everywhere or nowhere, so no majority shortcut is available.

Failure scenarios in 2PC

2PC handles many failure scenarios, but not all.

A participant fails before it votes.
The coordinator waits for it to recover and respond. It does not treat silence as a NO; it simply keeps retrying. This is why 2PC assumes a fail-recover rather than fail-stop model.
A participant fails after voting YES but before receiving the decision.
The participant enters a state called uncertain. When it recovers, it finds a prepare record in its log but no commit or abort record. It cannot unilaterally decide and must contact the coordinator. This is the critical window of vulnerability in 2PC.
The coordinator fails before sending any decision.
All participants have voted YES, all are in the uncertain state, and none can determine the global outcome. This is where 2PC blocks indefinitely.
The coordinator fails after some participants, but not all, receive the decision.
This is actually recoverable. A participant that has already committed is definitive proof that the coordinator decided to commit. Any uncertain participant that contacts a committed participant can safely commit; one that finds an aborted participant knows to abort. The recovery algorithm is: if any reachable participant has committed, everyone commits; if any reachable participant has aborted, everyone aborts; if no reachable participant has received a decision, the uncertain participants must wait. They cannot safely infer abort, because the coordinator may have decided to commit before it crashed.

The fundamental problem is the third scenario: 2PC can block. A participant that has voted YES cannot unilaterally commit or abort, and if no reachable node knows the coordinator’s decision, its locks remain held indefinitely. This blocking behavior is the known limitation of 2PC that has motivated decades of subsequent research.

Three-Phase Commit

Three-Phase Commit (3PC) was introduced by Dale Skeen in his 1981 paper “Nonblocking Commit Protocols.” The goal was to eliminate the blocking behavior of 2PC by ensuring that no single failure can leave the system in an irrecoverable state.

3PC adds a third phase, the pre-commit phase, between the voting and the final commit phases. The protocol works as follows.

Phase 1: CanCommit
The coordinator asks each participant: can you commit? Participants respond yes or no. If any respond no, the coordinator aborts. This is the same as in 2PC.
Phase 2: PreCommit
If all participants said yes, the coordinator sends a PRE-COMMIT message. Participants acknowledge the message. This phase establishes a shared understanding that the coordinator intends to commit. No participant has committed yet (none has been told to), but all know that the decision is commit.
Phase 3: DoCommit
The coordinator sends DO-COMMIT and participants commit.

The key insight is that the PRE-COMMIT phase removes the fatal uncertainty of 2PC. In 2PC, an uncertain participant cannot tell whether the coordinator decided to commit or abort. In 3PC, if a new coordinator takes over after the original coordinator crashes, it can query participants and determine the state of the protocol. If any participant has received PRE-COMMIT, the new coordinator knows the decision was to commit and proceeds. If no participant has received PRE-COMMIT, the new coordinator can safely abort.

The PreCommit phase also gives the protocol more opportunities to resolve timeouts without blocking forever. A participant that times out before receiving PRE-COMMIT can safely abort: the coordinator has not told anyone to commit, so no participant can have committed yet.

Once a participant has entered PreCommit, abort is no longer safe because another participant may have already committed. At that point, commit is the correct resolution: every participant is known to have voted YES, and the coordinator's intent to commit has already been announced.

The problem: 3PC breaks down under network partitions

3PC resolves coordinator failure at the cost of an additional round-trip and, more importantly, the assumption that the network has bounded message delay and that node failures can be reliably detected. In real environments, these assumptions rarely hold. If a network partition occurs during the pre-commit phase, nodes on each side may independently trigger timeout-based state transitions and reach incompatible decisions: one side proceeds to commit while the other times out and aborts. The protocol cannot distinguish a slow node from a dead one.

Because of these limitations, 3PC is rarely implemented in practice. Virtually no commercial database uses it. The better modern answer is to run 2PC on top of a replicated consensus group: replace the single coordinator with a Raft or Paxos group that can survive the failure of individual members. This solves the coordinator blocking problem, though a participant that is permanently unreachable can still stall the protocol. The section on how 2PC relates to Raft, Paxos, and Virtual Synchrony below discusses this further.

The availability cost of distributed transactions

2PC chains the availability of all participants together (a serial dependency). If each database in a transaction has 99.9% availability (about 8.75 hours of downtime per year), a transaction spanning two databases has 0.999 × 0.999 ≈ 99.8% availability, roughly 17.5 hours of expected downtime per year. Across five databases, it's 0.999^5 ≈ 99.5%, or close to two full days of downtime per year. Each participant is a potential point of failure for every transaction that touches it.
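The arithmetic is worth internalizing; it can be reproduced directly:

```python
# A transaction spanning n participants succeeds only when all n are up,
# so availabilities multiply.

def chained_availability(per_node, n):
    return per_node ** n

def downtime_hours_per_year(availability):
    return (1 - availability) * 365 * 24
```

With 99.9% per-node availability, two participants give about 99.8% and five give about 99.5%, i.e., roughly 44 hours of expected downtime per year for the five-way case.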

This is one of the most compelling arguments for designing systems that minimize cross-database transactions. Much of the BASE and eventual consistency thinking covered later in this document grew directly from this reality. At Internet scale, with dozens or hundreds of services each with their own database, requiring strict 2PC across service boundaries produces systems that are collectively less available than any individual component.

How 2PC relates to consensus protocols

Distributed commit is closely related to consensus, so it is worth situating 2PC relative to the protocols we discussed earlier.

Raft and Paxos are general-purpose fault-tolerant consensus algorithms, resilient to network partitions as long as a majority of nodes are reachable. Either is an excellent tool for making the 2PC coordinator fault-tolerant: run the coordinator as a replicated state machine (using Raft or Paxos) so that coordinator failure does not stall the protocol. However, Raft and Paxos solve a different problem from 2PC. They assume all nodes apply all commands and have no concept of a participant vetoing a decision. 2PC must honor a participant’s NO vote. Replacing the coordinator with a Raft or Paxos group therefore fixes coordinator failure but does nothing about a participant that is unreachable or that cannot commit. If a participant is permanently down, the transaction still blocks.

Virtual Synchrony, by contrast, is fast and lightweight but cannot survive network partitions, making it unsuitable as a substitute for a commit protocol across independent nodes.

The consensus protocols and 2PC are often combined in systems: Paxos or Raft makes each participant group fault-tolerant by replicating its state across multiple nodes, and 2PC then coordinates across those groups when a transaction touches more than one of them.

These protocols also differ in cost. Raft and Paxos require durable log replication and quorum acknowledgment. 2PC is more expensive still: every participant must flush to stable storage twice (prepare and commit), and locks are held across network round-trips. Each is the right tool for a different job.

Sagas: avoiding 2PC in microservices (optional material)

A design pattern that emerged with microservices to improve performance while weakening transactional guarantees is that of sagas. We won’t cover it, but it’s useful to be aware that it exists.

The overhead and blocking behavior of 2PC make it impractical when a business operation spans many independent services, each with its own database. The saga pattern is a widely used alternative that achieves a form of distributed atomicity without a distributed commit protocol.

A saga breaks a multi-step operation into a sequence of local transactions, each of which commits immediately to its own database. For each step, the saga also defines a compensating transaction that undoes the step’s effect if a later step fails. If step N fails, the saga executes the compensating transactions for steps N-1, N-2, and so on, in reverse order.

Example: An e-commerce order might: (1) reserve inventory, (2) charge the payment card, (3) update loyalty points, (4) send a confirmation email. If step 3 fails, the saga refunds the card (compensates step 2) and releases the reservation (compensates step 1). No distributed lock is held across the whole sequence.
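A saga executor reduces to a short loop. This is a minimal in-memory sketch: steps pair a forward action with its compensation, and real implementations must additionally persist the saga's progress so compensation survives crashes.

```python
# Saga sketch: run each step's forward action; on failure, run the
# compensations of all completed steps in reverse order.

def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()                  # undo completed work, newest first
            return "aborted"
    return "completed"
```

For the order example, a failure at the loyalty-points step would run the refund compensation and then the inventory-release compensation, in that order.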

Sagas scale well and avoid the availability problem of 2PC, but they come with real shortcomings that must be understood before using them.

Intermediate state is visible. While the saga is executing, other transactions can observe partial results: inventory is reserved but payment has not been charged yet. This violates isolation. Applications must be designed to tolerate or hide this inconsistency.

Compensating transactions are not true rollbacks. A compensating transaction is a new forward operation that logically undoes an earlier step. Any external side effects from the earlier step may already be permanent. If the confirmation email was sent in step 4, no compensation can unsend it (it could generate a follow-up email with an explanation). If an API call to a payment processor succeeded, the compensation must issue a separate refund request, which can itself fail.

Failure handling is complex. Designing correct compensations for every possible partial-failure scenario and ensuring that compensations are themselves idempotent and reliable is difficult. The atomicity guarantee that 2PC provides automatically in the database layer moves into application code with sagas, and application code is where bugs live.

Sagas are appropriate when the application can tolerate temporary inconsistency, when compensations are feasible and reliable for every step, and when the scalability and availability benefits outweigh the complexity cost.

Having seen how atomicity and isolation are achieved in practice, we can now look at how these properties are formally defined.


ACID

ACID is the set of properties that traditional relational databases provide for transactions. The acronym was coined by Andreas Reuter and Theo Härder in 1983, but the concepts had been developed throughout the 1970s, particularly by Jim Gray.

Atomicity: The transaction is all-or-nothing. Either all of its operations complete and are committed, or none of them are. There is no partial execution. We have just seen how 2PC achieves atomicity across multiple nodes.

Consistency: A transaction moves the database from one valid state to another. All integrity constraints in the database (foreign keys, uniqueness, domain constraints) must hold at the end of the transaction. Note that this is a different definition of ‘consistency’ than in distributed systems, where the term refers to what values a read is allowed to return across replicas. ACID consistency is about application-level invariants enforced by the database.

Isolation: Concurrent transactions do not interfere with each other. From each transaction’s point of view, it appears to execute alone. The standard isolation level for serializability requires that the outcome of concurrent transactions is equivalent to some serial execution of those transactions. We have just seen how concurrency control mechanisms (locking, OCC, and MVCC) enforce this property.

Durability: Once a transaction commits, it is permanent. The changes survive crashes and restarts. This is what write-ahead logging provides.

These properties are straightforward to provide when all data lives on one machine. When a transaction spans multiple machines, each with its own database, every one of these properties becomes significantly harder and more expensive to maintain.


Consistency Models

The word “consistency” is overloaded in computer science, and it means something different in databases than in distributed systems.

In databases, consistency (the C in ACID) means that a transaction leaves the database in a valid state, satisfying integrity constraints such as foreign keys or balance invariants. That is an application-level correctness property, and the database enforces it by aborting transactions that would violate it.

In distributed systems, consistency refers to what a read operation is allowed to return given a history of writes across replicated nodes. The two uses of the word are unrelated, and conflating them is a persistent source of confusion. When the CAP theorem talks about consistency, it means the distributed systems definition, not the ACID one.

Stronger models give programs more intuitive guarantees but cost more in coordination. Weaker models allow better performance and availability but push complexity into the application.

Linearizability

Linearizability is the strongest practical consistency model. It was formalized by Herlihy and Wing in their 1990 paper “Linearizability: A Correctness Condition for Concurrent Objects.”

The guarantee is: every operation appears to take effect instantaneously at some point between its invocation and its completion, and the resulting order is consistent with real time. The system behaves as if there is a single copy of the data, and all operations execute one at a time on that copy.

A useful way to think about this is as a single imaginary global timeline. Even though the system is distributed across many machines, the outcome must look as if all operations were placed on one timeline, and that timeline must agree with real time whenever real time gives a clear ordering. Two key ideas follow from this: each operation appears to take effect atomically at a single instant, and any two operations that do not overlap in time must take effect in their real-time order.

Example: Suppose x was last written with the value 5.

  1. Client A reads x and gets 5.

  2. Client B writes x=10. The write completes.

  3. Client C begins a read of x after B’s write has completed.

Under linearizability, C must return 10, because B’s write completed in real time before C’s read began. If D then reads x after C, D must also return 10 or a later value. It would be incorrect for D to return 5, because D’s read started after B’s write completed.

Example: Consider a bank transfer where T1 transfers $100 and its write completes at 10:00:01.

  1. T1 transfers $100. Write completes at 10:00:01.

  2. T2 starts at 10:00:03 and reads the balance.

Under linearizability, T2 must see T1’s write, because T1 completed before T2 began. It would be incorrect for T2 to return the old balance.

This is what you want from a distributed lock or a counter. etcd provides linearizability for all operations. ZooKeeper provides linearizable writes, but reads are sequentially consistent by default; a client must call sync() before a read to guarantee it sees the latest write.

An important point is that linearizability does not require operations to be ordered by physical wall-clock timestamps. It only constrains non-overlapping operations: if one operation completes before another begins, the first must take effect before the second. Operations that overlap in time may be ordered either way.

The cost of linearizability is coordination. Every write must propagate to enough replicas, and every read must confirm it is seeing the current value. One common implementation routes all operations through a single leader:

  1. Clients send operations to the leader.

  2. The leader assigns each operation a position in a replicated log.

  3. A fault-tolerant replication protocol such as Raft propagates the log to all replicas.

  4. The operation is reported as committed only once the system knows no later operation will be inserted before it.

This is more expensive than sequential consistency because respecting real-time ordering requires waiting for that confirmation before responding to clients. Other approaches include timestamp schemes that account for clock uncertainty, such as Google Spanner’s TrueTime, which uses hardware-assisted bounds on clock error to establish a safe ordering window before committing.
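The leader-based flow above can be sketched as a toy simulation (illustrative only, not a real Raft implementation; the class names are invented): the leader assigns each operation a log position, replicates it, and reports it committed only after a majority of the cluster stores it.

```python
class Replica:
    """A follower that appends log entries in order."""

    def __init__(self):
        self.log = []

    def append(self, index, op):
        # Accept the entry only if it extends the log at the next position.
        if index == len(self.log):
            self.log.append(op)
            return True
        return False

class Leader:
    """Assigns log positions and commits once a majority acknowledges."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []

    def submit(self, op):
        index = len(self.log)
        self.log.append(op)
        # Replicate and count acknowledgments; the leader counts itself.
        acks = 1 + sum(r.append(index, op) for r in self.replicas)
        majority = (1 + len(self.replicas)) // 2 + 1
        return acks >= majority              # committed only with a majority
```

Because every operation passes through one leader and is acknowledged by a majority before the client hears back, any operation that starts after a committed one must be assigned a later log position, which is what gives the real-time ordering.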

Linearizability is also what Gilbert and Lynch used as the definition of “consistency” in the CAP proof.

Sequential consistency

Sequential consistency was defined by Leslie Lamport in 1979. It relaxes linearizability by dropping the real-time requirement.

The guarantee is: the result of any execution is the same as if the operations of all the processes were executed in some sequential order, and the operations of each individual process appear in that sequence in the order specified by its program.

Example: Clients A and B both write to x. A writes x=1 and then reads x; B writes x=2 and then reads x.

Under sequential consistency, A might read its own write and see 1, or it might see 2 (if B’s write is ordered between A’s write and A’s read in the global order). B might likewise see 1 or 2.

What is required is that there exists some total order of all operations that is consistent with each process’s local order. You cannot have A reading 2 while B reads 1: A reading 2 places B’s write after A’s, while B reading 1 places A’s write after B’s, and no single total order satisfies both.

Sequential consistency allows replicas to diverge temporarily as long as the global history is consistent with some valid sequential execution. It is weaker than linearizability because operations no longer need to respect real-time order. You can have one process see a write before another process does, as long as all processes agree on the same global ordering of writes.
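The definition can be checked mechanically for small histories. The brute-force search below (an illustrative sketch, not a practical algorithm) asks whether any interleaving exists that respects each process's program order and in which every read returns the most recently written value. Operations are tuples ("w", var, value) or ("r", var, expected), and variables start at 0.

```python
def sequentially_consistent(histories):
    """Return True if some total order explains all processes' reads."""

    def search(positions, memory):
        # positions[p] = next unexecuted operation of process p.
        if all(positions[p] == len(h) for p, h in enumerate(histories)):
            return True
        for p, h in enumerate(histories):
            if positions[p] == len(h):
                continue
            op, var, val = h[positions[p]]
            if op == "r" and memory.get(var, 0) != val:
                continue                      # this read cannot go next
            next_mem = dict(memory, **{var: val}) if op == "w" else memory
            next_pos = positions[:p] + (positions[p] + 1,) + positions[p + 1:]
            if search(next_pos, next_mem):
                return True
        return False

    return search(tuple(0 for _ in histories), {})
```

For the example above, both processes reading 1 is explained by the order B:w2, A:w1, A:r, B:r, so the checker accepts it; A reading 2 while B reads 1 has no explaining order, so the checker rejects it.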

Multi-player games sometimes use sequential consistency: the ordering of moves must be consistent across all clients, but a slight lag before seeing someone else’s move is acceptable.

Causal consistency

Causal consistency relaxes sequential consistency further by only requiring that causally related operations appear in the same order for all processes. Operations that are causally independent can be seen in different orders by different processes.

Two operations are causally related if one might have caused the other. Concretely: if process A writes x=1 and then sends a message to process B, and B reads x=1 and then writes y=2, then the writes to x and y are causally related. All processes must see them in that order.

Example: Alice posts a photo on a social network. Bob sees the photo and comments on it.

Under causal consistency, every user who sees Bob’s comment must also see Alice’s photo (the cause). Users who have not seen either yet may see them in any order. A user who sees the comment first and then does not see the photo has a causally inconsistent view.

Causal consistency is achievable with vector clocks. Each write carries a vector timestamp, and a node delivers an operation only after all operations that causally precede it have been delivered. This allows much more local operation and lower latency than sequential or linearizable systems.
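The delivery rule above can be sketched directly (an illustrative fragment; function names are invented). A message from sender s carrying vector timestamp ts is deliverable at a node with vector clock vc only if it is the next message expected from s and the node has already seen everything the sender had seen when it wrote:

```python
def can_deliver(ts, sender, vc):
    """Is a message with timestamp ts from `sender` causally deliverable?"""
    return (ts[sender] == vc[sender] + 1 and
            all(ts[i] <= vc[i] for i in range(len(vc)) if i != sender))

def deliver(ts, sender, vc):
    """Merge the message's timestamp into the node's vector clock."""
    return [max(a, b) for a, b in zip(vc, ts)]
```

In the photo-and-comment example, Alice's post might carry [1, 0, 0] and Bob's comment [1, 1, 0] (written after seeing the post). A third node cannot deliver the comment first: its check fails until the post has been delivered.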

Causal consistency makes intuitive sense for many collaborative applications. Systems like collaborative document editors and messaging platforms operate in this regime.

Eventual consistency

Eventual consistency is the weakest guarantee that is still useful. The promise is simple: if no new updates are made to a data item, all replicas will eventually converge to the same value.

There is no guarantee about when convergence happens, and no constraint on what you might read in the interim. Two concurrent reads of the same key from different replicas might return different values. A read might return a value that is many writes out of date.

Example: You post a status update on a social network. Your update immediately appears on the replica closest to you. Over the next second or two, it propagates to replicas in other regions. During that propagation window, some of your followers see the update and some do not. Eventually, all replicas have the update.

DNS works the same way: a record change propagates through the system over a period of minutes or hours.

Eventual consistency enables high availability and low latency because writes can be acknowledged from a single local replica without waiting for coordination. The application must be designed to tolerate stale reads and to handle conflicts when two concurrent writes to the same key diverge on different replicas.

You don’t need to know this for the exams, but keep this in the back of your mind in case you encounter it in the future.

The cleanest way to handle conflicts in eventually consistent systems is through Conflict-Free Replicated Data Types (CRDTs). CRDTs are data structures designed so that all concurrent updates can be merged without conflict, regardless of the order they are applied. A counter that can only be incremented is a simple CRDT: you just sum up all the increment operations seen. A more complex example is a distributed shopping cart that is a CRDT set, where any replica can add items and the final cart is the union of all additions.

Strong Eventual Consistency (SEC) adds a safety property to eventual consistency: any two nodes that have received the same set of updates will be in the same state, even if they received the updates in different orders. CRDTs are the standard mechanism for achieving SEC.
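The increment-only counter mentioned above, usually called a G-Counter, makes SEC concrete (a minimal sketch; the class name follows the common convention). Each replica increments only its own slot, merge takes the element-wise maximum, and the value is the sum of all slots. Because merge is commutative, associative, and idempotent, replicas that have seen the same updates end up in the same state regardless of merge order:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""

    def __init__(self, n_replicas, my_id):
        self.counts = [0] * n_replicas
        self.my_id = my_id

    def increment(self):
        self.counts[self.my_id] += 1         # only touch our own slot

    def merge(self, other):
        # Idempotent, order-independent merge of another replica's state.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)
```

Two replicas can accept increments concurrently during a partition and, once they exchange states in either order, both report the same total with no conflict to resolve.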

Serializability and linearizability

Serializability and linearizability are both used to describe correctness guarantees, and it is worth being precise about how they differ.

Serializability is a property of transactions: multi-step, multi-object operations. It says that the outcome of executing a set of concurrent transactions must be equivalent to the outcome of executing those transactions in some serial order, one after the other. It says nothing about which serial order or whether that order respects real time. Two transactions could execute concurrently and produce a result consistent with the serial order B→A even if A started before B in wall-clock time.

Linearizability is a property of individual operations on a single object. It says that each operation appears to take effect instantaneously at some point between its invocation and its completion, in an order that is consistent with real time. Linearizability says nothing about grouping multiple operations into transactions.

The two properties are addressing different things, and you can have one without the other. A database can be serializable but not linearizable: the transactions produce results equivalent to some serial order, but that order might not reflect who started first. A key-value store can provide linearizable operations on individual keys but not serializability: there is no transaction mechanism for grouping reads and writes across multiple keys atomically.

You don’t need to know this for the exams, but you should be aware that these properties can also work together:

Strong serializability (also called strict serializability) combines both: the results of concurrent transactions must be equivalent to some serial execution, and that serial order must be consistent with real time. If you commit a transaction and then start a new one, the new transaction must see all the effects of the first one. Strong serializability is the most intuitive guarantee and the strongest useful model. It is what you would naturally assume a correct database provides, and it is the goal of systems like Google Spanner.

A note on “strong consistency”

“Strong consistency” is an informal term that different people use to mean different things. In some contexts, it means linearizability. In others, it means sequential consistency. When you encounter this term in documentation or conversation, you may need to push for a more precise definition.

These models define the spectrum of guarantees a replicated system can offer. The CAP theorem reveals a fundamental constraint on which of those guarantees can be combined when the network is unreliable.


The CAP Theorem

Imagine you are building an online shopping site. You have a database storing inventory counts. You replicate it across multiple data centers for fault tolerance. Now a network partition occurs: the network link between your U.S. and EU data centers goes down. Your EU data center is still accepting requests from EU customers, and your U.S. data center is doing the same for U.S. customers. Both are willing to serve reads and writes. But they cannot talk to each other.

A customer in London buys the last unit of a limited-edition item. A customer in New York buys it at the same time. You cannot coordinate across the partition. What do you do? If you reject the purchases until the partition heals, you sacrifice availability. If you accept both, you sacrifice consistency: you have sold the same last unit twice.

This tension is what Eric Brewer captured in his famous CAP theorem.

The theorem

Brewer presented the CAP conjecture in his keynote address at the 2000 Symposium on Principles of Distributed Computing (PODC). In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof.

The three properties CAP addresses are:

Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. This is specifically linearizability, which we defined in the consistency models section above.

Availability (A): Every request to a non-failing node receives a non-error response. The system continues to respond even if some nodes have failed.

Partition Tolerance (P): The system continues to operate despite arbitrary message delays or loss between nodes.

The theorem says that when a network partition occurs, a distributed system cannot simultaneously guarantee both consistency and availability. It must choose one to sacrifice. Networks do partition: links fail, packets get lost, routers misbehave. Any system deployed in the real world must be able to survive message loss, so partition tolerance is not something you can opt out of. The choice is between C and A when a partition actually happens.

What the CAP theorem is not saying

CAP is commonly summarized as “you can have at most two of consistency, availability, and partition tolerance.” This framing is a useful shorthand, but is imprecise in two ways.

  1. It implies that partition tolerance is a property you trade away like the others. It is not. Partitions happen regardless of your design choices, so the real choice is only between C and A, and only when a partition occurs. When the network is healthy, a well-designed system can provide both consistency and availability simultaneously with no conflict.

  2. Neither consistency nor availability is binary. Consistency spans a spectrum from linearizability down to eventual consistency. Availability is similarly a matter of degree. Real systems are not simply CP or AP but sit at various points along both dimensions, and can make different trade-offs for different operations.

Brewer revisited the theorem in 2012 and acknowledged that the simple “pick two” characterization glosses over a great deal of important nuance.

PACELC

The CAP theorem says nothing about what happens when the network is working normally. This is a significant omission because partitions are rare events.

If there is no partition, a system can provide both consistency (C) and availability (A); in that common case, the trade-off that matters day to day is between latency and consistency.

Daniel Abadi of Yale University described the PACELC framework in 2010 and formalized it in a 2012 paper. PACELC stands for:

Partition, Availability, Consistency, Else, Latency, Consistency.

The full statement is: if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally, how does it trade off latency (L) and consistency (C)?

The latency-consistency trade-off is fundamental and unavoidable. If your system provides strong consistency (linearizability), every write must be acknowledged by a quorum of replicas before the client sees a response. That coordination takes time: at least one network round-trip to the nearest quorum member. In a system with replicas across data centers that could be hundreds of milliseconds. If you relax consistency and allow stale reads, you can respond immediately from a local replica.
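The quorum arithmetic behind this trade-off is simple (an illustrative sketch; the function name is invented): with N replicas, a write waits for W acknowledgments and a read queries R replicas. Every read is guaranteed to intersect the latest committed write only when R + W > N, so shrinking R or W buys latency at the cost of possibly stale reads.

```python
def quorums_overlap(n, w, r):
    """True if any read quorum of size r must intersect any write quorum of size w."""
    return r + w > n

# Strong reads: N=5, W=3, R=3 -> every read quorum shares at least one
# replica with every write quorum, so a read sees the latest write.
# Fast but stale: N=5, W=1, R=1 -> a read may land on a replica the
# latest write never reached.
```

This is exactly the PACELC "else" branch: even with no partition, choosing W and R large enough to overlap means every operation waits on remote round-trips.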

PACELC classifies systems along two axes:

Partition behavior        Normal-case behavior      Examples
PA (favor availability)   EL (favor latency)        Dynamo, Cassandra (default), Riak
PA (favor availability)   EC (favor consistency)    MongoDB (in some configurations)
PC (favor consistency)    EC (favor consistency)    HBase, Google Spanner, VoltDB

This table is more useful for system design decisions than the blunt CAP categories, because it forces you to think about the normal-case trade-off rather than just the partition-failure case. Note that the systems listed as examples are dependent on specific configurations, so don’t treat them as hard categories.


BASE

BASE was coined by Dan Pritchett, then a Technical Fellow at eBay, in a 2008 ACM Queue article titled “BASE: An ACID Alternative.” The acronym stands for Basically Available, Soft state, Eventually consistent.

Pritchett was describing the design philosophy that eBay and similar large-scale web systems were forced to adopt as their transaction volumes grew beyond what 2PC-backed ACID databases could handle. The same philosophy had been independently developing at Amazon, Google, and other internet-scale companies. It is, in essence, a practical response to the constraints that CAP and PACELC reveal: if you cannot have both consistency and availability during partitions, and if strong consistency imposes latency costs even without partitions, you design systems that accept weaker consistency in exchange for higher availability and lower latency.

Basically Available means the system prioritizes responding to requests over ensuring consistency. In the presence of failures or partitions, the system returns some response, even if it might be stale or incomplete. Rather than failing a request because one replica is unreachable, the system serves the request from what is available.

Soft State means the state of the system is not stable over time, even without input. Because updates propagate asynchronously and replicas reconcile changes at different times, the system’s state is in flux. You cannot assume that what you read will remain unchanged.

Eventual Consistency is the convergence guarantee: given enough time without new updates, all replicas will converge to the same value.

BASE is not a precise protocol like ACID. It is a design philosophy that accepts weaker consistency guarantees in exchange for availability and scale. It shifts the burden of handling inconsistency from the database to the application.

The chemistry pun is intentional, as acid and base are chemical opposites.


ACID vs. BASE

The choice between ACID and BASE is driven by the application’s requirements.

If your application involves financial transfers, medical record updates, or anything where a partial update could cause real harm, you need ACID. The cost in latency and throughput is worth it.

If your application involves social media feeds, product recommendations, shopping cart contents, or other cases where a slightly stale or inconsistent read is tolerable, BASE systems offer dramatically better scalability and availability.

Many modern systems are hybrid. A payment service might use strict ACID transactions for the account debit and credit, while using an eventually consistent store for the transaction history display. The art is in identifying which parts of your system actually require strong consistency and which do not.


Summary

We built up our understanding of distributed transactions from concrete mechanisms to formal abstractions.



  1. Stable storage is any storage system that can survive system crashes, power outages, and reboots. It’s typically a file system where the application makes sure that the file contents have been written to the disk rather than queued up in a memory buffer. On Linux systems, you can open a file with the O_SYNC flag or use the fsync() system call. The true guarantee of stability also depends on the file system, device, controller, and how write caches are handled.