Transactions and ACID
A transaction is a sequence of operations treated as a single logical unit of work. A transaction either commits (all changes made permanent) or aborts (all changes rolled back). Transactions are specifically designed to be abortable: aborting leaves no partial state behind. The write-ahead log makes this work: changes are written to a sequential log before being applied to data, allowing recovery to redo committed transactions and undo incomplete ones after a crash.
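The redo/undo recovery idea can be sketched in a few lines. This is an illustrative toy, not a real storage engine: the log is a Python list of records, and recovery simply replays writes from transactions that have a COMMIT record, discarding everything else.

```python
# Minimal write-ahead-log recovery sketch (illustrative only): every change
# is appended to the log before touching the data store, so after a crash
# we can redo committed transactions and ignore (undo) incomplete ones.

def recover(log):
    """Rebuild the data store from a log of (txid, op, ...) records."""
    committed = {txid for txid, op, *_ in log if op == "COMMIT"}
    data = {}
    for txid, op, *args in log:
        if op == "WRITE" and txid in committed:   # redo committed writes
            key, value = args
            data[key] = value
        # writes from transactions with no COMMIT record are discarded
    return data

log = [
    ("T1", "WRITE", "x", 1),
    ("T1", "COMMIT"),
    ("T2", "WRITE", "y", 2),   # T2 crashed before committing
]
print(recover(log))  # {'x': 1}
```

Real WALs also log before-images (or use a no-steal policy) so that dirty pages flushed early can be rolled back, but the redo-committed/undo-incomplete principle is the same.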
The ACID properties define correctness for transactions: Atomicity (all-or-nothing), Consistency (valid state to valid state), Isolation (concurrent transactions do not observe each other’s partial results), and Durability (committed changes survive crashes). These properties are straightforward to provide on a single machine. In a distributed setting, each property requires expensive coordination: atomicity needs 2PC, isolation needs distributed locking, and the availability cost multiplies with every participant.
Note that the “consistency” in ACID refers to application-level data integrity constraints. The “consistency” in distributed systems and the CAP theorem refers to what values a read is allowed to return across replicas. The two uses of the word are unrelated.
Concurrency Control
Concurrency control enforces isolation. The standard goal is serializability: concurrent transactions must produce results equivalent to some serial execution. A schedule is a sequence of interleaved operations from concurrent transactions; a serializable schedule is one equivalent to some serial execution.
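One standard way to test a schedule for (conflict-)serializability is the precedence graph: draw an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj on the same item; the schedule is conflict-serializable iff the graph is acyclic. A minimal sketch, with schedules represented as lists of (transaction, operation, item) tuples:

```python
# Conflict-serializability check sketch (illustrative). Two operations
# conflict if they come from different transactions, touch the same item,
# and at least one of them is a write.

def precedence_edges(schedule):
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (oi, oj):
                edges.add((ti, tj))   # Ti's conflicting op precedes Tj's
    return edges

def is_conflict_serializable(schedule):
    edges = precedence_edges(schedule)
    nodes = {t for t, _, _ in schedule}
    # Acyclicity check: repeatedly remove nodes with no incoming edge.
    while nodes:
        removable = {n for n in nodes if not any(b == n for _, b in edges)}
        if not removable:
            return False              # a cycle remains -> not serializable
        nodes -= removable
        edges = {(a, b) for a, b in edges if b not in removable}
    return True

# Interleaved but conflict-free on shared items -> serializable:
ok = [("T1", "R", "x"), ("T2", "R", "y"), ("T1", "W", "x"), ("T2", "W", "y")]
# T1 -> T2 on x and T2 -> T1 on y form a cycle -> not serializable:
bad = [("T1", "R", "x"), ("T2", "W", "x"), ("T2", "R", "y"), ("T1", "W", "y")]
print(is_conflict_serializable(ok))   # True
print(is_conflict_serializable(bad))  # False
```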
There are two main approaches: pessimistic concurrency control assumes conflicts are likely and prevents them using locks; optimistic concurrency control assumes conflicts are rare and checks for them only at commit time.
Two-phase locking (2PL) is the standard pessimistic protocol. A transaction has a growing phase (acquires locks, releases none) and a shrinking phase (releases locks, acquires none). This rule prevents the interleavings that produce inconsistent reads. Strict 2PL holds write locks until commit or abort, preventing cascading aborts. SS2PL holds all locks until commit or abort and is implemented by many commercial databases.
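The growing/shrinking discipline can be captured in a few lines. A minimal sketch (class and method names are illustrative, not from any real lock manager): the first release flips the transaction into its shrinking phase, after which any acquire is a protocol violation.

```python
# Toy two-phase-locking discipline: acquire only while growing;
# the first release ends the growing phase permanently.

class TwoPhaseTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after first release")
        self.held.add(item)

    def release(self, item):
        self.shrinking = True          # entering the shrinking phase
        self.held.discard(item)

t = TwoPhaseTxn()
t.acquire("A")
t.acquire("B")
t.release("A")                         # growing phase is now over
try:
    t.acquire("C")                     # violates the two-phase rule
except RuntimeError as e:
    print(e)
```

Strict 2PL and SS2PL sidestep the early-release problem entirely by never entering the shrinking phase until commit or abort.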
Read (shared) locks allow multiple concurrent readers. Write (exclusive) locks grant exclusive access. Multiple read locks can coexist; a write lock conflicts with all other locks.
Optimistic concurrency control (OCC) proceeds in three phases: working (reads and writes go to a private workspace, no locks held), validation (check for conflicts at commit time), and update (apply the workspace if validation passes). OCC is deadlock-free and efficient when conflicts are rare, but transactions may be aborted and restarted after completing all their work.
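The three phases can be sketched with a versioned store: each key maps to (value, version), reads record the version they saw, and validation aborts if any read item has since been overwritten by a committed transaction. All names here are illustrative; real OCC validation is more careful about concurrently validating transactions.

```python
# OCC sketch: reads and writes go to a private workspace; commit-time
# validation checks that nothing we read changed underneath us.

store = {"x": (10, 0), "y": (20, 0)}   # key -> (value, version)

def begin():
    return {"read_versions": {}, "writes": {}}

def read(txn, key):
    value, version = store[key]
    txn["read_versions"].setdefault(key, version)
    return txn["writes"].get(key, value)   # read-your-own-writes

def write(txn, key, value):
    txn["writes"][key] = value             # private workspace, no locks

def commit(txn):
    # Validation phase: abort if any read item changed since we read it.
    for key, seen in txn["read_versions"].items():
        if store[key][1] != seen:
            return False
    # Update phase: install the workspace, bumping versions.
    for key, value in txn["writes"].items():
        _, version = store.get(key, (None, -1))
        store[key] = (value, version + 1)
    return True

t1 = begin()
write(t1, "x", read(t1, "x") + 1)
t2 = begin()
write(t2, "y", read(t2, "x") * 2)   # t2 reads x before t1 commits
print(commit(t1))   # True: t1 installs x = 11
print(commit(t2))   # False: t2's read of x is stale; t2 must restart
```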
Multi-Version Concurrency Control (MVCC) maintains multiple versions of each data item. A common design gives each transaction a snapshot at start time and returns the newest committed version visible in that snapshot. This is called snapshot isolation. Different systems implement the exact visibility rules differently, but reads never block because they always draw from a stable snapshot. Write-write conflicts are resolved at commit time by a first-committer-wins rule.
Deadlock
Deadlock arises from locking: a set of transactions each hold locks needed by another in the set, forming a cycle nobody can break. Four conditions must hold simultaneously: mutual exclusion, hold and wait, non-preemption, and circular wait. OCC and MVCC do not hold blocking locks and are therefore deadlock-free.
The wait-for graph (WFG) represents lock dependencies: an edge from T1 to T2 means T1 is waiting for a lock held by T2. A cycle indicates deadlock. In a distributed system, each node sees only its local WFG edges; a deadlock can span multiple machines with no single node seeing the full cycle.
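On a single node, deadlock detection is just cycle detection on the WFG. A minimal DFS sketch, with the graph as a dict from each transaction to the transactions it waits for:

```python
# Wait-for-graph cycle detection sketch: edge T1 -> T2 means "T1 waits
# for a lock held by T2"; any cycle means deadlock.

def has_deadlock(wfg):
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wfg.get(t, ()):
            if u in on_stack or (u not in visited and dfs(u)):
                return True            # back edge -> cycle -> deadlock
        on_stack.discard(t)
        return False

    return any(dfs(t) for t in wfg if t not in visited)

print(has_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": ["T3"]}))                # False
```

The distributed difficulty is exactly that no single node holds this dict: each node sees only its local edges, which is what motivates the detection algorithms below.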
Three practical approaches exist. Ignoring deadlocks relies on application-level timeouts; this is acceptable in some systems but not in transactional databases, where a slow-but-live node triggers the same timeout as a genuinely deadlocked one. Detection finds cycles after they form; prevention makes cycles structurally impossible.
Centralized detection has one node collect all local WFGs and search for cycles. It is simple but produces phantom deadlocks: false positives caused by asynchronous snapshot collection, where an edge appears in the global graph after the underlying lock has already been released. The Chandy-Misra-Haas algorithm avoids the global snapshot by chasing edges with probe messages. A blocked transaction T0 sends a probe to the node holding the resource. The probe propagates along dependency edges; if it returns to T0, a cycle exists. When a deadlock is confirmed, the system aborts at least one transaction, typically the youngest or the one that has done the least work.
Timestamp-based prevention assigns each transaction a unique timestamp at start time and uses it to decide who waits and who aborts on a conflict, ensuring that WFG edges always point in the same direction so cycles are impossible. The two standard schemes are wait-die and wound-wait.
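The core decision rule fits in one function. A sketch of the wait-die variant (one of the two schemes; you do not need to memorize which rule is which): on a conflict, an older requester may wait for a younger holder, but a younger requester aborts rather than wait, so every wait edge points from older to younger and no cycle can form.

```python
# Wait-die sketch (illustrative): smaller timestamp = older transaction.
# An aborted transaction restarts with its ORIGINAL timestamp, so it
# eventually becomes the oldest and cannot starve.

def on_conflict(requester_ts, holder_ts):
    if requester_ts < holder_ts:
        return "WAIT"   # older waits for younger: edge points old -> young
    return "DIE"        # younger aborts instead of waiting

print(on_conflict(1, 2))  # WAIT  (requester is older)
print(on_conflict(2, 1))  # DIE   (requester is younger; abort, no cycle)
```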
Two-Phase Commit (2PC)
Two-Phase Commit ensures all nodes in a distributed transaction either all commit or all abort.
The protocol uses a coordinator-participant model. In Phase 1 (Prepare/Voting), the coordinator sends a PREPARE message to all participants. Each participant checks whether it can commit, writes a prepare record to stable storage, and responds YES or NO. A YES vote is a durable promise: the participant must be able to commit even after a crash. If a participant fails to respond, the coordinator waits and keeps retrying; it does not treat silence as a NO. The protocol assumes a fail-recover model.
In Phase 2 (Commit or Abort), if all participants voted yes, the coordinator writes a commit record and broadcasts COMMIT. If any participant voted no, the coordinator broadcasts ABORT. Participants execute the decision and release their locks.
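The two phases can be sketched end to end. This is an illustrative toy (class and method names are made up): participant logic is reduced to a vote, and the durable logging, retries, and crash recovery that make real 2PC work are elided.

```python
# Two-phase-commit coordinator sketch: unanimous YES commits; any NO aborts.

def two_phase_commit(participants):
    # Phase 1 (Prepare/Voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    # Phase 2 (Commit/Abort): broadcast the decision; participants apply it.
    for p in participants:
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = None
    def prepare(self):
        # A real participant logs its YES vote to stable storage first:
        # the vote is a durable promise that survives a crash.
        return "YES" if self.can_commit else "NO"
    def finish(self, decision):
        self.state = decision          # apply decision, release locks

ps = [Participant(True), Participant(True), Participant(False)]
print(two_phase_commit(ps))       # ABORT: a single participant can veto
print(two_phase_commit(ps[:2]))   # COMMIT: unanimous YES
```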
2PC requires unanimous agreement, not a majority. Any single participant can veto, and any unresponsive participant blocks the protocol indefinitely. This is inherent: the transaction must complete everywhere or nowhere, so no majority shortcut is available.
The critical vulnerability is the uncertain state: a participant that has voted yes but has not received the coordinator’s decision cannot unilaterally commit or abort. If the coordinator fails in this window, all participants block with locks held. 2PC is a blocking protocol. The practical remedy is to replace the single coordinator with a Raft- or Paxos-replicated group.
Availability cost: 2PC chains availability multiplicatively. At 99.9% per database, a five-database transaction has only 0.999^5 ≈ 99.5% availability, close to two days of downtime per year. This cost is a major driver of BASE-style architectures at internet scale.
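The arithmetic behind that figure: a transaction spanning n independent databases is available only when all n are, so per-database availabilities multiply.

```python
# Multiplicative availability of a multi-database transaction.
per_db = 0.999                    # 99.9% availability per database
n = 5
combined = per_db ** n
downtime_hours = (1 - combined) * 365 * 24
print(f"{combined:.4f}")                    # 0.9950
print(f"{downtime_hours:.1f} hours/year")   # about 43.7 hours/year
```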
Three-Phase Commit (3PC)
Three-Phase Commit was designed to eliminate 2PC’s blocking behavior. It inserts a PreCommit phase between voting and the final commit so a recovery coordinator can determine the intended decision: if any participant has seen PreCommit, commit; if none has, abort.
3PC eliminates blocking under single-node failures but assumes a synchronous network with bounded message delay. A partition during the PreCommit phase can cause split-brain behavior. Because of this, 3PC is not used in practice.
How 2PC Relates to Raft, Paxos, and Virtual Synchrony
Virtual Synchrony provides atomic multicast within a process group and is fast, but cannot survive network partitions. Raft and Paxos are fault-tolerant consensus algorithms that use majority agreement and are useful for making the 2PC coordinator fault-tolerant, but neither has a concept of a participant vetoing a decision and neither can substitute for 2PC itself. 2PC uses unanimous agreement and is designed specifically for transactional atomicity. In practice, Raft or Paxos replicates state within each participant group, and 2PC coordinates across groups.
ACID
ACID formalizes the correctness requirements that concurrency control and 2PC are designed to satisfy.
Atomicity is all-or-nothing; 2PC achieves this across multiple nodes. Consistency ensures valid state transitions; the database enforces integrity constraints by aborting violations. Isolation prevents concurrent interference; locking, OCC, and MVCC enforce this. Durability ensures committed changes survive crashes; the write-ahead log provides this.
Consistency Models
Consistency models define what values a read is allowed to return given a history of writes across replicas. Stronger models give more intuitive guarantees at higher coordination cost.
Linearizability is the strongest practical model. Every operation appears to take effect instantaneously at some point between its invocation and completion, in an order consistent with real time. Two key ideas follow from this:
- Each operation appears atomic (all at once, not partially), and
- if one operation finishes before another begins, the first must appear earlier everywhere.
Linearizability does not require wall-clock timestamps for overlapping operations: if two operations overlap in time, either order is valid; only non-overlapping operations must respect real-time ordering. etcd provides linearizability for all operations; ZooKeeper provides linearizable writes but sequentially consistent reads by default. Linearizability is the definition of “C” in CAP.
Sequential consistency relaxes the real-time requirement of linearizability. There must exist some global total order of all operations consistent with each process’s program order, but that order need not match wall-clock time.
Under linearizability, if a write completes before a read begins, the read must see that write. Under sequential consistency, the system may order them differently as long as each client’s own program order is preserved.
Causal consistency only requires causally related operations to appear in the same order for all processes. Causally independent operations may be seen in different orders. Vector clocks implement this model.
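The vector-clock comparison that underlies causal consistency can be sketched directly. A clock is a dict from process id to counter; one event causally precedes another iff its clock is component-wise less than or equal (and not equal overall); otherwise the events are concurrent and replicas may apply them in either order.

```python
# Vector-clock comparison sketch (illustrative): determines causal order
# between two events, or reports them as concurrent.

def compare(a, b):
    procs = set(a) | set(b)
    le = all(a.get(p, 0) <= b.get(p, 0) for p in procs)
    ge = all(a.get(p, 0) >= b.get(p, 0) for p in procs)
    if le and ge:
        return "equal"
    if le:
        return "a -> b"        # a causally precedes b
    if ge:
        return "b -> a"
    return "concurrent"        # causally independent: any order is allowed

print(compare({"p1": 1}, {"p1": 2, "p2": 1}))  # a -> b
print(compare({"p1": 2}, {"p2": 1}))           # concurrent
```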
Eventual consistency is the weakest useful model. All replicas will eventually converge if no new updates arrive, but there is no constraint on what reads return in the interim.
Serializability and Linearizability
Serializability is a property of transactions (multi-step, multi-object operations). It requires that concurrent transactions produce results equivalent to some serial execution, with no real-time constraint on which order is chosen.
Linearizability is a property of individual operations on a single object. It requires each operation to appear instantaneous in an order consistent with real time.
The two properties are independent: a database can be serializable without being linearizable, and a key-value store can be linearizable without supporting transactions.
The CAP Theorem
The CAP theorem states that when a network partition occurs, a distributed system cannot simultaneously guarantee both consistency (specifically linearizability) and availability. Partition tolerance is not a design choice; real networks partition. The practical choice is between C and A when a partition actually happens.
- CP systems return errors during a partition rather than serve potentially stale data.
- AP systems continue to serve requests but may return stale data.
CAP is commonly summarized as “you can have at most two of C, A, and P.” That framing is imprecise: partition tolerance is a constraint imposed by the network, not a property you trade away. The C-versus-A trade-off only arises during a partition; when the network is healthy, a well-designed system can provide both.
PACELC
PACELC extends CAP by observing that even during normal operation, there is a trade-off between latency and consistency. Strong consistency requires coordinating writes across a quorum before responding, which adds latency. Eventual consistency allows a local replica response, which is faster. PACELC classifies systems as PA/EL, PA/EC, or PC/EC. Cassandra and Dynamo are PA/EL; Spanner and HBase are PC/EC.
BASE
BASE (Basically Available, Soft State, Eventually Consistent) is the design philosophy adopted by large-scale internet systems as a response to the constraints revealed by CAP and PACELC. If strong consistency imposes availability costs during partitions and latency costs during normal operation, you design systems that accept weaker consistency in exchange for higher availability and lower latency. BASE is not a protocol; it shifts the burden of handling inconsistency from the system to the application.
ACID vs. BASE
The choice is driven by application requirements. Financial transfers and medical records need ACID; social feeds and product catalogs can tolerate BASE semantics. Many modern systems are hybrid, using ACID for core transactional data and BASE for derived or display data.
Key Relationships
| Concept | Notes |
|---|---|
| Concurrency control | Pessimistic (2PL/SS2PL) vs. optimistic (OCC) vs. versioned (MVCC) |
| Commit protocol | 2PC (unanimous, blocking, practical) vs. 3PC (non-blocking but impractical) |
| Consistency model strength | Linearizability → Sequential → Causal → Eventual |
| CAP choice during partition | CP (block rather than serve stale) vs. AP (serve stale rather than block) |
| Design philosophy | ACID (correctness over availability) vs. BASE (availability over consistency) |
What You Don’t Need to Study
The following topics are covered in the lecture notes for completeness but will not appear on exams or homework.
- Sagas. A microservices design pattern worth knowing about, but outside the scope of this course.
- The difference between wait-die and wound-wait. You should know that timestamp-based schemes prevent deadlock by construction; you do not need to memorize which rule applies to which case.
- Strong serializability. You should understand that serializability and linearizability are independent properties; their combination is covered in the notes but is not a focus of assessment.
- Sequential consistency. You should know that linearizability requires real-time ordering and that causal consistency requires ordering of causally related operations. Sequential consistency sits between them but is not the target model of any common system you are likely to encounter.
- The distinction between strict 2PL and SS2PL. You should know that holding all locks until commit or abort prevents cascading aborts and is standard practice. You do not need to know the formal naming difference between the two variants.
- CRDTs and Strong Eventual Consistency (SEC). You should know that eventually consistent systems need a strategy for reconciling conflicting concurrent writes. CRDTs and SEC are one principled answer, but the details are beyond the scope of this course.
- Leases. A useful concept in distributed lock management, but not a focus of this course.