Distributed Transactions

Two-phase commit

Paul Krzyzanowski

October 2007


We've looked at a number of low-level techniques that can be used for managing synchronization in a distributed environment: algorithms for mutual exclusion and critical section management. In addition (and we'll look at these later), there are algorithms for deadlock resolution and crash recovery. Much as remote procedure calls allowed us to concentrate on the functionality of a program and express it in a more natural way than sends and receives, we crave a higher level of abstraction for dealing with issues of synchronization. This brings us to the topic of atomic transactions (also known colloquially simply as transactions).

In transacting business, all parties involved may have to go through a number of steps in negotiating a contract but the end result of the transaction won't be committed until both parties sign on the dotted line. If even one of the parties reconsiders and aborts, the contract will be forgotten and life goes on as before.

Consider, for example, the purchase of a house. You express your interest in purchasing a house by making an offer (and possibly putting some money down with a trusted party). At that point, you have not bought the house, but you have entered the transaction of purchasing a house. You may have things to do (such as getting a mortgage and inspection) and the seller may have things to do (such as fixing up certain flaws). If something goes wrong (you can't get a mortgage, the seller won't fix the heating system, you find the house is sitting on a fault line, the seller won't remove the black velvet wallpaper, ...), then the transaction is cancelled (aborted) and both parties go back to life as before: you look for another house and the seller remains in the house, possibly still trying to sell it. If, however, the transaction is not aborted and both parties sign the contract on the closing day, it is made permanent. The deed is signed over and you own the house. If the seller changes her mind at this point, she'll have to try to buy back the house. If you change your mind, you'll have to sell the house.

The concept of a transaction in the realm of computing is quite similar. One process announces that it's beginning a transaction with one or more processes. Certain actions take place. When all processes commit, the results are permanent. Until they do so, any process may abort (if something fails, for example). In that case, the state of computing reverts to the state before the transaction began: all side effects are gone. A transaction has an all-or-nothing property.

The origins of transactions in computing date back to the days of batch jobs scheduled to process tapes. A day's worth of "transactions" would be logged on a tape. At the end of the day, a merge job would be run with the original database tape and the transaction tape as inputs, producing a new tape with all the transactions applied. If anything went wrong, the original database tape was unharmed. If the merge succeeded, the original tapes could be reused.

Transaction model

A process that wishes to use transactions must be aware of certain primitives associated with them. These primitives are:

  1. begin transaction - mark the start
  2. end transaction - mark the end; try to commit
  3. abort transaction - kill transaction, restore old values
  4. read data from / write data to an object (a file, for example).

In addition, ordinary statements, procedure calls, etc. are allowed in a transaction.

To get a flavor for transactions, consider booking a flight from Newark, New Jersey to Ridgecrest, California. The destination requires us to land at Inyokern airport, and non-stop flights are not available:

begin transaction
1. reserve a seat for Newark to Denver (EWR→DEN)
2. reserve a seat for Denver to Los Angeles (DEN→LAX)
3. reserve a seat for Los Angeles to Inyokern (LAX→IYK)
end transaction

Suppose there are no seats available on the LAX→IYK leg of the journey. In this case, the transaction is aborted, reservations for (1) and (2) are undone, and the system reverts to the state before the reservation was made.
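The reservation example can be sketched in code. Everything here (the SeatStore class, the seat counts, and the undo-log mechanism) is hypothetical and invented for illustration; the point is only how begin/abort/end bracket the three reservations and how an abort restores the old values:

```python
# Hypothetical sketch of the flight-booking transaction using the four
# primitives above. SeatStore and its contents are illustrative, not a real API.

class AbortTransaction(Exception):
    pass

class SeatStore:
    def __init__(self, seats):
        self.seats = dict(seats)   # leg -> seats remaining
        self.undo_log = []         # (leg, old value) pairs

    def begin_transaction(self):
        self.undo_log = []

    def reserve(self, leg):
        if self.seats[leg] == 0:
            raise AbortTransaction(leg)
        self.undo_log.append((leg, self.seats[leg]))  # remember old value
        self.seats[leg] -= 1

    def abort_transaction(self):
        # kill the transaction: restore old values, most recent first
        for leg, old in reversed(self.undo_log):
            self.seats[leg] = old
        self.undo_log = []

    def end_transaction(self):
        self.undo_log = []         # commit: the changes become permanent

store = SeatStore({"EWR-DEN": 5, "DEN-LAX": 2, "LAX-IYK": 0})
store.begin_transaction()
try:
    for leg in ("EWR-DEN", "DEN-LAX", "LAX-IYK"):
        store.reserve(leg)
    store.end_transaction()
except AbortTransaction:
    store.abort_transaction()      # LAX-IYK is full: undo legs (1) and (2)

# all three legs are back to their pre-transaction seat counts
assert store.seats == {"EWR-DEN": 5, "DEN-LAX": 2, "LAX-IYK": 0}
```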

Properties of transactions

The properties of transactions are summarized with the acronym ACID, which stands for Atomic, Consistent, Isolated, and Durable.

Atomic
Either an entire transaction happens completely or not at all. If the transaction does happen, it happens as a single indivisible action. Other processes cannot see intermediate results. For example, suppose we have a file that is 100 bytes long and a transaction begins appending to it. If other processes read the file, they only see the 100 bytes. At the end of the transaction, the file instantly grows to its new size.
Consistent
If the system has certain invariants, they must hold after the transaction (although they may be broken within the transaction). For example, in some banking application, the invariant may be that the amount of money before a transaction must equal the amount of money after the transaction. Within the transaction, this invariant may be violated but this is not visible outside the transaction.
Isolated (or serializable)
If two or more transactions are running at the same time, to each of them and to others, the final result looks as though all transactions ran sequentially in some order.

An order of running transactions is called a schedule. Orders may be interleaved. If no interleaving is done and the transactions are run in some sequential order, they are serialized.

Consider the following three (small) transactions, each operating on a shared variable x:

transaction 1: begin; x = 0; x = x + 1; end
transaction 2: begin; x = 0; x = x + 2; end
transaction 3: begin; x = 0; x = x + 3; end

Some possible schedules are (with time flowing from left to right):

schedule 1:  x=0  x=x+1  x=0  x=x+2  x=0  x=x+3    final x = 3   legal: yes
schedule 2:  x=0  x=0  x=x+1  x=x+2  x=0  x=x+3    final x = 3   legal: yes
schedule 3:  x=0  x=0  x=x+1  x=0  x=x+2  x=x+3    final x = 5   legal: NO

Schedule 3 is not serializable: no sequential ordering of the three transactions can produce a final value of 5.
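The schedules can be replayed mechanically to confirm the final values of x. This small sketch is not part of the original example; it simply executes each interleaving in order:

```python
# Replay the three interleavings above and report the final value of x.
schedules = {
    1: ["x=0", "x=x+1", "x=0", "x=x+2", "x=0", "x=x+3"],
    2: ["x=0", "x=0", "x=x+1", "x=x+2", "x=0", "x=x+3"],
    3: ["x=0", "x=0", "x=x+1", "x=0", "x=x+2", "x=x+3"],
}

def run(ops):
    env = {}
    for op in ops:
        exec(op, env)     # each op is a tiny assignment statement
    return env["x"]

results = {n: run(ops) for n, ops in schedules.items()}
print(results)            # {1: 3, 2: 3, 3: 5} -- schedule 3 is not serializable
```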
Durable
Once a transaction commits, the results are made permanent. No failure after a commit can undo results or cause them to get lost. [Conversely, the results are not permanent until a transaction commits.]

Nested transactions

Transactions may themselves contain subtransactions (nested transactions). A top-level transaction may fork off children that run in parallel with each other. Any or all of these may execute subtransactions.

The problem with this is that the subtransactions may commit but, later in time, the parent may abort. Now we find ourselves having to undo the committed transactions. The level of nesting (and hence the level of undoing) may be arbitrarily deep. For this to work, conceptually, each subtransaction must be given a private copy of every object it may manipulate. On commit, the private copy displaces its parent's universe (which may be a private copy of that parent's parent).


Implementation

We cannot simply allow a transaction to update, in place, the objects (files, database records, et cetera) that it uses; the transaction would not be atomic in that case. One way of supporting atomicity is to provide a private workspace. When a process starts a transaction, it is given a private workspace containing all the objects to which it has access. On a commit, the private workspace becomes the real workspace. Clearly this is an expensive proposition: it requires us to copy everything that the transaction may modify (every file, for example). However, it's not as bleak as it looks. A number of optimizations can make this a feasible solution.

Suppose that a process (transaction) reads a file but doesn't modify it. In that case it doesn't need a copy. The private workspace can be empty except that it contains a pointer back to the parent's workspace. How about writing a file? On an open, don't copy the file to the private workspace but just copy the index (information of where the file's data is stored; a UNIX inode, for example). The file is then read in the usual way. When a block is modified, a local copy is made and the address for the copied block is inserted into the index. New blocks (appends) work this way too. Privately allocated blocks are called shadow blocks.

If this transaction was to abort, the private blocks go back on the free list and the private space is cleaned up. Should the transaction commit, the private indices are moved into the parent's workspace (atomically). Any parent blocks that would be overwritten are freed.
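A minimal sketch of the shadow-block idea, assuming an invented in-memory BlockStore in place of a real file system: the transaction copies only the file's index, and a data block is duplicated only when it is first written.

```python
# Illustrative private workspace with shadow blocks. A "file" is an index
# (a list of block ids) plus a shared block store; these names are invented.

class BlockStore:
    def __init__(self):
        self.blocks = {}
        self.next_id = 0
    def alloc(self, data):
        self.next_id += 1
        self.blocks[self.next_id] = data
        return self.next_id

class Workspace:
    def __init__(self, store, parent_index):
        self.store = store
        self.index = list(parent_index)   # copy the index, not the data
        self.shadows = set()              # privately allocated block ids

    def read(self, i):
        return self.store.blocks[self.index[i]]   # reads go through the index

    def write(self, i, data):
        new_id = self.store.alloc(data)   # allocate a shadow block
        self.shadows.add(new_id)
        self.index[i] = new_id            # point the private index at it

    def abort(self):
        for b in self.shadows:            # shadow blocks go back on the free list
            del self.store.blocks[b]

    def commit(self, parent_index):
        parent_index[:] = self.index      # private index displaces the parent's

store = BlockStore()
parent = [store.alloc(b"old0"), store.alloc(b"old1")]
ws = Workspace(store, parent)
ws.write(1, b"new1")
# the transaction sees its write; the parent still sees the old block
assert ws.read(1) == b"new1" and store.blocks[parent[1]] == b"old1"
ws.commit(parent)
assert store.blocks[parent[1]] == b"new1"
```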

Another mechanism for ensuring that transactions can be undone (and possibly redone) is the use of a write-ahead log, also known as an intentions list. With this system, objects are modified in place (proper locking should be observed if other processes are to access these objects). Before any data (e.g. block, memory page) is changed, a record is written to the write-ahead log in stable storage. The record identifies the transaction (with an ID number), the block or page modified, and the old and new values.

If the transaction succeeds (i.e., commits), a commit record is written to the log. If the transaction aborts, the log is used to back up to the original state (this is called a rollback). The write-ahead log can also be played forward for crash recovery (this becomes useful in the two-phase commit protocol, which is discussed next). A term associated with the write-ahead log is stable storage. This is intended to be a data repository that can survive system crashes. After a datum is written to stable storage, it is retrievable even if the system crashes immediately after the write. A disk is suitable for stable storage, but it is important that any writes are immediately flushed to the disk and not linger in the memory (unstable) buffer cache.
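A toy illustration of the write-ahead log idea (the record format and function names are invented for this sketch): each record carries the transaction ID, the item changed, and its old and new values, so an aborted transaction can be rolled back from the log alone.

```python
# Minimal write-ahead log sketch. The in-memory list stands in for stable
# storage; a real implementation would flush each record to disk first.

data = {"a": 100, "b": 200}
log = []   # records: (transaction id, key, old value, new value)

def wal_write(txid, key, new_value):
    log.append((txid, key, data[key], new_value))  # log BEFORE changing
    data[key] = new_value                          # then modify in place

def rollback(txid):
    # undo this transaction's changes, most recent first
    for tid, key, old, _ in reversed(log):
        if tid == txid:
            data[key] = old

wal_write(1, "a", 50)
wal_write(1, "b", 250)
rollback(1)                # abort: back up to the original state
assert data == {"a": 100, "b": 200}
```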

The two-phase commit protocol

(Gray, 1978)

In a distributed system, a transaction may involve multiple processes on multiple machines. Even in this environment, we still need to preserve the properties of transactions and achieve an atomic commit (either all processes involved in the transaction commit or else all of them will abort the transaction - it will be unacceptable to have some commit and some abort). A protocol that achieves this atomic commit is the two-phase commit protocol.

In implementing this protocol, we assume that one process will function as the coordinator and the rest as cohorts (the coordinator may be the one that initiated the transaction, but that's not necessary). We further assume that there is stable storage and a write-ahead log at each site. Furthermore, we assume that no machine involved crashes forever.

The protocol works as follows (the coordinator is ready to commit and needs to ensure that everyone else will do so as well):

Phase 1 (voting phase):

Coordinator: write a prepare to commit message to the log; send the prepare to commit message to all cohorts; wait for a reply from every cohort.

Cohort: work on the transaction; when done, wait for a prepare to commit message. On receiving it, and when the transaction is ready to commit, write agree to commit (or abort) to the log and send an "agree" (or "abort") reply.

Phase 2 (commit phase):

Coordinator: write the commit (or abort) message to the log; send the commit (or abort) message to all cohorts; wait for all cohorts to respond; then clean up all state. Done.

Cohort: wait for the commit (or abort) message. If a commit was received, write "commit" to the log, release all locks and resources, and update the databases. If an abort was received, undo all changes. Send a done message.

The two-phase commit protocol works as follows. In phase 1, the coordinator sends a request to commit to all the cohorts and waits for a reply from each of them. Each reply is either an agreement or an abort. Note that nobody has committed at this point. Once the coordinator has received a reply from every cohort, it knows that all transaction-relevant computation is finished, so nothing more can happen to abort the transaction. The transaction can now be committed or, in the case that at least one of the parties could not complete its part, aborted. In the second phase, the coordinator distributes the decision and waits for all cohorts to commit (or abort). If aborting, an abort message is sent to everyone and the coordinator waits until every cohort responds with an acknowledgement. If committing, each cohort receives a commit message, commits locally, and sends an acknowledgement back. All message deliveries are reliable (retransmission after a timeout).
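The two phases can be illustrated with a single-process toy simulation. The cohort names are made up, and the sketch deliberately ignores timeouts, crashes, and retransmission; it shows only the voting rule (each cohort logs its vote before replying, and one abort vote aborts everyone) and the distribution of the decision.

```python
# Toy simulation of two-phase commit: cohorts maps a cohort name to whether
# that cohort is able to commit its part of the transaction.

def two_phase_commit(cohorts):
    logs = {c: [] for c in cohorts}          # per-site write-ahead logs

    # Phase 1: solicit votes; each cohort logs before replying
    votes = {}
    for c, ready in cohorts.items():
        vote = "agree" if ready else "abort"
        logs[c].append(vote)
        votes[c] = vote

    # The decision is commit only if every cohort agreed
    decision = "commit" if all(v == "agree" for v in votes.values()) else "abort"

    # Phase 2: distribute the decision; each cohort logs it on receipt
    for c in cohorts:
        logs[c].append(decision)
    return decision, logs

decision, logs = two_phase_commit({"A": True, "B": True, "C": False})
assert decision == "abort"                   # one abort vote aborts everyone
decision, _ = two_phase_commit({"A": True, "B": True})
assert decision == "commit"
```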

No formal proof of the correctness of the two-phase commit protocol will be given here, but inspecting it for correctness, it is readily apparent that if one cohort completes the transaction, all cohorts will eventually complete it. If a cohort is completing a transaction, it is because it received a commit message, which means that we're in the commit phase and all cohorts have agreed. This information is in permanent storage in case of a crash (that's why information is written to the log before a message is sent). If any system crashes, it can replay its log to find its latest state (so it will know, for example, whether it was ready to commit). When the coordinator is completing, it ensures that every cohort completes before the coordinator's state is cleaned up.

Some vocabulary

abort
an action which indicates that the transaction will not complete (commit). All changes are undone to the state before the transaction started.
commit
an action which indicates that the transaction has successfully completed. All changes to the database, files, and objects are made permanent.
commit protocol
a fault-tolerant algorithm which ensures that all sides in a distributed system either commit or abort a transaction unanimously.
log
a record of system activity recorded in sufficient detail so that a previous state of a process can be restored.
redo
given a log record, redo the action specified in the log.
stable storage
permanent storage to which we can do atomic writes.
transaction
an atomic action: some computation that reads and/or changes the state of one or more data objects and appears to take place indivisibly.
write-ahead log protocol
a method in which operations done on objects may be undone after restarting a system.