wiki · home

Efficient Optimistic Concurrency Control Using Loosely Synchronized Clocks

Summary

The paper describes an efficient optimistic concurrency control scheme for use in distributed database systems where objects are cached and manipulated at client machines while persistent storage and transactional support are provided by servers.

The Environment

The work has been done in the context of the Thor object oriented database. Thor allows user applications to share a universe of objects. These objects are encapsulated for safe sharing, and applications access them by invoking methods that observe or modify their object’s state. Application computations occur within transactions so persistent objects can be maintained consistently despite being accessed concurrently by several clients. Each application runs a single transaction at a time. It specifies when to commit the current transaction: the transaction includes all methods invoked by the application since the last commit point.

The applications run on client machines while persistent objects are stored on servers. There may be many servers and the server where a particular object is stored is referred as the object’s owner. New persistent objects may be created as result of the operations executed by an application and objects can also migrate between servers.

To improve performance, methods are executed on the client machine using cached versions of objects. Objects are fetched from their servers when needed, and when an object is fetched, a number of related objects are prefetched. The server tracks which objects are in the client cache; maintaining a table called cached set for each client that stores information about the cached objects.

    ┌────────┐                          ┌────────┐
    │ Client │       Applications       │ Client │
┌ ─┌┴────────┴┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┌┴────────┴┬ ─ ┐
   │Front End │          Thor          │Front End │
│  └──────────┘                        └──────────┘   │
         │                                   │
│        │                                   │        │
         │                                   │
│        │                                   │        │
     ────┴─────┬──────────────────────┬──────┴────
│              │                      │               │
               │                      │
│         ┌────────┐             ┌────────┐           │
          │        │             │        │
│         │ Server │             │ Server │           │
          │        │             │        │
│         └────────┘             └────────┘           │
               │                      │
│              │                      │               │
            ┌────┐                 ┌────┐
│           │Disk│                 │Disk│             │
            └────┘                 └────┘
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

The front end manages the cache in the client machine. It keeps track of objects read and written by the current transaction T at its client. Then the client requests a commit, the front end gathers the data relevant to T:

validation information: identity of each object used by T along with the type of access (read or write)
installation information: modified copies of objects

To commit a transaction, the front end sends its installation information and validation information to a server that is the owner of some of the objects used by the transaction. This servers commits the transaction unilaterally if it owns all objects used by the transaction. Otherwise, it acts as the coordinator of a commit protocol with the other owners, called participants. The protocol is a 2-phase protocol.

In phase 1, the coordinator sends prepare information messages containing the validation and installation information to the participants. Each participant tries to validate the transaction. If validation succeeds, the participant logs the installation information on stable storage and sends a positive response to the coordinator; otherwise it rejects the transaction. If all participants respond positively, the coordinator commits the transaction by logging a commit record on stable storage. Then it notifies the client of its decision.

In phase 2, the coordinator sends commit messages to the participants. On receiving a commit message, a participant installs new versions of its objects that were modified by that transaction (so that future fetches see the updates), logs a commit record on stable storage, and sends an acknowledgment to the coordinator.

If a transaction has not modified any object at some participant, the transaction is called read-only at the participant, and it does not need a commit message from the coordinator.

If a transaction has not modified any object at any participant, the transaction is called read-only, and it does not need any information stored on stable storage. The validation on phase 1 is still required, though.

After a read-write transaction for some client C has committed at a server, the server sends invalidation messages to clients other than C that are caching objects installed by the transaction.

The front ends send acknowledgments after processing the invalidation messages; when the server receives it, it removes the information about the invalidated objects from the client’s cached set.

An Efficient Validation Scheme

The scheme presented in the paper is called backward validation. It preserves the following consistency: a validating transaction T is checked against all transactions that have already validated successfully.

Global serialization

The paper defines serializability and external consistency as follows:

Serializability: the committed transactions can be placed in a total order, called the serialization order, such that the actual effect of running the transactions is the same as running them one at a time in that order.
External consistency: the serialization order is such that, if transaction S committed before T began (in real time), S is ordered before T.

Transactions are ordered by using timestamps from real clocks. The clocks are loosely synchronized which means they may differ by at most a small skew (i.e., a few tens of milliseconds)¹.

Each transaction T is assigned a timestamp T.ts by its coordinator when the coordinator receives the commit request from the client. The timestamp consists of the local time at the coordinator augmented with the coordinator’s server ID to make it globally unique: T.ts = {time, server-ID}. Timestamps are totally ordered by the time field, with ties resolved by the server ID². Transactions are serialized in timestamp order.

After assigning the timestamp the coordinator sends a prepare message to each participant P including the following validation information:

T.ts: the timestamp of T.
T.ReadSet: IDs of objects at P that were read by T.
T.WriteSet: IDs of objects at P that were modified by T.
The identity of the client where T was executed.

Transactions S and T conflict if one has modified an object that the other read or modified. Upon receiving the validation information, each participant performs checks to ensure that conflicting transactions can be serialized in timestamp order.

Each participant stores the validation information of all successfully validated transactions in a validation queue, or VQ. The validation information of validating transactions is checked against all records in the VQ to detect conflicts and ensure serializability and external consistency.

If an incoming transaction T fails a validation check against a prepared transaction S, the participant aborts T by sending a negative ack to the coordinator.

Checks Against Later Transactions

For each validated transaction S with timestamp later than T, it checks that T did not read any object that S modified, and that T did not modify any object that S read. This is called the later-conflict check.

Checks Against Earlier Transactions

For each validated transaction S with timestamp earlier than T:

S read object x and T modified x: no check is necessary because S could not have read the version of x written by T, since T is not yet committed.
S modified object x and T read x: we ensure that T read the version of x written by S or a later transaction. There are two cases:
1. If S is not committed, the check fails because T could not have read the result of an uncommitted transaction.
2. If S is committed, the outcome depends on the version T read. We call this the version check.

A simpler check is used instead. It is called the current-version check:

Check that T has read the latest installed version of x.

The server also maintains an invalid set for each client; the invalid set identifies those objects in the cached set that have been invalidated. To check whether a validating transaction read the latest version of an object, the participant checks the invalid set of the client where the transaction ran and rejects the transaction if it used an object in that invalid set.

Truncation

Information about read-write transactions that have not yet committed are never removed. Information about committed transactions are removed. A threshold timestamp is used to check which information can be removed. The threshold is guaranteed to be greater than or equal to the timestamps of all transactions whose information has been removed from the VQ.

The validation record is retained for all uncommitted read-write transactions and for all transactions with timestamps above the threshold.

The VQ may contain records for transactions with timestamps below the threshold. This happens if the transaction is uncommitted.

A transaction T timestamped below the threshold fails validation. This additional validation check is called the threshold check. The check is required because information necessary for the later-conflict has been discarded from the VQ. On the other hand, for a transaction that passes the threshold check, the earlier checks are sufficient.

The VQ is truncated periodically; at that point a new threshold is calculated, and then remove validation records for transactions whose timestamp is lower than the new threshold, provided they are committed or are read-only at this participant.

The following pseudocode represents all the checks:

// Threshold Check
If T.ts < Threshold then
    Send abort reply to coordinator
    
// Checks Against Earlier Transactions
For each uncommitted transaction S in VQ such that S.ts < T.ts
    If (S.WriteSet ∩ T.ReadSet != ∅) then
        Send abort reply to coordinator

// Current-Version Check
// Assume T ran at client C
For each object x in T.ReadSet
    If x ∈ C's invalid set then
        Send abort reply to coordinator
        
// Checks Against Later Transactions
// Later-Conflict Check
For each transaction S in VQ such that T.ts < S.ts
    If (T.ReadSet ∩ S.WriteSet != ∅)
    or (T.WriteSet ∩ S.ReadSet != ∅) then
        Send abort reply to coordinator

Crash Recovery

When a server recovers from a crash, it must ensure that transactions it validates after the crash are serializable with transactions it validated before crashing.

For read-write transactions, the validation record can be logged along with the installation information without causing any significant increase in the latency. Read-only transactions do not have any installation information, so logging validation information would increase their latency and that is why we do not log validation information for read-only transactions.

A stable threshold is maintained, its value is always later the timestamp of any transaction that the server has ever validated. On recovery the threshold is set to the logged value of stable threshold.

The stable threshold is updated whenever the timestamp of a validating transaction is later than its current value. Frequent updates to the stable threshold are avoided by using frequent jumps, e.g., setting it to one second ahead of the current clock time, so it does not need to be increased for a number of subsequent transactions.

The assumption is: an incoming transaction with a timestamp below the threshold conflicts with some transaction that committed before the crash, even though it might not be the case.

Cached sets are not maintained on stable storage. The server maintains the addresses of clients that have cached its objects. After a crash, the server communicates with the clients and rebuilds their cached sets.

Invalid sets are recovered from the combination of the installation information in the log and the recovered cached sets. When a transaction commits that causes invalidations, the server generates an invalidation number that is stored in the transaction’s commit record; later invalidation numbers are always greater than earlier ones. The invalidation number is included in the invalidation message, and the front ends remember the number of the last invalidation they have acknowledged for that server. This information is sent to the server along with the cached set; this allows the server to discard all entries from the invalid set that have been acknowledged by the front end.

References

Atul Adya, Robert Gruber, Barbara Liskov, and Umesh Maheshwari. Efficient Optimistic Concurrency Control Using Loosely Synchronized Clocks. In Proc. of the ACM SIGMOD Int’l Conference on Management of Data, San Jose, CA, May 1995.

It would be interesting to see what is the skew when using NTP nowadays.↩︎
How they choose the servers ID? Increasing integers? Some string following a certain pattern? Or something else?↩︎