TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems

What is it?

TreadMarks is a distributed shared memory (DSM) system for standard Unix systems¹.

The authors mention that networks of workstations are being used as parallel computers with increasing frequency. The reason is that the performance of such clusters is not far from that of supercomputers, while the gap in price is still large. An interesting quote from the paper is:

We expect that the workstation cluster approach to parallel computing will gain further popularity, as advances in networking continue to improve its cost/performance ratio.

The systems being built today confirm this expectation; note that the paper was written in 1994.

The authors mention that tuple spaces, distributed shared memory, and message passing are some examples of systems built to support parallel computation on workstation networks. DSM allows processes on different machines to share memory, without requiring the machines to physically share memory. An illustration about this model is as follows:

       ┌──────────┐   ┌──────────┐   ┌──────────┐        ┌──────────┐
       │  Proc1   │   │  Proc2   │   │  Proc3   │   ...  │  ProcN   │
       └───┬─┬────┘   └───┬─┬────┘   └───┬─┬────┘        └───┬─┬────┘
           │█│            │█│            │█│                 │█│
           │█│            │█│            │█│                 │█│
┌ ─ ─ ─ ─ ─│█│─ ─ ─ ─ ─ ─ ┤█├ ─ ─ ─ ─ ─ ─│█│─ ─ ─ ─ ─ ─ ─ ─ ─│█│─ ─ ─ ─ ─ ─ ┐
           │█│            │█│            │█│                 │█│
│         ┌┴─┴─┐         ┌┴─┴─┐         ┌┴─┴─┐              ┌┴─┴─┐          │
          │Mem1│         │Mem2│         │Mem3│      ...     │MemN│
│         └─┬┬─┘         └─┬┬─┘         └─┬┬─┘              └─┬┬─┘          │
            ││             ││             ││                  ││
│           ││             ││             ││                  ││            │
            ││             ││             ││                  ││
│           ││             ││             ││                  ││            │
            ││             ││             ││                  ││
│ ┌─────────┴┴─────────────┴┴─────────────┴┴──────────────────┴┴──────────┐ │
  └───────────────────────────────────────────────────────────────────────┘
│                                Network                                    │

│                                                                           │
   Shared Memory
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

The problems with DSM mentioned by the authors include not being available on standard operating systems and requiring kernel modifications. Another problem is the poor performance of previously built DSM systems. The authors suggest that this was caused by attempts to mimic hardware consistency guarantees by imitating the consistency protocols used by hardware shared-memory multiprocessors.

TreadMarks is an attempt to overcome most of these problems. It runs on commonly available Unix systems and is implemented entirely at the user level. It does not require any particular compiler, because its implementation relies on (user-level) memory management techniques to detect accesses and updates to shared data. Another focus of TreadMarks was reducing the amount of communication necessary to keep the distributed memories consistent; to that end, it implements what the authors call lazy release consistency.

Design

As mentioned above, the focus was to reduce the number of messages exchanged to maintain memory consistency. The release consistent memory model is the result.

Release Consistency

Release Consistency (RC) is a relaxed memory consistency model that allows a processor to delay making its changes to shared memory visible to other processors until certain synchronization accesses occur. Shared memory accesses can be categorized as either ordinary or synchronization accesses. Synchronization accesses can be sub-categorized as either acquire or release accesses, roughly corresponding to synchronization operations on a lock. Other mechanisms can be built on top of this model as well.

Essentially, RC requires ordinary shared memory updates by a processor p to become visible at another processor q, only when a subsequent release by p becomes visible at q.

An advantage of RC is that the memory updates don’t need to be visible immediately, requiring less communication. The propagation of modifications can be postponed until the next synchronization operation takes effect.

Lazy Release Consistency

In lazy release consistency (LRC), the propagation of modifications is postponed until the time of the acquire. At this time, the acquiring processor determines which modifications it needs to see according to the definition of RC.

The execution of each process is divided into intervals, each denoted by an interval index. Every time a process executes an acquire or a release, a new interval begins and the interval index is incremented. TreadMarks assigns a vector timestamp to each interval, containing an entry for each processor. This allows the representation of a partial ordering between the intervals of different processes.
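The interval and vector timestamp machinery can be sketched as follows. This is a toy Python model with invented names, not TreadMarks's actual C implementation; it only illustrates the partial order on timestamps:

```python
# Hypothetical sketch of TreadMarks-style vector timestamps (names invented).
# One entry per processor; a timestamp is "smaller" than another only if
# every entry is <= and the two are not equal, giving a partial order.

class VectorTimestamp:
    def __init__(self, entries):
        # entries[i] = highest interval index seen from processor i
        self.entries = list(entries)

    def __le__(self, other):
        return all(a <= b for a, b in zip(self.entries, other.entries))

    def __lt__(self, other):
        return self <= other and self.entries != other.entries

# Concurrent intervals are incomparable: neither dominates the other.
t1 = VectorTimestamp([2, 0, 1])
t2 = VectorTimestamp([1, 3, 1])
assert not (t1 <= t2) and not (t2 <= t1)   # concurrent

t3 = VectorTimestamp([2, 3, 1])
assert t1 < t3 and t2 < t3                  # t3 happened after both
```

Because the order is only partial, two intervals from different processors that were not ordered by any synchronization remain incomparable, which is exactly what lets LRC avoid propagating unrelated updates.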

RC requires that before a processor p may continue past an acquire, the updates of all intervals with a smaller vector timestamp than p’s current vector timestamp must be visible at p. So, at an acquire, p sends its current vector timestamp to the previous releaser q. q then piggybacks write notices on the release-acquire message to p for all intervals named in q’s current vector timestamp but not in the vector timestamp it received from p.

A write notice is an indication that a page has been modified in a particular interval; it does not contain the actual modifications. When a write notice arrives, TreadMarks invalidates the processor’s copy of that page, and the modifications are propagated only on a subsequent access to that page.
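The acquire-time exchange can be sketched like this. It is a toy model with invented names and data shapes, assuming intervals are tracked per processor; note that applying a write notice only invalidates a page, it does not transfer data:

```python
# Toy sketch of the acquire-time exchange (names and shapes invented).
# The acquirer sends its vector timestamp; the releaser replies with
# write notices for every interval the acquirer has not yet seen.

def missing_write_notices(releaser_intervals, acquirer_ts):
    """releaser_intervals: {proc_id: [(interval_index, modified_pages), ...]}
    acquirer_ts: {proc_id: highest interval index the acquirer has seen}."""
    notices = []
    for proc, intervals in releaser_intervals.items():
        for idx, pages in intervals:
            if idx > acquirer_ts.get(proc, -1):  # interval unknown to acquirer
                notices.extend((proc, idx, page) for page in pages)
    return notices

def apply_write_notices(valid_pages, notices):
    # A write notice just invalidates the local copy of the named page;
    # the actual diffs are fetched later, on an access miss.
    for _, _, page in notices:
        valid_pages.discard(page)

releaser = {0: [(0, ["p1"]), (1, ["p2"])], 1: [(0, ["p1"])]}
acquirer_ts = {0: 0, 1: 0}      # acquirer already saw interval 0 of both
valid = {"p1", "p2", "p3"}
apply_write_notices(valid, missing_write_notices(releaser, acquirer_ts))
assert valid == {"p1", "p3"}    # only p2 was named in an unseen interval
```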

An example to illustrate LRC is the following: consider variables x and y that are on the same page and processors P1, P2, and P3. Each processor executes the following operations:

P1: a1 x = 2 r1
P2:              a2 y = 3 r2
P3:                           a1 print x r1

What LRC does is that P3 only asks the previous holder of lock 1 for write diffs. If the operation were print x, y, P3 would print a stale value for y even though both variables are on the same page, because P3 never synchronized with P2. The advantage is that less information flows through the network, since information about y would be useless to P3; the cost is that the programmer must be careful to acquire the proper locks to avoid reading stale data. TreadMarks makes use of vector timestamps to overcome this problem: a processor sends its current vector timestamp to the previous lock holder, and the previous holder replies with write notices for all intervals the acquirer has not seen (including updates to variables and pages the new lock holder doesn’t care about).

Multiple-Writer Protocols

False sharing occurs when two or more processors access different variables within a page, with at least one of the accesses being a write. A write to any variable of a page causes the entire page to become invalid at all other processors that cache the page. This leads to an access miss when any of these processors tries to access the page, and the modified page is then transported over the network. This is a problem because the original page would have sufficed: the access was to a different variable than the one that was written.

Lazy Diff Creation

To address the problem of false sharing, TreadMarks creates diffs lazily. A shared page is initially write-protected. At the first write, a protection violation occurs. The DSM software makes a copy of the page (a twin) and removes the write protection so that further writes to the page can occur without any DSM intervention. The twin and current copy can later be compared to create a diff. TreadMarks creates the diffs only when a processor requests the modifications to a page or a write notice from another processor arrives for that page.
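The twin-and-diff mechanism can be sketched in a few lines. This is a minimal model, with page contents represented as bytes (real TreadMarks compares the twin and the dirty page word by word at the memory level):

```python
# Minimal sketch of twin/diff creation (page modeled as a byte buffer).

def make_twin(page):
    return bytes(page)          # copy taken at the first write fault

def make_diff(twin, page):
    # Record (offset, new_byte) for every position that changed.
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    for offset, value in diff:
        page[offset] = value

page = bytearray(b"hello world")
twin = make_twin(page)
page[0:5] = b"HELLO"            # writes proceed with no DSM intervention
diff = make_diff(twin, page)

other_copy = bytearray(b"hello world")
apply_diff(other_copy, diff)
assert other_copy == page       # the diff brings a stale copy up to date
assert len(diff) == 5           # only the modified bytes cross the network
```

The payoff for false sharing is visible in the last assertion: two processors writing disjoint parts of the same page produce non-overlapping diffs, which can both be applied to a third copy without shipping whole pages back and forth.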

Implementation

Data Structures

The main data structures are the PageArray, with one entry for each shared page, the ProcArray, with one entry for each processor, a set of interval records (containing mainly the vector timestamp for that interval), a set of write notice records, and a diff pool. Each entry in the PageArray contains:

  1. The current state: no access, read-only access, or read-write access.
  2. An approximate copyset specifying the set of processors that are believed to currently cache this page.
  3. An array, indexed by processor, of head and tail pointers to a linked list of write notice records corresponding to write notices received from that processor for this page.

Each entry in ProcArray contains a pointer to the head and the tail of a doubly linked list of interval records, representing the intervals of that processor that the local processor knows about.
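A rough sketch of this bookkeeping, with field names paraphrased from the paper's description (the real implementation uses C structs and intrusive linked lists rather than Python containers):

```python
# Hypothetical Python rendering of the main TreadMarks data structures.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class PageState(Enum):
    NO_ACCESS = 0
    READ_ONLY = 1
    READ_WRITE = 2

@dataclass
class WriteNoticeRecord:
    interval_index: int
    diff: Optional[bytes] = None        # filled in lazily, when requested

@dataclass
class IntervalRecord:
    vector_timestamp: list              # one entry per processor

@dataclass
class PageEntry:
    state: PageState = PageState.NO_ACCESS
    copyset: set = field(default_factory=set)   # approximate caching set
    # write notice records received from each processor for this page
    write_notices: dict = field(default_factory=dict)

@dataclass
class ProcEntry:
    # intervals of that processor known to the local processor
    intervals: list = field(default_factory=list)

page_array = [PageEntry() for _ in range(4)]    # one entry per shared page
proc_array = [ProcEntry() for _ in range(3)]    # one entry per processor
```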

The figure below gives an overview of the data structures used:

Locks

All locks have a statically assigned manager. Lock management is assigned in a round-robin fashion among the processors.

The lock acquire request contains the current vector timestamp of the acquiring processor. The lock request arrives at the processor that either holds the lock or did the last release on it, possibly after forwarding by the lock manager. When the lock is released, the releaser “informs” the acquirer of all intervals between the vector timestamp in the acquirer’s lock request message and the releaser’s current vector timestamp.

Barriers

Barriers have a centralized manager. At barrier arrival, each client “informs” the barrier manager of its vector timestamp and all of the client’s intervals between the last vector timestamp of the manager that the client is aware of and the client’s current vector timestamp. When the manager arrives at the barrier, it “incorporates” these intervals into its data structures. When all barrier arrival messages have been received, the manager then “informs” all clients of all intervals between their vector timestamp, as received in their barrier arrival message, and the manager’s current vector timestamp. The clients then “incorporate” this information as before. As for locks, incorporating this information invalidates the pages for which write notices were received.
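The two phases of the barrier exchange can be sketched with an elementwise-max merge of vector timestamps. This is a toy model (invented names, intervals reduced to their timestamps) of the arrival and departure messages described above:

```python
# Toy sketch of the barrier exchange: arrival phase merges every client's
# view into the manager; departure phase pushes the merged view back out.

def merge(ts_a, ts_b):
    # Elementwise max: the merged timestamp dominates both inputs.
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

manager_ts = [0, 0, 0]
client_ts = [[2, 0, 0], [0, 1, 0], [0, 0, 3]]

# Arrival: the manager incorporates the intervals each client sends.
for ts in client_ts:
    manager_ts = merge(manager_ts, ts)
assert manager_ts == [2, 1, 3]

# Departure: every client incorporates the manager's merged view, so all
# processors leave the barrier with identical vector timestamps.
client_ts = [merge(ts, manager_ts) for ts in client_ts]
assert all(ts == [2, 1, 3] for ts in client_ts)
```

This is why a barrier is a natural global synchronization point: afterwards, every processor has seen (at least a write notice for) every interval that preceded the barrier.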

Access Misses

If the faulting processor does not have a copy of the page, it requests a copy from a member of the page’s approximate copyset.

If write notices are present for the page, the faulting processor obtains the missing diffs and applies them to the page. If processor p has modified a page during interval i, then p must have all the diffs of all intervals (including those from processors other than p) that have a smaller vector timestamp than i.

After the set of necessary diffs and the set of processors to query have been determined, the faulting processor sends out requests for the diffs in parallel. When all necessary diffs have been received, they are applied in increasing vector timestamp order.
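The final step of the miss handler can be sketched as follows. This is not the paper's code; it assumes diffs are tagged with their interval's vector timestamp, and sorts by the sum of the entries, which is one valid linear extension of the vector timestamp partial order (if interval a happened before b, every entry of a's timestamp is at most b's, so a's sum is strictly smaller and a sorts first):

```python
# Minimal sketch of applying diffs at an access miss, in increasing
# vector-timestamp order, so later writes overwrite earlier ones.

def apply_diffs_in_order(page, diffs):
    # diffs: list of (vector_timestamp, [(offset, new_byte), ...])
    for _, changes in sorted(diffs, key=lambda d: sum(d[0])):
        for offset, value in changes:
            page[offset] = value

page = bytearray(b"aaaa")
diffs = [
    ([1, 1], [(0, ord("z"))]),               # later interval: wins at offset 0
    ([1, 0], [(0, ord("x")), (1, ord("y"))]),
]
apply_diffs_in_order(page, diffs)
assert page == bytearray(b"zyaa")
```

Concurrent intervals get an arbitrary but consistent relative order under this scheme; that is harmless under the multiple-writer protocol, because concurrent diffs touch disjoint parts of the page.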

Garbage Collection

During garbage collection, each processor validates its copy of every page that it has modified. All other pages, all interval records, all write notice records and all diffs are discarded.

Garbage Collection is triggered when the amount of free space for consistency information drops below a threshold. An attempt is made to make garbage collection coincide with a barrier, since many of the operations are similar.

Notes


  1. The paper mentions SunOS and Ultrix as examples of standard Unix systems. Both have since been discontinued.

Carlos Galdino
@carlosgaldino
github.com/carlosgaldino
blog.carlosgaldino.com