Dynamo: Amazon’s Highly Available Key-Value Store

Dynamo is a highly available and scalable distributed data store built for Amazon’s platform. It is used to manage the state of services that require high reliability and tight control over the tradeoffs between availability, consistency, cost-effectiveness, and performance.

Examples of applications that use Dynamo are: shopping carts, best seller lists, customer preferences, session management, sales rank, and product catalog.

The techniques used to achieve high availability and scalability include partitioning and replicating the data with consistent hashing, and maintaining consistency via object versioning. Dynamo is a completely decentralized system that requires minimal manual administration: nodes can be added and removed, and the system automatically reorganizes itself without requiring an operator to perform tasks like data redistribution or repartitioning.

Design Considerations

Dynamo targets the design space of an “always writeable” data store. Conflicts are resolved by the application, since it can apply more context than the data store when reconciling conflicting versions.

Other key principles embraced by Dynamo are:

  1. Incremental scalability: the system should be able to scale out one node at a time with minimal impact.
  2. Symmetry: every node has the same set of responsibilities as its peers.
  3. Decentralization: favor peer-to-peer techniques over centralized control.
  4. Heterogeneity: the work distribution should be proportional to the capabilities of the individual servers.

System Architecture

System Interface

Dynamo stores objects associated with a key through a simple interface consisting of two operations: get() and put().

get(key) finds the object replicas associated with key and returns a single object or a list of objects with conflicting versions along with a context.

put(key, context, object) determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller, including the object’s version.

Both the key and the data are treated by Dynamo as opaque arrays of bytes. It applies an MD5 hash to the key to determine the storage nodes responsible for serving that key.
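
As a rough illustration, here is a minimal sketch (with made-up names; this is not Dynamo’s internal API) of how an opaque key can be mapped to a position on the MD5 ring:

```python
import hashlib

def ring_position(key: bytes) -> int:
    """Map an opaque key to a position on the 128-bit MD5 ring."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

# Both keys and values are just bytes to the store; only the key's hash
# decides which storage nodes are responsible for it.
print(ring_position(b"cart:alice"))
```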

Partitioning Algorithm

Dynamo uses consistent hashing for partitioning the data and distributing the load across multiple storage hosts. Each node is responsible for the region of the ring between it and its predecessor. An advantage of this technique is that when a node joins or leaves, only its immediate neighbors are affected.

The variant of consistent hashing used by Dynamo makes use of virtual nodes: instead of mapping each node to a single point in the ring, each node gets assigned to multiple points. Each physical node is therefore responsible for several virtual nodes in the ring (see the sketch after this list). Some of the advantages of using virtual nodes are:

  1. If a node becomes unavailable, the load it was handling is dispersed evenly across the remaining available nodes.
  2. When a node becomes available again, or a new node is added, it accepts a roughly equivalent amount of load from each of the other available nodes.
  3. The number of virtual nodes assigned to a node can be decided based on its capacity, accounting for heterogeneity in the infrastructure.
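
A minimal sketch of a consistent-hash ring with virtual nodes, assuming MD5 positions as above; the Ring class and the number of tokens per node are illustrative choices, not values from the paper:

```python
import bisect
import hashlib

def h(data: bytes) -> int:
    """Position on the 128-bit MD5 ring."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

class Ring:
    """Each physical node owns several virtual nodes (tokens) on the ring."""

    def __init__(self, nodes, tokens_per_node=8):
        pairs = sorted(
            (h(f"{node}#{i}".encode()), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self.positions = [p for p, _ in pairs]  # sorted token positions
        self.owners = [n for _, n in pairs]     # physical node owning each token

    def successor(self, key: bytes) -> str:
        """Physical node owning the first token clockwise from the key's position."""
        i = bisect.bisect_right(self.positions, h(key)) % len(self.positions)
        return self.owners[i]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.successor(b"cart:alice"))
```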

Replication

Dynamo replicates its data on multiple hosts for high availability and durability. Each data item is replicated at N hosts, where N is configured on a per-instance basis. The coordinator (a virtual node in the ring) is responsible for replicating the data items that fall within its range. In addition to locally storing each key within its range, it also replicates the data to the next N-1 clockwise successor nodes in the ring.

The list of nodes that is responsible for storing a particular key is called the preference list.
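
Continuing the Ring sketch above (again illustrative, not the paper’s code), a preference list can be built by walking the ring clockwise and skipping tokens owned by a physical node already in the list, so that it ends up containing N distinct hosts:

```python
import bisect

def preference_list(ring, key: bytes, n: int = 3) -> list:
    """First N distinct physical nodes clockwise from the key's position."""
    start = bisect.bisect_right(ring.positions, h(key)) % len(ring.positions)
    nodes = []
    for step in range(len(ring.positions)):
        owner = ring.owners[(start + step) % len(ring.positions)]
        if owner not in nodes:
            nodes.append(owner)
        if len(nodes) == n:
            break
    return nodes

print(preference_list(ring, b"cart:alice", n=3))
```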

Data Versioning

Dynamo provides eventual consistency, allowing updates to be propagated to all replicas asynchronously. The result of each modification is treated as a new and immutable version of the data, which means that the system might contain several different versions of an object at the same time.

Vector clocks are used to capture causality between different versions of the same object. When clients wish to update an object they need to provide the version they are updating; the context mentioned before is what carries this information. To keep vector clocks from growing without bound, Dynamo employs a truncation scheme: alongside each (node, counter) pair it stores a timestamp indicating the last time the data was updated at that node, and when the number of pairs reaches a threshold, the oldest pair is removed from the clock.
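
A toy vector clock with the truncation scheme described above; the class interface is invented for illustration, and the paper only suggests a small threshold (e.g. 10):

```python
import time

TRUNCATION_THRESHOLD = 10  # the paper suggests a small threshold, e.g. 10

class VectorClock:
    """(node, counter) pairs plus per-node timestamps used only for truncation."""

    def __init__(self):
        self.counters = {}    # node -> logical counter
        self.updated_at = {}  # node -> wall-clock time of the last update at that node

    def increment(self, node: str) -> None:
        self.counters[node] = self.counters.get(node, 0) + 1
        self.updated_at[node] = time.time()
        if len(self.counters) > TRUNCATION_THRESHOLD:
            # Drop the pair belonging to the node that updated the object least recently.
            oldest = min(self.updated_at, key=self.updated_at.get)
            del self.counters[oldest]
            del self.updated_at[oldest]

    def descends_from(self, other: "VectorClock") -> bool:
        """True if this version is causally newer than (or equal to) the other."""
        return all(self.counters.get(n, 0) >= c for n, c in other.counters.items())
```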

Execution of get() and put() operations

Both operations are invoked using Amazon’s infrastructure-specific request processing framework over HTTP. Clients can choose between two strategies for accessing a node:

  1. Route its request through a generic load balancer that will select a node based on load information.
  2. Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes.

The first approach does not require applications to link any code specific to Dynamo, while the second one does; the second approach, however, has the advantage of lower latency, since requests skip a potential forwarding step.

The node handling a read or write is known as the coordinator, and it typically is the first among the N nodes in the preference list. To ensure consistency among replicas, operations involve the first N available nodes in the preference list. The protocol is similar to a quorum-based model, with two configurable values, R and W: R is the minimum number of nodes that must participate in a successful read operation, and W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system¹.

By making the (N, R, W) tuple configurable by each application, Dynamo allows applications to tune the system for their needs and the required tradeoffs between consistency and performance. For example, an application with R set to 1 can achieve higher read performance, since only a single replica needs to be contacted to respond to read requests. But in this case consistency might be compromised, since that replica might hold an old version of the data.
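
A sketch of the quorum-style bookkeeping, assuming hypothetical replica objects with a get() method; the (3, 2, 2) configuration shown is the one the paper describes as common:

```python
def check_quorum(n: int, r: int, w: int) -> None:
    """R + W > N guarantees that read and write sets overlap in at least one node."""
    if r + w <= n:
        raise ValueError("R + W must exceed N for a quorum-like system")

check_quorum(n=3, r=2, w=2)  # the common configuration mentioned in the paper

def coordinate_read(replicas, key, r: int):
    """Ask the replicas for the key and return once R of them have answered.

    In a real system the requests go out in parallel; this loop is sequential
    purely to keep the sketch short.
    """
    responses = []
    for replica in replicas:
        responses.append(replica.get(key))
        if len(responses) >= r:
            break  # R responses are enough to complete the read
    return responses
```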

Handling Failures: Hinted Handoff

All read and write operations are executed on the first N healthy nodes from the preference list, which may not be the first N nodes.

When a node receives a replica intended for another node that is down, it places the replica in a separate local database that is scanned periodically. When it detects that the actual “owner” has come back up, it tries to send the data to its rightful owner, and if the transfer succeeds it deletes its local copy.
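
A rough sketch of this hinted-handoff bookkeeping; the class and method names are invented for illustration, and the transport is reduced to a deliver() callback:

```python
class HintedStore:
    """Keeps replicas accepted on behalf of a temporarily unreachable node."""

    def __init__(self):
        self.store = {}   # keys this node is normally responsible for
        self.hinted = {}  # intended_owner -> {key: value}, a separate local database

    def accept(self, key, value, intended_owner=None):
        if intended_owner is None:
            self.store[key] = value
        else:
            # The write was meant for a node that is currently down; keep it aside,
            # tagged with the identity of its rightful owner.
            self.hinted.setdefault(intended_owner, {})[key] = value

    def handoff(self, owner, deliver) -> None:
        """Called by the periodic scan once `owner` is detected to be back up."""
        pending = self.hinted.get(owner, {})
        for key, value in list(pending.items()):
            if deliver(owner, key, value):  # deliver() returns True on success
                del pending[key]            # drop the local copy after a successful transfer
```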

Handling permanent failures: Replica synchronization

To keep replicas synchronized, Dynamo implements an anti-entropy protocol. This protocol uses Merkle trees to detect inconsistencies between replicas and to minimize the amount of transferred data. The protocol works as follows: each node maintains a separate Merkle tree for each key range it hosts. Replicas exchange the roots of the Merkle trees corresponding to the key ranges they host in common. Traversing the trees and comparing hashes allows the replicas to detect differences and then synchronize accordingly.
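
A compact sketch of the Merkle-tree comparison, using MD5 digests and in-memory trees; real key ranges and the wire protocol are omitted, and all names are illustrative:

```python
import hashlib

def digest(*parts: bytes) -> bytes:
    md5 = hashlib.md5()
    for part in parts:
        md5.update(part)
    return md5.digest()

def build_tree(items):
    """Merkle tree over one key range: leaves hash (key, value), parents hash their children.

    Returns nested tuples: (hash, key) at a leaf, (hash, child, child) above it.
    """
    level = [(digest(k.encode(), v.encode()), k) for k, v in sorted(items)]
    while len(level) > 1:
        level = [
            (digest(*(node[0] for node in level[i:i + 2])), *level[i:i + 2])
            for i in range(0, len(level), 2)
        ]
    return level[0]

def diff(a, b, out):
    """Descend only into subtrees whose hashes differ; collect keys that need syncing.

    Assumes both trees cover the same key range and therefore have the same shape.
    """
    if a[0] == b[0]:
        return
    if isinstance(a[1], str):  # both are leaves
        out.update({a[1], b[1]})
        return
    for child_a, child_b in zip(a[1:], b[1:]):
        diff(child_a, child_b, out)

replica_a = build_tree({"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}.items())
replica_b = build_tree({"k1": "v1", "k2": "vX", "k3": "v3", "k4": "v4"}.items())
out = set()
diff(replica_a, replica_b, out)
print(out)  # {'k2'}: only the divergent key needs to be transferred
```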

Membership and Failure Detection

An administrator uses a command-line tool or a browser to send a command to a Dynamo node, joining a node to the ring or removing one from it.

A gossip-based protocol is used to propagate the membership changes and maintain an eventually consistent view of membership. Each node contacts a peer chosen randomly every second and the two nodes reconcile their persisted membership change histories.
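
A toy illustration of gossip-style reconciliation of membership histories; the record format (node, status, version) and the method names are assumptions for the sketch, not the paper’s data model:

```python
import random

class Member:
    """Persists a history of membership changes and gossips it to one random peer per round."""

    def __init__(self, name: str):
        self.name = name
        self.history = {}  # node name -> (status, version); higher versions win

    def record(self, node: str, status: str, version: int) -> None:
        current = self.history.get(node)
        if current is None or version > current[1]:
            self.history[node] = (status, version)

    def gossip_round(self, peers) -> None:
        """Reconcile membership histories with one randomly chosen peer."""
        peer = random.choice(peers)
        for node, (status, version) in list(self.history.items()):
            peer.record(node, status, version)
        for node, (status, version) in list(peer.history.items()):
            self.record(node, status, version)

a, b = Member("A"), Member("B")
a.record("C", "added", 1)
a.gossip_round([b])
print(b.history)  # B now knows that C was added
```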

External Discovery

To prevent logical partitions, some Dynamo nodes play the role of seeds². Seeds are nodes that are discovered via an external tool and are known to all nodes.

Failure Detection

Dynamo uses failure detection to avoid communication attempts with unreachable peers during get() and put() operations, and when transferring partitions and hinted replicas. The notion of failure is purely local: if server A tries to contact B and B does not respond, A considers B offline. This holds even if B is responding to requests from server C.

In this example, after concluding that B is unresponsive, server A will try to contact other servers that map to B’s partitions, and it will also periodically try to contact B again to see whether it has come back up. Note that servers that do not need to interact also do not need to know whether the other is alive.
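
A sketch of this purely local notion of failure, assuming a send() callback that raises ConnectionError on timeout; the retry interval is an arbitrary placeholder:

```python
import time

RETRY_AFTER = 1.0  # arbitrary placeholder; the paper does not prescribe an interval

class PeerView:
    """A node's local view of which peers respond to *its* requests."""

    def __init__(self):
        self.down_since = {}  # peer -> time we last failed to reach it

    def request(self, peer, send):
        # Skip peers recently marked down, but probe them again after a while
        # to notice when they come back up.
        if peer in self.down_since and time.time() - self.down_since[peer] < RETRY_AFTER:
            return None
        try:
            reply = send(peer)              # assumed transport callback
        except ConnectionError:
            self.down_since[peer] = time.time()
            return None                     # the caller falls back to another replica
        self.down_since.pop(peer, None)     # the peer answered us, so it is "up" for us
        return reply
```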

Notes


  1. In this model, the latency of an operation is dictated by the slowest of the R or W replicas involved. To provide better latency, these values are usually configured to be less than N.↩︎

  2. Interestingly, at the beginning of the paper they say that one key design principle of the system is that no node should have a distinguished or special role, and yet here we are.↩︎