
PNUTS: Yahoo!’s Hosted Data Serving Platform

PNUTS is a system built by Yahoo! to support their web applications. Its main features are described below.

Data and Query Model

Data is organized into tables of records with attributes. A blob is also a valid data type, allowing arbitrary structures to be stored in a record. The schema is flexible: new attributes can be added at any time without locking the entire table, and records need not have every attribute set.

The query language allows selection and projection from a single table. Updates and deletes must specify the primary key. Tables can be hashed or ordered; the application chooses which flavor it wants. Scans can be performed by specifying predicates that are evaluated at the server. PNUTS also provides a way to retrieve multiple records from one or more tables in a single request.
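The record and query model above can be sketched as a minimal in-memory table. This is an illustrative stand-in, not the real PNUTS API: the `Table` class and its method signatures are assumptions chosen to mirror the operations described in the text.

```python
# Minimal sketch of PNUTS-style table operations (names are illustrative):
# records are attribute dictionaries keyed by primary key, and scans
# evaluate a caller-supplied predicate "at the server".

class Table:
    def __init__(self):
        self.records = {}          # primary key -> {attribute: value}

    def set(self, key, attrs):
        # Flexible schema: records need not share the same attributes.
        self.records.setdefault(key, {}).update(attrs)

    def get(self, key):
        return self.records.get(key)

    def delete(self, key):
        # Deletes must specify the primary key.
        self.records.pop(key, None)

    def scan(self, predicate, projection=None):
        # Selection + projection over a single table.
        for key, rec in self.records.items():
            if predicate(rec):
                yield key, {a: rec[a] for a in (projection or rec)}

t = Table()
t.set("alice", {"city": "NYC", "age": 30})
t.set("bob", {"city": "SF"})                  # "age" never set: allowed
hits = dict(t.scan(lambda r: r.get("city") == "NYC", projection=["city"]))
```

Here `hits` contains only the projected `city` attribute of the matching record.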

Consistency Model

PNUTS provides a consistency model that lies between general serializability and eventual consistency. A per-record timeline consistency is provided: all replicas apply the updates in the same order.

The figure below illustrates a possible scenario of operations on a given record:

        Insert  Update  Update  Delete     Insert Update
           │       │       │       │          │      │
           │       │       │       │          │      │
           │       │       │       │          │      │
           ▼       ▼       ▼       ▼          ▼      ▼
◀───────────────────────────────────────────────────────────────────▶
   v. 1.0   v. 1.1   v. 1.2  v. 1.3   v. 2.0   v. 2.1   v. 2.2

A read from any replica will return a consistent version of this timeline, and replicas always move forward in the timeline. This is done by using a per-record master that is responsible for executing all updates on a given record. The record has a version number that is incremented each time an update is executed.
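The per-record master and version counter can be modeled in a few lines. The class names below are assumptions; the point is only that the master serializes updates into one numbered timeline, and a replica applies versions strictly in order, so its state is always some prefix of that timeline.

```python
# Sketch of per-record timeline consistency: a single per-record master
# stamps each update with an increasing version; replicas only ever move
# forward along the timeline, never backward.

class RecordMaster:
    def __init__(self):
        self.version = 0
        self.log = []                      # (version, value) in commit order

    def write(self, value):
        self.version += 1
        self.log.append((self.version, value))
        return self.version

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Apply only the next version in the timeline; ignore anything else.
        if version == self.version + 1:
            self.version, self.value = version, value

master, replica = RecordMaster(), Replica()
for v in ["a", "b", "c"]:
    master.write(v)
for entry in master.log:                   # deliver updates in commit order
    replica.apply(*entry)
```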

Based on this, PNUTS provides (via its API) operations with different consistency levels: read-any (may return a stale version), read-critical(required_version) (returns a version at least as new as the one given), read-latest, write, and test-and-set-write(required_version) (the write is applied only if the record is still at the given version).

Note that there are no guarantees of consistency for multi-record transactions.
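One of the consistency-level calls described in the PNUTS paper, test-and-set-write(required_version), applies a write only if the record is still at the version the caller previously read. The sketch below shows the classic retry loop built on it; the `Store` class is an assumed single-record stand-in, not the real API.

```python
# Hedged sketch of a test-and-set-write retry loop. The conditional write
# serializes concurrent read-modify-write cycles on a single record.

class Store:
    def __init__(self):
        self.version, self.value = 0, 0

    def read_latest(self):
        return self.version, self.value

    def test_and_set_write(self, required_version, value):
        if self.version != required_version:
            return False                     # another writer got there first
        self.version, self.value = self.version + 1, value
        return True

def increment(store):
    while True:                              # retry until our version wins
        version, value = store.read_latest()
        if store.test_and_set_write(version, value + 1):
            return

s = Store()
for _ in range(3):
    increment(s)
```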

PNUTS also provides a notification feature where users can subscribe to the streams of updates in a table.

System Architecture

The system is divided into regions, where each region contains a full complement of the system components and a complete copy of each table. Regions may be geographically distributed, but this is not mandatory. The PNUTS system architecture is as follows:

              Region 1                                    Region 2
┌─────────────────────────────────┐        ┌─────────────────────────────────┐
│  Tablet controller    Routers   │        │  Tablet controller    Routers   │
│                                 │        │                                 │
│  Storage units                  │        │  Storage units                  │
└────────────────┬────────────────┘        └────────────────┬────────────────┘
                 │                                          │
                 └────────────── Message Broker ────────────┘

Data Storage and Retrieval

Data tables are horizontally partitioned into groups of records called tablets. Tablets are scattered across many servers; each server might have hundreds or thousands of tablets, but each tablet is stored on a single server within a region.

The figure below shows how the intervals are placed for a range of primary keys:

                             Ordered table with a primary key of type STRING


           ┌────────┐          ┌────────┐         ┌────────┐         ┌────────┐         ┌────────┐
MIN_STRING │Tablet 1│ "banana" │Tablet 2│ "grape" │Tablet 3│ "lemon" │Tablet 4│ "peach" │Tablet 5│ MAX_STRING
           └────────┘          └────────┘         └────────┘         └────────┘         └────────┘
              SU 3                SU 1               SU 1               SU 2               SU 1




                               Hash table with primary key of type STRING
                             Tablet boundaries defined by `Hash(PrimaryKey)`

           ┌────────┐          ┌────────┐         ┌────────┐         ┌────────┐         ┌────────┐
  0x0000   │Tablet 1│  0x102F  │Tablet 2│  0x4A44 │Tablet 3│  0x943D │Tablet 4│  0xA443 │Tablet 5│   0xFFFF
           └────────┘          └────────┘         └────────┘         └────────┘         └────────┘
              SU 3                SU 1               SU 1               SU 2               SU 1
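The interval mapping in the ordered-table figure is essentially a sorted list of split keys, and locating a record is a binary search over it. The sketch below hard-codes the boundaries and storage-unit assignments from the figure; the lookup function itself is an assumption about how such a mapping could be consulted.

```python
# Sketch of an interval-mapping lookup for the ordered table above.
from bisect import bisect_right

boundaries = ["banana", "grape", "lemon", "peach"]   # tablet split points
storage_units = [3, 1, 1, 2, 1]                      # SU per tablet, in order

def locate(key):
    # bisect_right finds which interval the key falls into.
    tablet = bisect_right(boundaries, key)           # tablet index 0..4
    return tablet + 1, storage_units[tablet]         # (tablet number, SU)
```

For example, "coconut" sorts between "banana" and "grape", so it lands in Tablet 2 on SU 1. The hashed flavor works the same way, with `Hash(PrimaryKey)` substituted for the raw key.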

Three components are responsible for managing and providing access to data tablets: the storage unit, the router, and the tablet controller. Storage units store tablets and respond to get(), scan(), and set() operations. Updates are committed by first writing them to the message broker.

The router is responsible for routing requests to the appropriate storage unit containing the desired record. It finds the tablet containing the record and then the storage unit that contains that tablet.

The routers only have a cached copy of the interval mapping. The tablet controller owns the mappings, and the routers periodically poll the tablet controller to fetch any changes. The tablet controller also determines when a tablet should be moved to a different storage unit for load balancing or recovery reasons.
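The controller-owned mapping and the router's stale cache can be sketched as follows. The class names and the staleness check are assumptions: here a misdirected request is modeled as a direct comparison against the controller's mapping, after which the router re-fetches the whole mapping.

```python
# Sketch of the tablet controller owning the mapping while routers serve
# from a cached copy that may go stale after a tablet move.

class TabletController:
    def __init__(self, mapping):
        self.mapping = dict(mapping)          # tablet id -> storage unit

    def move(self, tablet, new_su):           # load balancing or recovery
        self.mapping[tablet] = new_su

class Router:
    def __init__(self, controller):
        self.controller = controller
        self.cache = dict(controller.mapping)  # cached copy of the mapping

    def route(self, tablet):
        su = self.cache[tablet]
        # Stand-in for a misdirected request failing at the storage unit:
        # on detecting staleness, re-poll the controller and retry.
        if self.controller.mapping[tablet] != su:
            self.cache = dict(self.controller.mapping)
            su = self.cache[tablet]
        return su

tc = TabletController({1: "SU-3", 2: "SU-1"})
r = Router(tc)
tc.move(2, "SU-2")                            # controller relocates tablet 2
```

After the move, the router's next lookup for tablet 2 refreshes its cache and routes to SU-2.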

Replication and Consistency

One replica is designated as the master for a given record, and all updates go first to the master. If an update arrives at a non-master replica, that replica forwards the request to the master replica for execution. Records in the same table might have different masters.

To propagate changes to the other replicas, the master publishes them to the message broker, in this case the Yahoo! Message Broker (YMB). All updates to a record go through the same broker, so they are delivered to the replicas in commit order.

YMB guarantees that messages are not lost, and published messages are delivered to all topic subscribers even if a broker fails.
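The publish/subscribe propagation path can be sketched with a toy broker. This is a deliberately simplified model of YMB, not its API: delivery here is synchronous and in publish order, which is the commit-order property the text relies on.

```python
# Sketch of master -> broker -> replicas propagation (assumed simplified
# model of YMB): updates published in commit order are applied by every
# subscribed replica in that same order.

class Broker:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, replica):
        self.subscribers.append(replica)

    def publish(self, update):
        for replica in self.subscribers:      # delivered to all subscribers
            replica.apply(update)

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, version, value = update
        self.state[key] = (version, value)

broker = Broker()
replicas = [Replica(), Replica()]
for rep in replicas:
    broker.subscribe(rep)
for version, value in enumerate(["x", "y"], start=1):
    broker.publish(("k", version, value))     # master publishes in commit order
```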

Inserts of records are forwarded to a tablet master so that all inserts are done in the same storage unit and thus enforcing primary key constraints.

To recover from failures, the tablet controller requests a copy from a remote replica. It first publishes a "checkpoint message" into YMB to guarantee that any in-flight updates have been applied at the source tablet, and only then is the source tablet copied to the destination.
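The recovery sequence above can be sketched as: flush in-flight updates at the source (the effect of the checkpoint message), then copy. The classes and the `drain` step are assumptions standing in for the checkpoint mechanics.

```python
# Sketch of the tablet recovery protocol: the checkpoint message forces the
# source to apply any in-flight updates before its data is copied, so the
# destination replica starts from a complete state.

class SourceTablet:
    def __init__(self, data, pending):
        self.data = dict(data)
        self.pending = list(pending)       # in-flight updates still in YMB

    def drain(self):
        # Stand-in for the effect of the checkpoint message arriving.
        for key, value in self.pending:
            self.data[key] = value
        self.pending = []

def recover(source):
    source.drain()                         # checkpoint: apply in-flight updates
    return dict(source.data)               # then copy source to destination

src = SourceTablet({"a": 1}, pending=[("b", 2)])
dest = recover(src)
```

Without the checkpoint step, the copy could miss the pending `("b", 2)` update.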

Other Database System Functionality

The notification feature mentioned before is provided by allowing clients to subscribe to internal messages of PNUTS.

Hosted Database Service

PNUTS is a hosted, centrally managed database service shared by multiple applications. To add capacity, operators simply add more servers; the system then adapts by automatically shifting some load onto them.

References

Brian F. Cooper et al. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proceedings of the VLDB Endowment 1(2), 2008.

Carlos Galdino
@carlosgaldino
github.com/carlosgaldino
blog.carlosgaldino.com