The Tail at Scale

In moderate-size systems, temporary high-latency episodes are unimportant, but at large scale they can come to dominate overall service performance. The article discusses techniques for building large online services that remain predictably responsive even though they are composed of less-predictable parts. Such systems are called “latency tail-tolerant,” or simply “tail-tolerant.”

Why Does Variability Exist?

High tail latency in individual components is caused by variability in response time, which can arise for many reasons, including:

- Shared resources: applications contend for shared local resources such as CPU cores, processor caches, memory bandwidth, and network bandwidth.
- Daemons: background daemons use only limited resources on average, but can generate multi-millisecond hiccups when scheduled.
- Global resource sharing: applications on different machines contend for global resources such as network switches and shared file systems.
- Maintenance activities: background activities such as data reconstruction in distributed file systems, periodic log compaction, and periodic garbage collection cause periodic latency spikes.
- Queueing: multiple layers of queueing, in intermediate servers and in network switches, amplify this variability.

Increased variability can also arise from hardware trends:

- Power limits: modern CPUs can run above their average power envelope for short bursts, but must then throttle, slowing execution.
- Garbage collection: solid-state drives are fast on average, but periodic internal garbage collection can sharply raise read latency.
- Energy management: power-saving modes in many devices save considerable energy, but add latency when transitioning from inactive to active mode.

Component-Level Variability Amplified By Scale

A common technique for reducing latency in large-scale systems is to parallelize sub-operations across many machines, where each sub-operation is co-located with its portion of a large dataset. Parallelization is achieved by fanning out the requests to leaf servers and merging the responses via a request-distribution tree.

At the service level, this variability is amplified: if even one sub-request is handled by a component that is experiencing problems, the merged response must wait for that slow server, so a rare component-level hiccup becomes a common service-level delay.
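The fan-out/merge pattern can be sketched as follows; `query_leaf` and `LEAF_SERVERS` are hypothetical stand-ins for a real RPC layer and server list, and the root simply waits for every leaf before merging:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical leaf servers, each holding one shard of the dataset.
LEAF_SERVERS = ["leaf-0", "leaf-1", "leaf-2", "leaf-3"]

def query_leaf(server, query):
    # Placeholder for an RPC to one leaf; returns that shard's hits.
    return [f"{query}@{server}"]

def fan_out(query):
    # The root fans the query out to every leaf in parallel...
    with ThreadPoolExecutor(max_workers=len(LEAF_SERVERS)) as pool:
        futures = [pool.submit(query_leaf, s, query) for s in LEAF_SERVERS]
        # ...and merges all responses: the slowest leaf determines
        # the overall latency of the request.
        return [hit for f in futures for hit in f.result()]
```

Because the root blocks on every future, a single slow leaf delays the entire merged response, which is exactly how component-level variability is amplified by scale.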

Reducing Component Variability

Some small engineering decisions can improve response-time variability and help ensure that requests are serviced in a timely manner. Examples of such decisions include the following:

- Differentiating service classes and higher-level queuing: prefer interactive requests over non-interactive ones, and keep low-level queues short so that higher-level scheduling policies take effect sooner.
- Reducing head-of-line blocking: break long-running requests into a sequence of smaller requests so that short, cheap requests can be interleaved with them.
- Managing background activities and synchronized disruption: throttle background tasks, break heavyweight operations into smaller pieces, and run them during periods of low load; in large fan-out services it can even help to synchronize background activity across machines, so it causes a brief simultaneous spike rather than a constant background level of variability.
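As one illustration, differentiating service classes (a decision the article describes) amounts to serving interactive requests ahead of background work. A minimal in-process sketch using a priority queue, with invented request names:

```python
import queue

# Two hypothetical service classes: interactive traffic outranks batch.
INTERACTIVE, BATCH = 0, 1

requests = queue.PriorityQueue()
requests.put((BATCH, "rebuild-index"))
requests.put((INTERACTIVE, "user-search"))
requests.put((BATCH, "compact-logs"))

# Dequeue order honors the service class, not arrival order:
# interactive work is served before any batch work.
order = [requests.get()[1] for _ in range(3)]
```

In a real system the same idea applies at the RPC and disk-scheduler level, not just in a single process.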

Living with Latency Variability

The techniques shown above are essential for building high-performance interactive services, but are not enough to eliminate all latency variability.

Within-Request Short-Term Adaptations

Within a single higher-level request, the following techniques can be applied to reduce latency variability:

- Hedged requests: send the request to one replica; if no response arrives within a short delay (for instance, the 95th-percentile expected latency), send a secondary request to another replica, use whichever response arrives first, and cancel the outstanding one. Deferring the second request bounds the extra load to a few percent while substantially shortening the latency tail.
- Tied requests: enqueue the request on two servers at once, with each copy “tied” to the other; when one server begins processing its copy, it sends a cancellation to its counterpart, avoiding most duplicated work.

An alternative to both schemes above is to probe the servers first and dispatch the request to the least-loaded one. Bear in mind, however, that load levels can change between the time of the probe and the moment the request arrives.
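A hedged request can be sketched in a few lines; `call_replica` and the latencies are invented, and a real implementation would also cancel the losing request:

```python
import concurrent.futures as cf
import time

def call_replica(name, latency_s, query):
    # Placeholder RPC; the sleep simulates that replica's latency.
    time.sleep(latency_s)
    return f"{query} answered by {name}"

def hedged_request(query, hedge_after_s=0.05):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        # Send the request to the primary replica first.
        primary = pool.submit(call_replica, "replica-a", 0.5, query)
        done, _ = cf.wait([primary], timeout=hedge_after_s)
        if done:
            return primary.result()  # fast enough: no hedge needed
        # No answer within the hedge delay (e.g. the 95th-percentile
        # expected latency): send a secondary request to another
        # replica and take whichever reply arrives first.
        backup = pool.submit(call_replica, "replica-b", 0.01, query)
        done, _ = cf.wait([primary, backup],
                          return_when=cf.FIRST_COMPLETED)
        return done.pop().result()
```

Because the hedge fires only after the delay, most requests never issue a second copy, so the added load stays small.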

Cross-Request Long-Term Adaptations

In this section, techniques are presented for reducing latency variability caused by coarser-grain phenomena (such as service-time variations and load imbalance).

Large Information Retrieval Systems

In large information-retrieval (IR) systems, speed is more than a performance metric; it is a necessity, since returning good results quickly is better than returning the best results slowly. Two techniques apply to such systems, and to any other system that deals with imprecise results:

- Good-enough results: once a sufficient fraction of the leaf servers has responded, the user may be better served by slightly incomplete (“good-enough”) results delivered quickly than by complete results delivered slowly.
- Canary requests: send the request first to one or two leaf servers, and fan it out to the remaining servers only after a successful canary response; this guards against untested code paths that could crash or severely slow thousands of servers simultaneously.


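A good-enough-results loop might look like this sketch; the leaf names and latencies are invented, and a production version would also cancel or deprioritize the stragglers:

```python
import concurrent.futures as cf
import time

def query_leaf(name, latency_s):
    # Placeholder RPC to one leaf; the sleep simulates its latency.
    time.sleep(latency_s)
    return f"hits from {name}"

def good_enough(leaves, deadline_s):
    pool = cf.ThreadPoolExecutor(max_workers=len(leaves))
    futures = [pool.submit(query_leaf, n, d) for n, d in leaves]
    # Wait until the deadline, then answer with whatever has arrived:
    # slightly incomplete results now beat complete results later.
    done, _ = cf.wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # stragglers finish in the background
    return [f.result() for f in done]
```

The fraction of leaves that answered can also be checked before replying, so the service never returns results that are *too* incomplete.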
Bear in mind that the techniques discussed above are most applicable for operations that do not perform critical mutations of the system’s state. Tolerating latency variability for operations that do mutate state is easier, for several reasons. First, the scale of latency-critical modifications in these services is generally small. Second, updates can often be performed off the critical path, after responding to the user1. Third, many services can be designed to tolerate inconsistent update models. Finally, services that require consistent updates generally use quorum-based algorithms (e.g., Paxos, Raft); since such algorithms must commit to only a majority of replicas, they are inherently tail-tolerant.
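The quorum point can be illustrated with a toy commit loop (not a real Paxos/Raft implementation; the replica names and latencies are invented): the write is acknowledged as soon as a majority responds, so the slowest replica stays off the critical path.

```python
import concurrent.futures as cf
import time

def replicate(name, latency_s, record):
    # Placeholder for sending the record to one replica.
    time.sleep(latency_s)
    return name

def quorum_write(record, replicas):
    majority = len(replicas) // 2 + 1
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replicate, n, d, record) for n, d in replicas]
    acked = []
    # Collect acknowledgments as they arrive, fastest first.
    for f in cf.as_completed(futures):
        acked.append(f.result())
        if len(acked) >= majority:
            pool.shutdown(wait=False)  # commit without the straggler
            return acked
```

With five replicas, the commit latency is set by the third-fastest response, never by the slowest one.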



  1. By enqueuing the operations, for example.↩︎