In moderate-size systems, temporary high-latency episodes are not important. But at large scale these events might dominate the overall service performance. The article talks about techniques to help building large online services which are predictably responsive although being composed of less-predictable parts. Such systems are called “latency tail-tolerant”, or simply “tail-tolerant.”
Why Variability Exists?
High tail latency in individual components may be caused by variability of response and such variability can arise for many reasons, including:
Shared Resources: Machines might be shared by different applications that compete for shared resources such as disk, network bandwidth, CPU cores, etc. Even in a single application you might have different requests contending for resources;
Daemons: Background daemons may use only limited resources on average but when scheduled can generate multi-millisecond hiccups;
Global resource sharing: Applications running on different machines might contend for global resources such as switches and shared file systems;
Maintenance activities: Background activities such as data reconstruction in distributed file systems, periodic log compaction, garbage collection, can cause periodic spikes in latency; and
Queueing: Multiple layers of queueing in intermediate servers and network switches amplify this variability.
Increased variability can also happen because of hardware trends:
Power limits: Modern CPUs are designed to temporarily run above their average power envelope, mitigating thermal effects by throttling if this activity is sustained for a long period;
Garbage collection: Solid-state storage devices provide very fast random read access, but the need to periodically garbage collect a large number of data blocks can increase read latency by a factor of 100 with even a modest level of write activity; and
Energy management: Power-saving modes in many types of devices save considerable energy but add additional latency when moving from inactive to active modes.
Component-Level Variability Amplified By Scale
A common technique for reducing latency in large-scale systems is to parallelize sub-operations across many machines, where each sub-operation is co-located with its portion of a large dataset. Parallelization is achieved by fanning out the requests to leaf servers and merging the responses via a request-distribution tree.
At the service level the variability is noted when one of these sub-requests is handled by a component that is experiencing problems and the response needs to wait for such problematic server.
Reducing Component Variability
Some small engineering decisions might help to improve response-time variability and ensure that requests are serviced in a timely manner. Some examples of such decisions include the following:
Differentiating service classes and higher-level queuing: Differentiated service classes can be used to prefer scheduling requests for which a user is waiting over non-interactive requests. Keep low-level queues short so higher-level policies take effect more quickly.
Reducing head-of-line blocking: It is sometimes useful for the system to break long-running requests into a sequence of smaller requests, allowing interleaving of the execution of other short-running requests.
Managing background activities and synchronized disruption: A combination of throttling, breaking down heavyweight operations into smaller operations, and triggering such operations at times of lower overall load is often able to reduce the effect of background activities on interactive request latency. If the service has big fan-out operations, it might be useful for the system to synchronize the background activities across different machines. This synchronization enforces a brief burst of activity on each machine simultaneously, slowing only interactive requests that are being handled during this brief period of background activity. In contrast, without synchronization, a few machines are always doing some background activity, pushing out the latency tail on all requests.
Living with Latency Variability
The techniques shown above are essential for building high-performance interactive services, but are not enough to eliminate all latency variability.
Within Request Short-Term Adaptations
Within a single higher-level request the following techniques can be applied to reduce latency variability:
Hedged requests: A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. The remaining requests can be canceled after receiving a response from one server and to improve the increasing load, the secondary request can be made only if the primary exceeds the 95th-percentile expected latency.
Tied requests: If the hedged requests begin execution in different servers at the same time, the system will waste resource unnecessarily. To avoid such cases the system can use tied requests where the servers know about other servers that also received the request and whenever a server s starts handling a request r it communicates with the other server t to tell that t can remove request r from its queue.
An alternative to both schemes above is to probe the servers first and dispatch the request to the least-loaded server. But bear in mind that the load might change between the time of probing and when sending the request.
Cross-Request Long-Term Adaptations
In this section, techniques are presented for reducing latency variability caused by coarser-grain phenomena (such as service-time variations and load imbalance).
Micro-partitions: To combat imbalance, you might generate many more partitions than there are machines in the service, and then do dynamic assignment and load balancing of these partitions to particular machines. This technique is similar to the virtual nodes presented in Chord .
Selective replication: This is an enhancement of the micro-partitioning scheme above, and consists of detecting or predicting certain items that are likely to case load imbalance and create additional replicas of such items. Then load-balancing systems can use the additional replicas to spread the load of these hot micro-partitions across multiple machines without having to actually move micro-partitions.
Latency-induced probation: By observing the latency distribution of responses from the various machines in the system, intermediate servers may detect situations where the system performs better by excluding a particularly slow machine, or putting it on probation. The source of slowness might be a temporary problem so the system might continue issuing shadow requests to these excluded servers, collecting metrics on their latency to later decide if such machines should be brought back to the “active” service.
Large Information Retrieval Systems
In large information-retrieval (IR) systems, speed is more than a performance metric; it is a necessity, as returning good results quickly is more important than returning the best results slowly. Two techniques apply to such systems, and any other system that deal with imprecise results:
Good enough: In large IR systems, once a sufficient fraction of all the leaf servers has responded, the user may be best served by being given slightly incomplete (“good-enough”) results in exchange for better end-to-end latency.
Canary requests: Another problem that can occur in systems with very high fan-out is that a particular request exercises an untested code path, causing crashes or extremely long delays on thousands of servers simultaneously. To prevent such correlated crash scenarios, you might use “canary requests”; rather than sending a request to thousands of leaf servers, the root server sends it first to one or two leaf servers. The remaining servers are only queried if the root gets a successful response from the canary in a reasonable period of time.
Bear in mind that the techniques discussed above are far most applicable for operations that do not perform critical mutations of the system’s state. Operations that mutate state are easier to handle because the scale of latency-critical modifications in these services is generally small. Second, updates can often be performed off the critical path, after responding to the user1. Third, many services can be designed to tolerate inconsistent update models. And, finally, services that require strong consistent updates generally use quorum-based algorithms (e.g., Paxos, Raft) and since such algorithms need to commit to only a majority of replicas, they are inherently tail-tolerant.
 J. Dean and L. A. Barroso, The Tail at Scale, Communications of the ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
 I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, ACM SIGCOMM …, pp. 149–160, 2001.