Large-scale cluster management at Google with Borg
Google built a cluster management system called Borg that admits, schedules, starts, restarts, and monitors the full range of applications that Google runs. Borg provides three main benefits: it hides the details of resource management and failure handling so its users can focus on application development; it operates with very high reliability and availability, and supports applications that do the same; and it lets its users run workloads across tens of thousands of machines effectively.
The user perspective
Users submit work to Borg in the form of jobs, each of which consists of one or more tasks that all run the same program. Each job runs in one Borg cell, a set of machines that are managed as a unit.
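The job/task relationship can be pictured with a minimal sketch (the class and field names here are illustrative, not Borg's actual data model): a job names a single program and a task count, and every task runs that same program in one cell.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of Borg's job/task hierarchy.
@dataclass
class Task:
    index: int    # position of this task within its job
    program: str  # every task in a job runs the same program

@dataclass
class Job:
    name: str
    cell: str     # each job runs in exactly one Borg cell
    task_count: int
    program: str
    tasks: list = field(default_factory=list)

    def materialize(self):
        # Expand the job into its identical tasks.
        self.tasks = [Task(i, self.program) for i in range(self.task_count)]
        return self.tasks

job = Job(name="jfoo", cell="cc", task_count=3, program="server_binary")
job.materialize()
print(len(job.tasks))  # → 3
```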
The workload
The workload is heterogeneous, with two main types. The first is long-running services that should “never” go down and that handle short-lived, latency-sensitive requests. The second is batch jobs that take from a few seconds to a few days to complete and are much less sensitive to short-term performance variation.
Clusters and cells
The machines in a cell belong to a single cluster. A cluster lives inside a single datacenter building, and a collection of buildings makes up a site. A cluster usually hosts one large cell and may have a few smaller-scale test or special-purpose cells.
The machines in a cell are heterogeneous in several aspects, including: sizes (CPU, RAM, disk, network), processor type, performance, and capabilities like external IP address or flash storage. Most of these differences are hidden from the user by Borg.
Jobs and tasks
Each job has properties such as its name, owner, and the number of tasks it has. A job can have constraints that force its tasks to run on machines with specific attributes, such as processor architecture, OS version, or an external IP address.
Each task maps to a set of Linux processes running in a container on a machine; the vast majority of tasks do not run inside virtual machines.
The tasks also have properties, such as resource requirements and the task index within the job. These properties can be changed in a running job by pushing a new job configuration, and then instructing Borg to update the tasks. To reduce dependencies on the runtime environment, Borg programs are statically linked, and structured as packages, whose installation is orchestrated by Borg.
Other Borg jobs, monitoring systems, or a command-line tool can be used to operate on jobs via RPCs to Borg.
The following image illustrates the states that jobs and tasks go through during their lifetime:
Allocs
A Borg alloc (allocation) is a reserved set of resources on a machine in which one or more tasks can be run; the resources remain assigned whether or not they are used.
Priority, quota, and admission control
Each job has a priority, a small positive integer. Resources for a high-priority task can be obtained at the expense of lower-priority tasks, even if that means preempting them. A preempted task is rescheduled somewhere else in the cell.
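The preemption rule can be sketched as follows. This is a minimal illustration of the idea, not Borg's placement algorithm: to fit a higher-priority task on a full machine, strictly lower-priority tasks are chosen as eviction victims (all names and the CPU-only resource model are assumptions).

```python
# Sketch of priority-based preemption; data model is illustrative.
def find_placement(machine_tasks, machine_capacity, new_task):
    """Try to fit new_task; return tasks to preempt, or None if impossible."""
    free = machine_capacity - sum(t["cpu"] for t in machine_tasks)
    if free >= new_task["cpu"]:
        return []  # fits without preemption
    victims = []
    # Consider evicting lowest-priority tasks first.
    for t in sorted(machine_tasks, key=lambda t: t["priority"]):
        if t["priority"] >= new_task["priority"]:
            break  # never preempt equal- or higher-priority work
        victims.append(t)
        free += t["cpu"]
        if free >= new_task["cpu"]:
            return victims  # these get rescheduled elsewhere in the cell
    return None  # cannot place even with preemption

tasks = [{"name": "batch", "cpu": 6, "priority": 1},
         {"name": "svc", "cpu": 3, "priority": 9}]
print(find_placement(tasks, 10, {"name": "prod", "cpu": 5, "priority": 8}))
```

Here the incoming priority-8 task evicts the priority-1 batch task but leaves the priority-9 service alone.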
Quota is expressed as a vector of resource quantities (CPU, RAM, disk, etc.) at a given priority, for a period of time (typically months). Quota is used to decide which jobs to admit for scheduling. Higher-priority quota costs more than lower-priority quota, and quota allocation is handled outside Borg.
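An admission-control check against such a quota vector might look like the following sketch (the dict-based representation and function name are assumptions, not Borg's actual interface):

```python
# Illustrative quota check: quota is a vector of resource quantities
# at a given priority; a job is admitted only if its request fits.
def admits(quota, request):
    """quota/request: dicts like {"cpu": 100, "ram_gb": 256} at one priority."""
    return all(quota.get(res, 0) >= amount for res, amount in request.items())

quota_at_prod = {"cpu": 100, "ram_gb": 256, "disk_gb": 1000}
print(admits(quota_at_prod, {"cpu": 40, "ram_gb": 128}))   # → True (admitted)
print(admits(quota_at_prod, {"cpu": 140, "ram_gb": 128}))  # → False (rejected)
```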
Naming and monitoring
Each task has a name that is created using the “Borg name service” (BNS); this name includes the cell name, job name, and task number. For example, the 50th task in a job jfoo owned by user ubar in cell cc would be reachable via the address 50.jfoo.ubar.cc.borg.google.com.
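The BNS address format described above can be captured in a one-line formatting helper (the function name is hypothetical; only the address layout comes from the text):

```python
# Build a BNS-style address: <task>.<job>.<user>.<cell>.borg.google.com
def bns_address(task_index, job, user, cell):
    return f"{task_index}.{job}.{user}.{cell}.borg.google.com"

print(bns_address(50, "jfoo", "ubar", "cc"))
# → 50.jfoo.ubar.cc.borg.google.com
```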
Borg Architecture
A Borg cell consists of a set of machines, a logically centralized controller called the Borgmaster, and an agent process called the Borglet that runs on each machine in a cell.
Borgmaster
The Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler. The main process handles client RPCs that either mutate state (e.g., create job) or provide read-only access to data (e.g., lookup job). It also manages state machines for all of the objects in the system (machines, tasks, allocs, etc.), communicates with Borglets, and offers a web UI as a backup to Sigma.
The Borgmaster is replicated five times, and each replica maintains an in-memory copy of the state of the cell, along with a copy in a highly-available, distributed, Paxos-based store on the replica’s local disks.
Scheduling
When a job is submitted, the Borgmaster records it in the Paxos store and adds the job’s tasks to the pending queue. The scheduler scans this queue asynchronously and assigns tasks to machines if there are enough available resources that meet the job’s constraints. The scan proceeds from high to low priority, modulated by a round-robin scheme within each priority to ensure fairness.
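The scan order can be sketched as follows: priorities are visited highest first, and within one priority the queue is cycled so no single job starves the others. This is a much-simplified illustration (CPU-only resources, first-fit placement, and all names are assumptions):

```python
from collections import deque

# Sketch of the pending-queue scan: high -> low priority, round-robin
# within each priority band for fairness.
def scan_pending(pending, machines):
    """pending: {priority: deque of (job, task)}; machines: {name: free_cpu}."""
    assignments = []
    for priority in sorted(pending, reverse=True):   # high to low priority
        queue = pending[priority]
        for _ in range(len(queue)):                  # one round-robin pass
            job, task = queue.popleft()
            placed = next((m for m, free in machines.items()
                           if free >= task["cpu"]), None)
            if placed:
                machines[placed] -= task["cpu"]
                assignments.append((job, task["id"], placed))
            else:
                queue.append((job, task))            # stays pending
    return assignments

machines = {"m1": 4, "m2": 2}
pending = {9: deque([("jprod", {"id": 0, "cpu": 3})]),
           1: deque([("jbatch", {"id": 0, "cpu": 2})])}
print(scan_pending(pending, machines))
# → [('jprod', 0, 'm1'), ('jbatch', 0, 'm2')]
```

Note the priority-9 task is placed first and takes the larger machine; the batch task then fits on what remains.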
Borglet
The local Borg agent present on every machine in a cell is called Borglet. Its responsibilities include: starting and stopping tasks; restarting them if they fail; managing local resources by manipulating OS kernel settings; rolling over debug logs; and reporting the state of the machine to the Borgmaster and other monitoring systems.
To avoid the herding effect in case of failure, the Borgmaster controls the rate of communication with the Borglets. It polls the Borglets every few seconds to retrieve the machine’s current state and send any outstanding requests.
A Borglet is considered down if it does not respond to several poll messages. In this case, its tasks are rescheduled on other machines; once communication is restored, the Borgmaster tells the Borglet to kill the tasks that have been rescheduled, avoiding duplicates. If contact with the Borgmaster is lost, the Borglet continues normal operation, so currently-running tasks and services remain up even if all Borgmaster replicas fail.
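The missed-poll bookkeeping can be sketched like this. The threshold, dict-based state, and callback names are all illustrative, not Borg internals; the point is the two transitions: mark down and reschedule after repeated misses, then kill the now-duplicate tasks when contact returns.

```python
# Sketch of the Borgmaster's reaction to missed Borglet polls.
MISSED_POLL_LIMIT = 3  # illustrative threshold

def handle_poll_result(borglet, responded, reschedule, kill_duplicates):
    if responded:
        if borglet["down"]:
            # Contact restored: kill tasks that were rescheduled elsewhere
            # in the meantime, so no duplicates keep running.
            kill_duplicates(borglet["rescheduled_tasks"])
            borglet["rescheduled_tasks"] = []
            borglet["down"] = False
        borglet["missed"] = 0
    else:
        borglet["missed"] += 1
        if borglet["missed"] >= MISSED_POLL_LIMIT and not borglet["down"]:
            borglet["down"] = True
            borglet["rescheduled_tasks"] = list(borglet["tasks"])
            reschedule(borglet["rescheduled_tasks"])

rescheduled, killed = [], []
b = {"down": False, "missed": 0, "tasks": ["t0", "t1"], "rescheduled_tasks": []}
for _ in range(3):  # three missed polls -> considered down
    handle_poll_result(b, False, rescheduled.extend, killed.extend)
handle_poll_result(b, True, rescheduled.extend, killed.extend)  # contact back
print(rescheduled, killed)  # → ['t0', 't1'] ['t0', 't1']
```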
Availability
To provide availability, Borg employs several techniques, including:
- automatically rescheduling evicted tasks, on a new machine if necessary;
- reducing correlated failures by spreading tasks of a job across different failure domains (machines, racks, and power domains);
- limiting the allowed rate of task disruptions and the number of tasks from a job that can be simultaneously down during maintenance activities such as OS or machine upgrades;
- using declarative desired-state representations and idempotent mutating operations, so that a failed client can harmlessly resubmit any forgotten requests;
- rate-limiting the process of finding new places for tasks on machines that become unreachable, because Borg cannot distinguish large-scale machine failure from a network partition;
- avoiding repeating task::machine pairings that cause task or machine crashes.
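As one concrete illustration, the last technique above amounts to remembering crash-inducing pairings and excluding them from placement; this sketch uses a simple in-memory set (all names are hypothetical):

```python
# Illustrative blacklist of task::machine pairings that caused crashes,
# so the scheduler avoids retrying a known-bad placement.
crash_pairs = set()

def record_crash(task, machine):
    crash_pairs.add((task, machine))

def eligible(task, machine):
    return (task, machine) not in crash_pairs

record_crash("jfoo/7", "m42")
print(eligible("jfoo/7", "m42"))  # → False (avoided)
print(eligible("jfoo/7", "m43"))  # → True (other machines still allowed)
```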
Borgmaster uses a combination of techniques that enable it to achieve 99.99% availability in practice: replication for machine failures; admission control to avoid overload; and deploying instances using simple, low-level tools to minimize external dependencies. Each cell is independent of the others to minimize the chance of correlated operator errors and failure propagation. These goals, not scalability limitations, are the primary argument against larger cells.
References
- [1] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster management at Google with Borg,” in Proceedings of the Tenth European Conference on Computer Systems (EuroSys ’15), New York, NY, USA: ACM, 2015.