From the Raft paper:
Raft is a consensus algorithm for managing a replicated log.
It is used to achieve an agreement between multiple entities allowing them to serve as a coherent group that can tolerate failures of some of its members.
For example, you can have several Key/Value servers and want them to have the same values so in case some of them fail the system will still operate correctly since the data is the same. Using a consensus algorithm can help in this situation where for each operation, the servers agree on what is/should be the correct value.
As mentioned above you might want to use a consensus algorithm when you have multiple servers providing some kind of service, where the servers are viewed as a single group and data is replicated between them so if some of them fail the group can still provide the data that was previously stored.
Consensus algorithms must satisfy some properties in order to be useful1. They are the following:
v, then all correct processes decide
vmust been proposed by some process.
The algorithm centers around the role of the leader. One process is elected and then has the responsibility for managing the replicated log. Management means accepting log entries from clients, replicating them to other processes, and telling processes that is safe to apply log entries to their state machines.
What Raft does is to break the consensus problem into three subproblems:
With consensus you will probably work with a cluster of servers that most likely will have an odd number of members. The Raft paper says that a typical number of members for a Raft cluster is five. This means the group can tolerate two failures and still be able to work properly.
Generally speaking, if the cluster has
2 * F + 1 members it can tolerate
F failures and still function correctly.
There are three states that each server might be at any given time: leader, follower, or candidate. Under normal circumstances there will be exactly one leader and the rest of the servers will be followers. The followers don’t do anything besides responding to requests made by the leader2. As mentioned before, the leader handles clients requests, which then will result in requests made to the followers. Servers will become candidates when an election happens so the cluster can have a new leader to manage everything.
The same way it happens in real life when you elect politicians to act as president, governor, etc for a given period of time it happens with a Raft cluster. The leaders are elected for a period of time that is called term. Terms can have an arbitrary length and are numbered with consecutive integers. The election process starts a new term where one or more candidates can run and if elected, it will act as leader for the specified term until a new election is started. In case the candidates are tied the servers will start a new election (which means a new term) and proceed from there. At any given term at most one server will be leader.
The terms are also used to detect obsolete information like stale leaders. All servers store the current term they think is in effect. When the servers communicate with each other they exchange their current terms and if a server finds out that its current term is smaller than the other’s it will then update its current term to the larger value. In case this happen with candidates or the leader, they will also transition to the follower state. Any request with a stale term is rejected by the recipient.
The possible states and transitions between them are illustrated in the image below:
starts up ─┐ times out, │ ┌─────────starts ─────────┐ receives votes │ │ election │ ┌────from majority ───┐ │ │ ▼ │ of servers ▼ │ ┌──────────┴─┐ ┌──────────┴─┐ ┌────────────┐ └────▶│ Follower │ ┌────│ Candidate │──┐ │ Leader │ └────────────┘ │ └────────────┘ │ └────────────┘ ▲ ▲ discovers current ▲ │ │ │ │ leader or new term │times out, │ │ └─────────────┘ └───new ─┘ │ │ election │ │ │ │ │ │ │ └─────────────────────discovers server ─────────────────┘ with higher term
The communication between the servers is made via remote procedure calls (RPCs). The RPCs are retried in case the servers don’t receive a response in a timely manner.
All servers are followers when they start. They continue in this state as long as they receive valid requests from the leader or candidate. Leaders send periodic heartbeats to all followers to maintain their authority. When a follower doesn’t receive any requests over a period of time which is called election timeout the follower assumes there isn’t a leader and it will begin an election to choose a new leader.
To begin the election the follower increments its current term value and transitions to the candidate state. It issues a vote for itself and send RPCs requesting votes from other servers. These RPCs are called
┌──Vote─┐ │ │ │ ▼ ┌─────────┐ ┌┴────────┐ ┌─────────┐ │ Server │◀────RequestVote─────│ Server │────RequestVote──────▶│ Server │ └─────────┘ └─┬─────┬─┘ └─────────┘ │ │ │ │ ┌──RequestVote──┘ └───RequestVote──┐ │ │ │ │ │ │ ▼ ▼ ┌─────────┐ ┌─────────┐ │ Server │ │ Server │ └─────────┘ └─────────┘
The candidate will continue in this state until one of the following three things happen:
A candidate wins the election if it receives votes for the same term from a majority of servers in the entire cluster. Each server will vote for at most one candidate in a given term. The candidate is chosen by the first
RequestVote received3. Right after winning an election the now leader will send heartbeat messages to the servers in order to establish its authority and prevent any new election.
During the election the candidate can receive requests from other servers which might include a server claiming to be leader. If the leader’s current term is the same or bigger than the candidate’s term then the candidate recognizes the leader as legitimate and transitions to the follower state. If the term is smaller the candidate rejects the request and stays as candidate.
A tie can also happen during the election. If many followers become candidates at the same time the votes could be split and neither candidate will have the majority of votes.
To prevent split votes from happening indefinitely, Raft uses randomized election timeouts that are selected from a fixed interval, for example, 150-300ms. This results in servers timing out at different periods which will mean that in most cases a single server times out, starting a new election and winning it before other servers time out. The same mechanism helps in case of a tie because the servers will restart their randomized election timeout at the beginning of an election.
Clients interact with the leader by requesting commands to be executed by the replicated state machines. When receiving the request the leader appends the command to its log and send a request for the other servers asking them to do the same thing. This request is called
AppendEntries. When the new log entry is safely replicated the leader applies the command to its state machine and return the result of that execution to the client. Even in case of followers crashing, or network problems, the leader will keep trying to contact the followers and send the
AppendEntries until all followers have the same log as the leader.
┌───────┐ │Client │─────────Request────────┐ └───────┘ │ │ │ │ │ ▼ ┏━━━━━━━━━┓ ┌───────AppendEntries────────┃ Leader ┃────────AppendEntries────────┐ │ ┗━━━━━━━━━┛ │ │ │ │ │ │ │ │ │ ▼ │ │ ▼ ┌─────────┐ ┌──AppendEntries─┘ └──AppendEntries──┐ ┌─────────┐ │Follower │ │ │ │Follower │ └─────────┘ │ │ └─────────┘ │ │ │ │ │ │ │ │ ▼ ▼ ┌─────────┐ ┌─────────┐ │Follower │ │Follower │ └─────────┘ └─────────┘
Each log entry contains the command to be applied by the state machine and the term representing when the request was received by the leader. The term is used to detect inconsistencies between the logs from different members of the cluster. Each entry has also a number that identifies its position in the log.
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │┌────────┐┌───────┐│ │┌────────┐┌───────┐│ │┌────────┐┌───────┐│ ││index: 1││term: 1││ ││index: 2││term: 1││ ││index: 3││term: 2││ │└────────┘└───────┘│ │└────────┘└───────┘│ │└────────┘└───────┘│ │ ┌───────────────┐ │──────────▶│ ┌───────────────┐ │─────────▶│ ┌───────────────┐ │ │ │command: x = 2 │ │ │ │command: y = 1 │ │ │ │command: x = 3 │ │ │ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │ └───────────────────┘ └───────────────────┘ └───────────────────┘
The leader has the responsibility of deciding when it is safe to apply a given entry to the state machines. When is safe to apply an entry it is said that the entry is committed. The log entry is committed when the leader that created the entry has replicated it on a majority of servers. Raft guarantees that committed entries are durable and will eventually be executed by all other available members of the cluster. Previous entries are committed as well once a subsequent entry is committed. It doesn’t matter if the entries were accepted by the current leader or a previous leader.
AppendEntries request the leader also includes the highest index of an entry that is committed. When the followers receive the request they check this index and execute an entry to their state machine if they haven’t done it before. In case they haven’t applied several entries they will do it in the order they appear in the log.
Raft has a Log Matching Property, composed by two properties, that states:
The first derives from the fact that a leader creates at most one log entry in a specific index for a given term and log entries never change their position. The second property comes from the fact that when sending
AppendEntries requests, the leader includes the index and term of the entry that immediately precedes the new entries. If a follower doesn’t have the previous entry it will reject the request and the leader will send a new
AppendEntries with the missing entries as well.
Such inconsistencies are handled by the leader forcing the followers to match their logs with the leader’s. This means that entries in the follower logs will be overwritten with entries from the leader’s log. The leader finds the latest matching entry between its log and the follower log, deletes anything from that point in the follower log and sends the correct entries after that point from the leader log.
The steps mentioned earlier are not sufficient for ensuring that the same commands are executed in the same order by all state machines. Raft adds restrictions about who can be elected and how entries from previous terms can be committed.
All committed entries from previous terms must be present on each new leader from the moment of its election. This means that log entries only flow from leaders to followers.
A candidate can only be elected if it has all committed entries in its log. And to win the election the candidate must contact a majority of the alive servers which means that at least one of these other servers must also have the committed entries in their logs.
The candidate sends the
RequestVote RPC with information about its own log and the voter denies the vote if the candidate’s log is outdated. The rule is as follows: if the last entries in the logs have a different term, the log with the later term is more up-to-date. If the logs have the same term for their last entries then the log which is longer (largest index for last entry) is more up-to-date.
As mentioned before, the leader knows when it’s safe to apply an entry from its current term once the entry is stored on a majority of the servers. In case the leader crashes a new leader will continue replicating the entry to the other servers but it can’t conclude that an entry from a previous term is committed when it is stored on a majority of servers. The following image illustrates this case:
1 2 1 2 1 2 3 1 2 3 1 2 3 ┏━━━━┳━━━━┓ ┌────┬────┐ ┏━━━━┳━━━━┳━━━━┓ ┌────┬ ─ ─ ┏━━━━┳━━━━┳━━━━┓ S1 ┃ 1 ┃ 2 ┃ │ 1 │ 2 │ ┃ 1 ┃ 2 ┃ 4 ┃ │ 1 │ 3 │ ┃ 1 ┃ 2 ┃ 4 ┃ ┗━━━━┻━━━━┛ └────┴────┘ ┗━━━━┻━━━━┻━━━━┛ └────┴ ─ ─ ┗━━━━┻━━━━┻━━━━┛ ┌────┬────┐ ┌────┬────┐ ┌────┬────┐ ┌────┬ ─ ─ ┌────┬────┬ ─ ─ S2 │ 1 │ 2 │ │ 1 │ 2 │ │ 1 │ 2 │ │ 1 │ 3 │ │ 1 │ 2 │ 4 │ └────┴────┘ └────┴────┘ └────┴────┘ └────┴ ─ ─ └────┴────┴ ─ ─ ┌────┐ ┌────┐ ┌────┬ ─ ─ ┌────┬ ─ ─ ┌────┬────┬ ─ ─ S3 │ 1 │ │ 1 │ │ 1 │ 2 │ │ 1 │ 3 │ │ 1 │ 2 │ 4 │ └────┘ └────┘ └────┴ ─ ─ └────┴ ─ ─ └────┴────┴ ─ ─ ┌────┐ ┌────┐ ┌────┐ ┌────┬ ─ ─ ┌────┐ S4 │ 1 │ │ 1 │ │ 1 │ │ 1 │ 3 │ │ 1 │ └────┘ └────┘ └────┘ └────┴ ─ ─ └────┘ ┌────┐ ┏━━━━┳━━━━┓ ┌────┬────┐ ┏━━━━┳━━━━┓ ┌────┬────┐ S5 │ 1 │ ┃ 1 ┃ 3 ┃ │ 1 │ 3 │ ┃ 1 ┃ 3 ┃ │ 1 │ 3 │ └────┘ ┗━━━━┻━━━━┛ └────┴────┘ ┗━━━━┻━━━━┛ └────┴────┘ (a) (b) (c) (d) (e)
In (a) S1 is the leader (bold box) and replicates the entry to S2 before crashing. In (b) S5 is elected (with votes from S3, S4 and itself) and accepts a new entry at index 2. In (c) S5 crashes and S1 is elected, the entry from term 2 is replicated in a majority of servers but is not committed. In case of S1 crashing and and S5 being elected again it could overwrite the entries from term 2 as in (d).
To avoid such problems, Raft never commits log entries from previous terms by counting replicas. Only entries from the current term are committed by counting replicas. Because when an entry is committed the previous entries are committed as well, if case (e) happened and S1 committed the entry from term 4 the entry from term 2 would be committed as well. This would make impossible for S5 to win an election again and overwrite entries.
Raft guarantees the following properties. You can read the paper for more information and arguments to sustain that the properties hold at all times.
The logs grow as time passes and clients interact with the leader. In a real system they can’t grow unbounded because they would require a lot of memory and more time to replay. One approach to compact the logs and discard obsolete information is snapshotting. This is done by writing the current state to a snapshot on stable storage and discarding the entire log up to the point that was stored in the snapshot.
In Raft each server takes snapshots independently, considering just the committed entries in its log. The current state of the state machine is written to the snapshot together with other data: the last included index which is the index of the last entry the snapshot replaces in the log, and the last included term which is the term for the same entry. When the server completes writing the snapshot it may remove all log entries up to the last included index and also can get rid of any prior snapshot.
1 2 3 4 5 6 7 ┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐ │ term: 1 │ term: 1 │ term: 1 │ term: 2 │ term: 3 │ term: 3 │ term: 3 │ before │ x ◀─── 3 │ y ◀─── 9 │ y ◀─── 7 │ x ◀─── 2 │ x ◀─── 5 │ y ◀───11 │ x ◀─── 0 │ └──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘ snapshot 6 7 ┌────────────────────────┐ ┌──────────┬──────────┐ │ last included index: 5 │ │ term: 3 │ term: 3 │ after │ last included term: 3 │ │ y ◀───11 │ x ◀─── 0 │ │ state machine state: │ └──────────┴──────────┘ │ x ◀─── 5 │ │ y ◀─── 7 │ └────────────────────────┘
The leader might need to send snapshots to followers that lag behind. This could happen if a slow follower or a new member that just joined the cluster don’t have the necessary log information and the leader might already created a snapshot of its log.
In this case the leader will send the
InstallSnapshot RPC to bring the follower up-to-date. The follower decides what to do when receiving the RPC. In the case the snapshot contains new information not already in its log, the follower deletes the entire log. If the snapshot describes a prefix of the follower’s log then it can remove the entries that are covered by the snapshot and retain the new entries the snapshot does not cover.
This is just an overview about the algorithm. The paper, of course, has more information such as cluster membership changes, the information each server holds, the definition of each RPC, the exact rules the servers must obey when receiving and sending requests, etc. If you want to understand more about it maybe because you want to implement the algorithm itself4, or if you intend to use some implementation, be sure to read the paper, it is a pleasant paper to read.
As you have seen, the post didn’t compare Raft with any other consensus algorithm. The paper has a comparison with Paxos, including results from a study group to compare the understandability of Paxos and Raft. For more information about other consensus algorithms you can search for different algorithms like Paxos, Viewstamped Replication and Zab, to name a few.
Instead of simply outputting default values, for example.↩
If contacted by clients, followers reject the request but can tell the client which server might be the leader at that point.↩
Which I recommend because you will learn a lot from it. I’m biased because that was my case. :)↩