forked from hashicorp/nomad
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request hashicorp#491 from dnephin/dnephin/add-dev-docs
First draft of developer docs
- Loading branch information
Showing
2 changed files
with
151 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# Raft Developer Documentation | ||
|
||
This documentation provides a high level introduction to the `hashicorp/raft` | ||
implementation. The intended audience is anyone interested in understanding | ||
or contributing to the code. | ||
|
||
## Contents | ||
|
||
1. [Terminology](#terminology) | ||
2. [Operations](#operations) | ||
1. [Apply](./apply.md) | ||
3. [Threads](#threads) | ||
|
||
|
||
## Terminology | ||
|
||
This documentation uses the following terms as defined. | ||
|
||
* **Cluster** - the set of peers in the raft configuration | ||
* **Peer** - a node that participates in the consensus protocol using `hashicorp/raft`. A | ||
peer may be in one of the following states: **follower**, **candidate**, or **leader**. | ||
* **Log** - the full set of log entries. | ||
* **Log Entry** - an entry in the log. Each entry has an index that is used to order it | ||
relative to other log entries. | ||
* **Committed** - A log entry is considered committed if it is safe for that entry to be | ||
applied to state machines. A log entry is committed once the leader that created the | ||
entry has replicated it on a majority of the peers. A peer has successfully | ||
replicated the entry once it is persisted. | ||
* **Applied** - log entry applied to the state machine (FSM) | ||
* **Term** - raft divides time into terms of arbitrary length. Terms are numbered with | ||
consecutive integers. Each term begins with an election, in which one or more candidates | ||
attempt to become leader. If a candidate wins the election, then it serves as leader for | ||
the rest of the term. If the election ends with a split vote, the term will end with no | ||
leader. | ||
* **FSM** - finite state machine, stores the cluster state | ||
* **Client** - the application that uses the `hashicorp/raft` library | ||
|
||
## Operations | ||
|
||
### Leader Write | ||
|
||
Most write operations must be performed on the leader. | ||
|
||
* RequestConfigChange - update the raft peer list configuration | ||
* Apply - apply a log entry to the log on a majority of peers, and the FSM. See [raft apply](apply.md) for more details. | ||
* Barrier - a special Apply that does not modify the FSM, used to wait for previous logs to be applied | ||
* LeadershipTransfer - stop accepting client requests, and tell a different peer to start a leadership election | ||
* Restore (Snapshot) - overwrite the cluster state with the contents of the snapshot (excluding cluster configuration) | ||
* VerifyLeader - send a heartbeat to all voters to confirm the peer is still the leader | ||
|
||
### Follower Write | ||
|
||
* BootstrapCluster - store the cluster configuration in the local log store | ||
|
||
|
||
### Read | ||
|
||
Read operations can be performed on a peer in any state. | ||
|
||
* AppliedIndex - get the index of the last log entry applied to the FSM | ||
* GetConfiguration - return the latest cluster configuration | ||
* LastContact - get the last time this peer made contact with the leader | ||
* LastIndex - get the index of the latest stored log entry | ||
* Leader - get the address of the peer that is currently the leader | ||
* Snapshot - snapshot the current state of the FSM into a file | ||
* State - return the state of the peer | ||
* Stats - return some stats about the peer and the cluster | ||
|
||
## Threads | ||
|
||
Raft uses the following threads to handle operations. The name of the thread is in bold, | ||
and a short description of the operation handled by the thread follows. The main thread is | ||
responsible for handling many operations. | ||
|
||
* **run** (main thread) - different behaviour based on peer state | ||
* follower | ||
* processRPC (from rpcCh) | ||
* AppendEntries | ||
* RequestVote | ||
* InstallSnapshot | ||
* TimeoutNow | ||
* liveBootstrap (from bootstrapCh) | ||
* periodic heartbeatTimer (HeartbeatTimeout) | ||
* candidate - starts an election for itself when called | ||
* processRPC (from rpcCh) - same as follower | ||
* acceptVote (from askPeerForVote) | ||
* leader - first starts replication to all peers, and applies a Noop log to ensure the new leader has committed up to the commit index | ||
* processRPC (from rpcCh) - same as follower, however we don’t actually expect to receive any RPCs other than a RequestVote | ||
* leadershipTransfer (from leadershipTransferCh) - | ||
* commit (from commitCh) - | ||
* verifyLeader (from verifyCh) - | ||
* user restore snapshot (from userRestoreCh) - | ||
* changeConfig (from configurationChangeCh) - | ||
* dispatchLogs (from applyCh) - handle client Raft.Apply requests by persisting logs to disk, and notifying replication goroutines to replicate the new logs | ||
* checkLease (periodically LeaseTimeout) - | ||
* **runFSM** - has exclusive access to the FSM, all reads and writes must send a message to this thread. Commands: | ||
* apply logs to the FSM, from the fsmMutateCh, from processLogs, from leaderLoop (leader) or appendEntries RPC (follower/candidate) | ||
* restore a snapshot to the FSM, from the fsmMutateCh, from restoreUserSnapshot (leader) or installSnapshot RPC (follower/candidate) | ||
* capture snapshot, from fsmSnapshotCh, from takeSnapshot (runSnapshot thread) | ||
* **runSnapshot** - handles the slower part of taking a snapshot. From a pointer captured by the FSM.Snapshot operation, this thread persists the snapshot by calling FSMSnapshot.Persist. Also calls compactLogs to delete old logs. | ||
* periodically (SnapshotInterval) takeSnapshot for log compaction | ||
* user snapshot, from userSnapshotCh, takeSnapshot to return to the user | ||
* **askPeerForVote (candidate only)** - short lived goroutine that synchronously sends a RequestVote RPC to all voting peers, and waits for the response. One goroutine per voting peer. | ||
* **replicate (leader only)** - long running goroutine that synchronously sends log entry AppendEntry RPCs to all peers. Also starts the heartbeat thread, and possibly the pipelineDecode thread. Runs sendLatestSnapshot when AppendEntry fails. | ||
* **heartbeat (leader only)** - long running goroutine that synchronously sends heartbeat AppendEntry RPCs to all peers. | ||
* **pipelineDecode (leader only)** |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
# Raft Apply | ||
|
||
Apply is the primary operation provided by raft. | ||
|
||
|
||
This sequence diagram shows the steps involved in a `raft.Apply` operation. Each box | ||
across the top is a separate thread. The name in the box identifies the state of the peer | ||
(leader or follower) and the thread (`<peer state>:<thread name>`). When there are | ||
multiple copies of the thread, it is indicated with `(each peer)`. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
participant client | ||
participant leadermain as leader:main | ||
participant leaderfsm as leader:fsm | ||
participant leaderreplicate as leader:replicate (each peer) | ||
participant followermain as follower:main (each peer) | ||
participant followerfsm as follower:fsm (each peer) | ||
client-)leadermain: applyCh to dispatchLogs | ||
leadermain->>leadermain: store logs to disk | ||
leadermain-)leaderreplicate: triggerCh | ||
leaderreplicate-->>followermain: AppendEntries RPC | ||
followermain->>followermain: store logs to disk | ||
opt leader commit index is ahead of peer commit index | ||
followermain-)followerfsm: fsmMutateCh <br>apply committed logs | ||
followerfsm->>followerfsm: fsm.Apply | ||
end | ||
followermain-->>leaderreplicate: respond success=true | ||
leaderreplicate->>leaderreplicate: update commitment | ||
opt quorum commit index has increased | ||
leaderreplicate-)leadermain: commitCh | ||
leadermain-)leaderfsm: fsmMutateCh | ||
leaderfsm->>leaderfsm: fsm.Apply | ||
leaderfsm-)client: future.respond | ||
end | ||
``` |