
raft: relax fsync requirements #124278

Open
pav-kv opened this issue May 16, 2024 · 3 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@pav-kv
Collaborator

pav-kv commented May 16, 2024

Raft requires an fsync when casting votes and when accepting entries into the log. All decisions on the critical path towards committing an entry must be locally durable. In this model, a node restart is indistinguishable from the node being slow.

This has a couple of downsides. For one, fsync latency adds directly to commit latency.

More critically, in practice some environments either explicitly disable fsync (to avoid the aforementioned latency) or have system-level bugs in fsync itself that violate this requirement. At best this leads to node crashes in raft (e.g. when the follower state regresses after a restart and the leader tries to move it forward with out-of-bounds updates) and loss of quorum; in worse cases, to busy loops, silent durability loss, or silently un-committing entries.

Raft's failure model can be relaxed in order to work around a minority of nodes having fsync problems. It is sufficient to require "collective" durability: the state is durable on a quorum, but every individual replica can fall behind on restart. This model is supported in Viewstamped Replication (sections 4.3, 5.1) and can be adopted in any consensus algorithm.

To support this model, a restarted node must not immediately assume that its state is durable. Instead, it starts as a learner and learns the up-to-date state (or confirms that its state is intact) before becoming a voter again. The node can do so by coordinating with a quorum and/or the leader. In a way, a replica restart is equivalent to a permanent death of this replica, followed by a safe reconfiguration in which its new incarnation rejoins the group.
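
A minimal Go sketch of this rejoin flow, assuming a hypothetical Replica type and a quorum-durable index learned out of band (none of these names are the etcd/raft or CockroachDB API):

```go
// Hypothetical sketch: a restarted replica rejoins as a non-voting learner and
// only promotes itself back to voter once it has caught up to state confirmed
// by a quorum and/or the leader.
package main

import "fmt"

// Role of a replica in the group. Names are illustrative only.
type Role int

const (
	Learner Role = iota
	Voter
)

// Replica models just enough state for the rejoin flow described above.
type Replica struct {
	id        uint64
	role      Role
	lastIndex uint64 // highest log index present locally (may have regressed)
}

// Rejoin runs after a restart. quorumDurableIndex would be learned from a
// quorum and/or the leader; here it is passed in to keep the sketch
// self-contained.
func (r *Replica) Rejoin(quorumDurableIndex uint64) {
	// Do not assume local state is durable: start as a learner.
	r.role = Learner

	if r.lastIndex < quorumDurableIndex {
		// Catch up from the leader/quorum before voting again. In a real
		// implementation this is the normal log replication / snapshot path.
		fmt.Printf("replica %d: catching up from %d to %d as a learner\n",
			r.id, r.lastIndex, quorumDurableIndex)
		r.lastIndex = quorumDurableIndex
	}

	// Only now is it safe to count this replica in quorums again. Conceptually
	// this is a reconfiguration: the old incarnation died, a new one rejoined.
	r.role = Voter
}

func main() {
	r := &Replica{id: 3, lastIndex: 40} // local state regressed below the quorum's 100
	r.Rejoin(100)
	fmt.Println("role:", r.role, "lastIndex:", r.lastIndex)
}
```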

Jira issue: CRDB-38804

Epic CRDB-39898

Related to / duplicate of #88442
Related to #113147

@pav-kv pav-kv added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels May 16, 2024

blathers-crl bot commented May 16, 2024

cc @cockroachdb/replication

@pav-kv pav-kv changed the title raft: support relaxed fsync requirements raft: relax fsync requirements May 16, 2024
@pav-kv
Collaborator Author

pav-kv commented May 17, 2024

There are probably some nuances here.

The replica can't be fully "inactive" while it learns the state. Consider a scenario in which all 3/3 replicas restart simultaneously. They should be able to learn even though there is no active leader and no replica has transitioned to voter yet. One way to make this work is for all replicas to be learners that are readable by other replicas (a kind of "read-only voter"), so that everyone can bootstrap from a quorum. This scenario is where the risk of losing state is the highest: if there is no quorum with up-to-date state, everyone will bootstrap stale. This risk exists today too, if fsync is disabled.
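
A hedged sketch of this bootstrap-from-a-quorum step, with illustrative names only (recoveryTarget and LogMark are assumptions, not an existing API): each restarted learner collects durable log marks from its peers and proceeds only once at least a quorum has responded.

```go
// Sketch: pick the state to catch up to from a quorum of peer responses.
package main

import (
	"fmt"
	"sort"
)

// LogMark identifies a log position by term and index.
type LogMark struct {
	Term  uint64
	Index uint64
}

// Less orders log marks lexicographically by (Term, Index).
func (a LogMark) Less(b LogMark) bool {
	if a.Term != b.Term {
		return a.Term < b.Term
	}
	return a.Index < b.Index
}

// recoveryTarget picks the state to catch up to, given marks reported by peers
// (including the local one). It requires at least a quorum of the group; with
// fewer responses, recovery must wait.
func recoveryTarget(marks []LogMark, groupSize int) (LogMark, bool) {
	quorum := groupSize/2 + 1
	if len(marks) < quorum {
		return LogMark{}, false // not enough responses; keep waiting
	}
	// Sort descending and take the most advanced mark among the responders.
	// If no responder has the latest committed state, the group bootstraps
	// stale -- exactly the risk noted above.
	sort.Slice(marks, func(i, j int) bool { return marks[j].Less(marks[i]) })
	return marks[0], true
}

func main() {
	marks := []LogMark{{Term: 5, Index: 90}, {Term: 5, Index: 120}, {Term: 4, Index: 130}}
	if target, ok := recoveryTarget(marks, 3); ok {
		fmt.Printf("catch up to term=%d index=%d before voting\n", target.Term, target.Index)
	}
}
```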

How does this work with config changes? What if a starting-up replica regressed to a pre-config-change state? Config changes modify quorums, and this should be reflected in the recovery procedure. Potentially, config changes are one of the few cases where durability guarantees must be stronger.

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
@pav-kv
Collaborator Author

pav-kv commented Oct 7, 2024

We can solve a sub-problem of this that will cover a lot of cases:

  • A follower crashes and restarts with a regressed log.
  • In the meantime, the leader did not change.

Today, the leader sends the Match index to the follower in MsgApp messages. The follower crashes if it detects that its log regressed.

Instead, we can do a simple modification:

  1. The follower replies to the leader with a special MsgAppResp saying: I regressed to this LogMark.
  2. The leader regresses this follower's Match index (potentially to 0 if the LogMark is from a lower term), and transitions its replication flow to StateProbe.
  3. The leader computes the would-be committed index (a median of the Match indices) accounting for this updated Match, and if it is below the current committed index (which would be uncommon if this follower wasn't "critical"), it surfaces a warning to the operator. The leader does not regress the committed index, though; it keeps the old one.
  4. The leader continues normal operation.

NB: with this change, the leader will not be able to commit any new entries until a quorum of Match indices catches up to the old committed index. For example, if this restarted follower was "critical", it will need to be caught up first. The situation would be equivalent to this follower being down - if it's critical, the range is stalled; if it's not critical, it's just being caught up in the background.
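
A rough Go sketch of steps 1-4 above, assuming hypothetical Leader/LogMark types and a HandleRegression handler (field and function names are not the actual raft package API):

```go
// Sketch: leader-side handling of a follower that reports a regressed log.
package main

import (
	"fmt"
	"sort"
)

// LogMark identifies a log position by term and index.
type LogMark struct {
	Term  uint64
	Index uint64
}

type Leader struct {
	term      uint64
	committed uint64
	match     map[uint64]uint64 // follower id -> Match index
	state     map[uint64]string // follower id -> replication flow state
}

// HandleRegression is the hypothetical handler for the special MsgAppResp
// ("I regressed to this LogMark") from step 1.
func (l *Leader) HandleRegression(from uint64, mark LogMark) {
	// Step 2: regress Match (to 0 if the mark is from a lower term) and probe.
	if mark.Term < l.term {
		l.match[from] = 0
	} else if mark.Index < l.match[from] {
		l.match[from] = mark.Index
	}
	l.state[from] = "StateProbe"

	// Step 3: recompute the would-be committed index (quorum median of Match)
	// and warn if it fell below the current committed index. The actual
	// committed index is NOT regressed.
	if w := l.wouldBeCommitted(); w < l.committed {
		fmt.Printf("warning: follower %d regression lowers quorum index to %d "+
			"(committed index %d is retained)\n", from, w, l.committed)
	}
	// Step 4: continue normal operation; new entries cannot commit until a
	// quorum of Match indices catches back up past l.committed.
}

// wouldBeCommitted returns the largest index replicated on a quorum
// (the median of Match indices for an odd-sized group).
func (l *Leader) wouldBeCommitted() uint64 {
	idx := make([]uint64, 0, len(l.match))
	for _, m := range l.match {
		idx = append(idx, m)
	}
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	return idx[len(idx)/2]
}

func main() {
	// Follower 3 was "critical" (follower 2 is still catching up at 80), so
	// its regression to index 40 drops the would-be committed index below 100
	// and triggers the warning.
	l := &Leader{
		term:      7,
		committed: 100,
		match:     map[uint64]uint64{1: 100, 2: 80, 3: 100},
		state:     map[uint64]string{1: "StateReplicate", 2: "StateReplicate", 3: "StateReplicate"},
	}
	l.HandleRegression(3, LogMark{Term: 7, Index: 40})
}
```

The key design point mirrored in the sketch is that only the Match-derived "would-be" value is recomputed for the warning; the committed index itself is never lowered.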
