
raft: relax fsync requirements #124278

Open
pav-kv opened this issue May 16, 2024 · 3 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@pav-kv
Collaborator

pav-kv commented May 16, 2024

Raft requires an fsync when casting votes and when accepting entries into the log. All decisions on the critical path towards committing an entry must be locally durable. In this model, a node restart is indistinguishable from the node being slow.

This has a couple of downsides. For one, fsync latency adds directly to commit latency.

More critically, in practice some environments either explicitly disable fsync (to avoid the aforementioned latency) or have system-level bugs in fsync itself that violate this requirement. At best this leads to node crashes in raft (e.g. when the follower state regresses after a restart and the leader tries to move it forward with out-of-bounds updates) and loss of quorum; in worse cases, to busy loops, silent durability loss, or silently un-committing entries.

Raft's failure model can be relaxed in order to work around a minority of nodes having fsync problems. It is sufficient to require "collective" durability: the state is durable on a quorum, but every individual replica can fall behind on restart. This model is supported in Viewstamped Replication (sections 4.3, 5.1) and can be adopted in any consensus algorithm.

To support this model, a restarted node must not immediately assume that its state is durable. Instead, it starts as a learner and learns the up-to-date state (or confirms that its state is intact) before becoming a voter again. The node can do so by coordinating with a quorum and/or the leader. In a way, a replica restart is equivalent to a permanent death of this replica, followed by a safe reconfiguration in which its new incarnation rejoins the group.
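
A minimal Go sketch of this rejoin flow, assuming a hypothetical Replica type and a quorum-durable index learned out of band (none of these names are the etcd/raft or CockroachDB API):

```go
// Hypothetical sketch: a restarted replica rejoins as a non-voting learner and
// only promotes itself back to voter once it has caught up to state confirmed
// by a quorum and/or the leader.
package main

import "fmt"

// Role of a replica in the group. Names are illustrative only.
type Role int

const (
	Learner Role = iota
	Voter
)

// Replica models just enough state for the rejoin flow described above.
type Replica struct {
	id        uint64
	role      Role
	lastIndex uint64 // highest log index present locally (may have regressed)
}

// Rejoin runs after a restart. quorumDurableIndex would be learned from a
// quorum and/or the leader; here it is passed in to keep the sketch
// self-contained.
func (r *Replica) Rejoin(quorumDurableIndex uint64) {
	// Do not assume local state is durable: start as a learner.
	r.role = Learner

	if r.lastIndex < quorumDurableIndex {
		// Catch up from the leader/quorum before voting again. In a real
		// implementation this is the normal log replication / snapshot path.
		fmt.Printf("replica %d: catching up from %d to %d as a learner\n",
			r.id, r.lastIndex, quorumDurableIndex)
		r.lastIndex = quorumDurableIndex
	}

	// Only now is it safe to count this replica in quorums again. Conceptually
	// this is a reconfiguration: the old incarnation died, a new one rejoined.
	r.role = Voter
}

func main() {
	r := &Replica{id: 3, lastIndex: 40} // local state regressed below the quorum's 100
	r.Rejoin(100)
	fmt.Println("role:", r.role, "lastIndex:", r.lastIndex)
}
```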

Jira issue: CRDB-38804

Epic CRDB-39898

Related to / duplicate of #88442
Related to #113147

@pav-kv pav-kv added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels May 16, 2024

blathers-crl bot commented May 16, 2024

cc @cockroachdb/replication

@pav-kv pav-kv changed the title raft: support relaxed fsync requirements raft: relax fsync requirements May 16, 2024
@pav-kv
Collaborator Author

pav-kv commented May 17, 2024

There are probably some nuances here.

The replica can't be fully "inactive" while it learns the state. Consider a scenario in which all 3/3 replicas restart simultaneously. They should be able to learn even though there is no active leader and no replica has transitioned to voter yet. One way to make this work is for all replicas to be learners that are readable by other replicas (a kind of "read-only voter"), so that everyone can bootstrap from a quorum. This scenario is where the risk of losing state is the highest: if there is no quorum with up-to-date state, everyone will bootstrap stale. This risk exists today too, if fsync is disabled.
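
A hedged sketch of this bootstrap-from-a-quorum step, with illustrative names only (recoveryTarget and LogMark are assumptions, not an existing API): each restarted learner collects durable log marks from its peers and proceeds only once at least a quorum has responded.

```go
// Sketch: pick the state to catch up to from a quorum of peer responses.
package main

import (
	"fmt"
	"sort"
)

// LogMark identifies a log position by term and index.
type LogMark struct {
	Term  uint64
	Index uint64
}

// Less orders log marks lexicographically by (Term, Index).
func (a LogMark) Less(b LogMark) bool {
	if a.Term != b.Term {
		return a.Term < b.Term
	}
	return a.Index < b.Index
}

// recoveryTarget picks the state to catch up to, given marks reported by peers
// (including the local one). It requires at least a quorum of the group; with
// fewer responses, recovery must wait.
func recoveryTarget(marks []LogMark, groupSize int) (LogMark, bool) {
	quorum := groupSize/2 + 1
	if len(marks) < quorum {
		return LogMark{}, false // not enough responses; keep waiting
	}
	// Sort descending and take the most advanced mark among the responders.
	// If no responder has the latest committed state, the group bootstraps
	// stale -- exactly the risk noted above.
	sort.Slice(marks, func(i, j int) bool { return marks[j].Less(marks[i]) })
	return marks[0], true
}

func main() {
	marks := []LogMark{{Term: 5, Index: 90}, {Term: 5, Index: 120}, {Term: 4, Index: 130}}
	if target, ok := recoveryTarget(marks, 3); ok {
		fmt.Printf("catch up to term=%d index=%d before voting\n", target.Term, target.Index)
	}
}
```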

How does this work with config changes? What if a starting-up replica regressed to a pre-config-change state? Config changes modify quorums, and this should be reflected in the recovery procedure. Potentially, config changes are one of the few cases where durability guarantees must be stronger.

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
@pav-kv
Collaborator Author

pav-kv commented Oct 7, 2024

We can solve a sub-problem of this that will cover a lot of cases:

  • A follower crashes and restarts with a regressed log.
  • In the meantime, the leader did not change.

Today, the leader sends the Match index to the follower in MsgApp messages. The follower crashes if it detects that its log regressed.

Instead, we can do a simple modification:

  1. The follower replies to the leader with a special MsgAppResp saying: I regressed to this LogMark.
  2. The leader regresses this follower's Match index (potentially to 0 if the LogMark is from a lower term), and transitions its replication flow to StateProbe.
  3. The leader computes the would-be committed index (a median of the Match indices) accounting for this updated Match, and if it is below the current committed index (which would be uncommon if this follower wasn't "critical"), it surfaces a warning to the operator. The leader does not regress the committed index, though; it keeps the old one.
  4. The leader continues normal operation.

NB: with this change, the leader will not be able to commit any new entries until a quorum of Match indices catches up to the old committed index. For example, if this restarted follower was "critical", it will need to be caught up first. The situation would be equivalent to this follower being down - if it's critical, the range is stalled; if it's not critical, it's just being caught up in the background.
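
A rough Go sketch of steps 1-4 above, assuming hypothetical Leader/LogMark types and a HandleRegression handler (field and function names are not the actual raft package API):

```go
// Sketch: leader-side handling of a follower that reports a regressed log.
package main

import (
	"fmt"
	"sort"
)

// LogMark identifies a log position by term and index.
type LogMark struct {
	Term  uint64
	Index uint64
}

type Leader struct {
	term      uint64
	committed uint64
	match     map[uint64]uint64 // follower id -> Match index
	state     map[uint64]string // follower id -> replication flow state
}

// HandleRegression is the hypothetical handler for the special MsgAppResp
// ("I regressed to this LogMark") from step 1.
func (l *Leader) HandleRegression(from uint64, mark LogMark) {
	// Step 2: regress Match (to 0 if the mark is from a lower term) and probe.
	if mark.Term < l.term {
		l.match[from] = 0
	} else if mark.Index < l.match[from] {
		l.match[from] = mark.Index
	}
	l.state[from] = "StateProbe"

	// Step 3: recompute the would-be committed index (quorum median of Match)
	// and warn if it fell below the current committed index. The actual
	// committed index is NOT regressed.
	if w := l.wouldBeCommitted(); w < l.committed {
		fmt.Printf("warning: follower %d regression lowers quorum index to %d "+
			"(committed index %d is retained)\n", from, w, l.committed)
	}
	// Step 4: continue normal operation; new entries cannot commit until a
	// quorum of Match indices catches back up past l.committed.
}

// wouldBeCommitted returns the largest index replicated on a quorum
// (the median of Match indices for an odd-sized group).
func (l *Leader) wouldBeCommitted() uint64 {
	idx := make([]uint64, 0, len(l.match))
	for _, m := range l.match {
		idx = append(idx, m)
	}
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	return idx[len(idx)/2]
}

func main() {
	// Follower 3 was "critical" (follower 2 is still catching up at 80), so
	// its regression to index 40 drops the would-be committed index below 100
	// and triggers the warning.
	l := &Leader{
		term:      7,
		committed: 100,
		match:     map[uint64]uint64{1: 100, 2: 80, 3: 100},
		state:     map[uint64]string{1: "StateReplicate", 2: "StateReplicate", 3: "StateReplicate"},
	}
	l.HandleRegression(3, LogMark{Term: 7, Index: 40})
}
```

The key design point mirrored in the sketch is that only the Match-derived "would-be" value is recomputed for the warning; the committed index itself is never lowered.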
