raft: relax fsync requirements #124278
cc @cockroachdb/replication
There are probably some nuances here. The replica can't be fully "inactive" while it learns the state. Consider a scenario in which all 3/3 replicas restart simultaneously: they should be able to learn even though there is no active leader and no replica has transitioned back to voter yet. One way to make this work is for all replicas to be learners that are readable by other replicas (a kind of "read-only voter"), so that everyone can bootstrap from a quorum. This scenario is also when the risk of losing state is highest: if no quorum holds up-to-date state, everyone bootstraps stale. That risk exists today too, if fsync is disabled.

How does this work with config changes? What if a restarting replica regressed to a pre-config-change state? Config changes modify quorums, and this must be reflected in the recovery procedure. Potentially, config changes are one of the few cases where durability guarantees must be stronger.
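The "everyone restarts" bootstrap described above can be sketched as follows. This is a hypothetical illustration, not CockroachDB or etcd/raft code: each restarted replica comes back as a read-only learner, collects the last-entry `(term, index)` pair from a quorum of peers (itself included), and adopts the most advanced state it sees. The names `logState`, `newer`, and `bootstrapState` are all invented for this sketch.

```go
package main

import "fmt"

// logState summarizes a restarted replica's local log.
type logState struct {
	term      uint64 // term of the last entry in the local log
	lastIndex uint64 // index of the last entry in the local log
}

// newer reports whether a is more up to date than b, using the usual raft
// log comparison: higher last term wins, then longer log.
func newer(a, b logState) bool {
	if a.term != b.term {
		return a.term > b.term
	}
	return a.lastIndex > b.lastIndex
}

// bootstrapState picks the most advanced state among the replies collected
// from a quorum of learners. If no quorum member retained the latest durable
// state (possible when fsync was skipped), the result is stale -- exactly
// the risk called out above.
func bootstrapState(quorumReplies []logState) logState {
	best := quorumReplies[0]
	for _, s := range quorumReplies[1:] {
		if newer(s, best) {
			best = s
		}
	}
	return best
}

func main() {
	// 3/3 replicas restarted; two of them regressed after losing unsynced writes.
	replies := []logState{
		{term: 4, lastIndex: 17},
		{term: 5, lastIndex: 20},
		{term: 5, lastIndex: 19},
	}
	fmt.Println(bootstrapState(replies)) // the {term:5, lastIndex:20} log wins
}
```

Note that the quorum requirement is what bounds the staleness: a single regressed learner cannot drag the group backwards as long as some quorum member still holds the latest committed state.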
We can solve a sub-problem of this that will cover a lot of cases:
Today, the leader sends the […]. Instead, we can make a simple modification: […]
NB: with this change, the leader will not be able to commit any new entries until a quorum of […]
Raft requires fsync when casting votes and when accepting entries into the log: every decision on the critical path toward committing an entry must be locally durable. Under this model, a node restart is indistinguishable from the node merely being slow.
This has a couple of downsides. For one, fsync latency adds to commit latency.
More critically, some environments in practice either disable fsync explicitly (to avoid the aforementioned latency) or suffer from bugs in fsync itself, violating this requirement. At best this leads to node crashes in raft (e.g. when a follower's state regresses after a restart and the leader tries to move it forward with out-of-band updates) and loss of quorum; at worst, to busy loops, silent durability loss, or silently un-committing previously committed entries.
Raft's failure model can be relaxed to work around a minority of nodes having fsync problems. It is sufficient to require "collective" durability: the state is durable on a quorum, even though any individual replica may fall behind on restart. This model is supported in Viewstamped Replication (sections 4.3, 5.1) and can be adopted in any consensus algorithm.
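A minimal sketch of the collective-durability commit rule, assuming the leader tracks each replica's highest locally fsynced index (the function name and signature are invented for illustration, not an etcd/raft or CockroachDB API): an entry counts as durable once a quorum has synced it, even if no single replica is guaranteed to keep it across its own restart.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumDurableIndex returns the highest log index that is fsynced on a
// majority of replicas, given each replica's highest locally-synced index.
// No individual replica needs to be durable; the quorum collectively is.
func quorumDurableIndex(synced []uint64) uint64 {
	s := append([]uint64(nil), synced...)
	sort.Slice(s, func(i, j int) bool { return s[i] > s[j] }) // descending
	// The (n/2+1)-th largest value is present on a majority of replicas.
	return s[len(s)/2]
}

func main() {
	// 3 replicas: one fast disk, one lagging fsync, one far behind.
	fmt.Println(quorumDurableIndex([]uint64{12, 9, 4})) // 9: synced on 2 of 3
}
```

This is the same median-style computation raft already uses for the commit index over match indexes; the relaxation is only in *which* acknowledgment feeds it (quorum fsync instead of per-replica fsync-before-ack).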
To support this model, a restarted node must not immediately assume its state is durable. Instead, it starts as a learner and learns the up-to-date state (or verifies that its own state is intact) before proceeding as a voter. The node can do so by coordinating with a quorum and/or the leader. In a way, a replica restart is equivalent to a permanent death of that replica, followed by a safe reconfiguration in which its new incarnation rejoins the group.
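The restart-as-learner flow above might look roughly like this. All names (`replica`, `rejoin`, the `learner`/`voter` phases) are illustrative assumptions, not etcd/raft or CockroachDB APIs; the point is only the ordering: do not vote or ack until the local log provably contains everything a quorum reports as committed.

```go
package main

import "fmt"

type phase int

const (
	learner phase = iota // restarted; must not vote or ack appends yet
	voter                // caught up; full participant again
)

type replica struct {
	phase     phase
	lastIndex uint64 // last entry in the local log (may have regressed)
}

// rejoin promotes the replica once its log reaches committedIndex, the
// highest index that a quorum (or the leader) reports as committed. fetch
// pulls the missing entries in (from, to] from peers.
func (r *replica) rejoin(committedIndex uint64, fetch func(from, to uint64)) {
	if r.lastIndex < committedIndex {
		fetch(r.lastIndex, committedIndex) // re-learn the suffix we may have lost
		r.lastIndex = committedIndex
	}
	// Equivalent to a reconfiguration: the old incarnation is treated as
	// dead, and the new one joins only after holding all committed state.
	r.phase = voter
}

func main() {
	// The replica lost unsynced writes and regressed to index 15, while a
	// quorum reports committedIndex=20.
	r := &replica{phase: learner, lastIndex: 15}
	r.rejoin(20, func(from, to uint64) {
		fmt.Printf("fetching entries (%d,%d] from peers\n", from, to)
	})
	fmt.Println(r.phase == voter, r.lastIndex)
}
```

One design consequence worth noting: because promotion waits on a quorum, this scheme alone cannot handle all replicas restarting at once, which is exactly the nuance raised in the first comment above.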
Jira issue: CRDB-38804
Epic CRDB-39898
Related/duplicate to #88442
Related to #113147