storage: Handle raft.ErrProposalDropped #21849
Proposal forwarding has become a general problem recently. We see this come up both in #37906 and #42821. In the former, we want to make sure that a follower that's behind can't propose (and get) a lease. Allowing only the raft leader to add proposals to its log would be a way to get that (modulo "command runs on old raft leader while there's already another one").

In the latter, we had to add tricky code below raft to re-add commands under a new lease applied index if we could detect that they had not applied and were no longer applicable. This solution is basically technical debt and it has cost us numerous subtle bugs over time. We want to get rid of it, which means making sure that lease applied indexes don't generally reorder.

Again, this can be achieved with good enough accuracy by only ever adding to the local log, i.e. never forwarding proposals. It's not trivial to set this up, but ultimately I think we'll be better off. The rules would be something like this:
Since only the leaseholder proposes to Raft, the leaseholder should be able to hold or regain leadership just fine. We just have to make sure that a follower won't ever campaign against an active leaseholder.
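As a minimal sketch of the "only ever add to the local log" rule (hypothetical, using the etcd/raft `RawNode` API; `proposeLocally` is an illustrative name, not actual CockroachDB code):

```go
package main

import "go.etcd.io/etcd/raft/v3"

// proposeLocally enforces the "never forward proposals" rule: only the
// raft leader may append to its log; everyone else fails fast instead
// of letting raft forward MsgProp to the leader.
func proposeLocally(rn *raft.RawNode, data []byte) error {
	if rn.Status().RaftState != raft.StateLeader {
		// Fail fast with the same error raft uses for dropped
		// proposals, so the caller can redirect or retry.
		return raft.ErrProposalDropped
	}
	return rn.Propose(data)
}
```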
(cc @nvanbenschoten and @ajwerner since we both talked about this)
46045: kvserver: avoid hanging proposal after leader goes down r=nvanbenschoten,petermattis a=tbg

Deflakes gossip/chaos/nodes=9, i.e. Fixes #38829.

There was a bug in range quiescence due to which commands would hang in raft for minutes before actually getting replicated. This would occur whenever a range was quiesced but a follower replica which didn't know the (Raft) leader would receive a request. This request would be evaluated and put into the Raft proposal buffer, and a ready check would be enqueued. However, no Ready would be produced (since the proposal got dropped by raft; leader unknown) and so the replica would not unquiesce.

This commit prevents this by always waking up the group if the proposal buffer was initially nonempty, even if an empty Ready is produced. It goes further than that by trying to ensure that a leader is always known while quiesced. Previously, on an incoming request to quiesce, we did not verify that the raft group had learned the leader's identity.

One shortcoming here is that in the situation in which the proposal would originally hang "forever", it will now hang for one heartbeat or election timeout where ideally it would be proposed more reactively. Since this is so rare I didn't try to address this. Instead, refer to the ideas in #37906 (comment) and #21849 for future changes that could mitigate this.

Without this PR, the test would fail around 10% of the time. With this change, it passed 40 iterations in a row without a hitch, via:

./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9

Release justification: bug fix

Release note (bug fix): a rare case in which requests to a quiesced range could hang in the KV replication layer was fixed. This would manifest as a message saying "have been waiting ... for proposing" even though no loss of quorum occurred.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
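A hypothetical sketch of the wake-up rule the PR describes (`maybeUnquiesceAfterReadyCheck` and the `unquiesce` callback are illustrative names, not the actual kvserver code):

```go
package main

import "go.etcd.io/etcd/raft/v3"

// maybeUnquiesceAfterReadyCheck encodes the fix described above: if we
// flushed proposals into raft but raft produced no Ready (e.g. the
// proposals were dropped because no leader is known), wake the group
// anyway so ticking resumes and a leader is eventually elected.
func maybeUnquiesceAfterReadyCheck(rn *raft.RawNode, proposalsFlushed int, unquiesce func()) {
	if rn.HasReady() {
		return // a Ready will be handled; the group is active anyway
	}
	if proposalsFlushed > 0 {
		// Empty Ready, but we did try to propose: without this
		// wake-up the proposal would sit in a quiesced group
		// indefinitely.
		unquiesce()
	}
}
```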
Some related musings: #102956 (comment)
etcd-io/etcd#9067 introduced a new error, `ErrProposalDropped`. Properly handling this error will allow us to reduce the occurrence of ambiguous failures (`Replica.executeWriteBatch` doesn't need to return an `AmbiguousResultError` if it has not successfully proposed), and may allow us to be more intelligent in our raft-level retries.

This error alone does not allow us to eliminate time-based reproposals, because raft's MsgProp forwarding is fire-and-forget. However, if we disabled raft-level forwarding and did our own forwarding, I think we could make the retry logic more deterministic (or at least hide the timing elements in the RPC layer).
Note that in typical usage of raft, you'd respond to this error by passing it up the stack to a layer that can try again on a different replica. We can't do that because of our leases: until the lease expires, no other node could successfully handle the request, so we have to just wait and retry on the leaseholder. (We might be able to use this to make lease requests themselves fail-fast, though.)
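Under that constraint, the retry could look something like this (a hypothetical loop with an arbitrary backoff; the real reproposal machinery and lease checks are more involved):

```go
package main

import (
	"context"
	"errors"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

// proposeWithRetry keeps retrying a dropped proposal on this replica,
// since under the lease no other replica could serve the request: the
// opposite of the generic raft pattern of rerouting to another node.
func proposeWithRetry(ctx context.Context, rn *raft.RawNode, data []byte) error {
	for {
		err := rn.Propose(data)
		if !errors.Is(err, raft.ErrProposalDropped) {
			return err // proposed (nil) or a hard error
		}
		select {
		case <-time.After(100 * time.Millisecond): // arbitrary backoff
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```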
Jira issue: CRDB-5872
Epic: CRDB-39898