storage: Handle raft.ErrProposalDropped #21849
Proposal forwarding has become a general problem recently. We see this come up both in #37906 and #42821. In the former, we want to make sure that a follower that's behind can't propose (and get) a lease. Allowing only the raft leader to add proposals to its log would be a way to get that (modulo "command runs on old raft leader while there's already another one").

In the latter, we had to add tricky code below raft to re-add commands under a new lease applied index if we could detect that they had not applied and were no longer applicable. This solution is basically technical debt and it has cost us numerous subtle bugs over time. We want to get rid of it, which means making sure that lease applied indexes don't generally reorder.

Again, this can be achieved with good enough accuracy by only ever adding to the local log, i.e. never forwarding proposals. It's not trivial to set this up, but ultimately I think we'll be better off. The rules would be something like this:
Since only the leaseholder proposes to Raft, the leaseholder should be able to hold or regain leadership just fine. We just have to make sure that a follower won't ever campaign against an active leaseholder.
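As a minimal sketch of the "only ever add to the local log" rule (hypothetical, using the etcd/raft `RawNode` API; `proposeLocally` is an illustrative name, not actual CockroachDB code):

```go
package main

import "go.etcd.io/etcd/raft/v3"

// proposeLocally enforces the "never forward proposals" rule: only the
// raft leader may append to its log; everyone else fails fast instead
// of letting raft forward MsgProp to the leader.
func proposeLocally(rn *raft.RawNode, data []byte) error {
	if rn.Status().RaftState != raft.StateLeader {
		// Fail fast with the same error raft uses for dropped
		// proposals, so the caller can redirect or retry.
		return raft.ErrProposalDropped
	}
	return rn.Propose(data)
}
```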
(cc @nvanbenschoten and @ajwerner since we both talked about this)
46045: kvserver: avoid hanging proposal after leader goes down r=nvanbenschoten,petermattis a=tbg

Deflakes gossip/chaos/nodes=9, i.e. Fixes #38829.

There was a bug in range quiescence due to which commands would hang in raft for minutes before actually getting replicated. This would occur whenever a range was quiesced but a follower replica which didn't know the (Raft) leader would receive a request. This request would be evaluated and put into the Raft proposal buffer, and a ready check would be enqueued. However, no Ready would be produced (since the proposal got dropped by raft; leader unknown) and so the replica would not unquiesce.

This commit prevents this by always waking up the group if the proposal buffer was initially nonempty, even if an empty Ready is produced. It goes further than that by trying to ensure that a leader is always known while quiesced. Previously, on an incoming request to quiesce, we did not verify that the raft group had learned the leader's identity.

One shortcoming here is that in the situation in which the proposal would originally hang "forever", it will now hang for one heartbeat or election timeout where ideally it would be proposed more reactively. Since this is so rare I didn't try to address this. Instead, refer to the ideas in #37906 (comment) and #21849 for future changes that could mitigate this.

Without this PR, the test would fail around 10% of the time. With this change, it passed 40 iterations in a row without a hitch, via:

./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9

Release justification: bug fix

Release note (bug fix): a rare case in which requests to a quiesced range could hang in the KV replication layer was fixed. This would manifest as a message saying "have been waiting ... for proposing" even though no loss of quorum occurred.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
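A hypothetical sketch of the wake-up rule the PR describes (`maybeUnquiesceAfterReadyCheck` and the `unquiesce` callback are illustrative names, not the actual kvserver code):

```go
package main

import "go.etcd.io/etcd/raft/v3"

// maybeUnquiesceAfterReadyCheck encodes the fix described above: if we
// flushed proposals into raft but raft produced no Ready (e.g. the
// proposals were dropped because no leader is known), wake the group
// anyway so ticking resumes and a leader is eventually elected.
func maybeUnquiesceAfterReadyCheck(rn *raft.RawNode, proposalsFlushed int, unquiesce func()) {
	if rn.HasReady() {
		return // a Ready will be handled; the group is active anyway
	}
	if proposalsFlushed > 0 {
		// Empty Ready, but we did try to propose: without this
		// wake-up the proposal would sit in a quiesced group
		// indefinitely.
		unquiesce()
	}
}
```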
Some related musings: #102956 (comment)
etcd-io/etcd#9067 introduced a new error, `ErrProposalDropped`. Properly handling this error will allow us to reduce the occurrence of ambiguous failures (`Replica.executeWriteBatch` doesn't need to return an `AmbiguousResultError` if it has not successfully proposed), and may allow us to be more intelligent in our raft-level retries.

This error alone does not allow us to eliminate time-based reproposals, because raft's MsgProp forwarding is fire-and-forget. However, if we disabled raft-level forwarding and did our own forwarding, I think we could make the retry logic more deterministic (or at least hide the timing elements in the RPC layer).
Note that in typical usage of raft, you'd respond to this error by passing it up the stack to a layer that can try again on a different replica. We can't do that because of our leases: until the lease expires, no other node could successfully handle the request, so we have to just wait and retry on the leaseholder. (We might be able to use this to make lease requests themselves fail-fast, though.)
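Under that constraint, the retry could look something like this (a hypothetical loop with an arbitrary backoff; the real reproposal machinery and lease checks are more involved):

```go
package main

import (
	"context"
	"errors"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

// proposeWithRetry keeps retrying a dropped proposal on this replica,
// since under the lease no other replica could serve the request: the
// opposite of the generic raft pattern of rerouting to another node.
func proposeWithRetry(ctx context.Context, rn *raft.RawNode, data []byte) error {
	for {
		err := rn.Propose(data)
		if !errors.Is(err, raft.ErrProposalDropped) {
			return err // proposed (nil) or a hard error
		}
		select {
		case <-time.After(100 * time.Millisecond): // arbitrary backoff
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```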
Jira issue: CRDB-5872
Epic: CRDB-39898