kvserver: avoid hanging proposal after leader goes down #46045
Conversation
Deflake `gossip/chaos` by adding a missing `waitForFullReplication`.

This test loops, killing a node and then verifying that the remaining nodes in the cluster stabilize on the same view of gossip connectivity. Periodically the test was failing because gossip wasn't stabilizing. The root issue was that the SQL query retrieving the gossip connectivity from one node was hanging, and that query was hanging because a range was unavailable. Logs showed that the leaseholder for that range was on a down node and that the range only seemed to contain a single replica. This could happen near the start of the test if we started killing nodes before full replication was achieved.

Fixes cockroachdb#38829

Release note: None
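For readers who haven't seen the helper, the gist of the missing step is sketched below. This is not the actual roachtest helper; the SQL query, the replication factor of three, and the polling interval are assumptions used for illustration.

```go
package main

import (
	"database/sql"
	"time"
)

// waitForFullReplication (sketch only, not the real roachtest helper): poll
// any node until no range reports fewer than three replicas, so that the
// chaos loop cannot kill a node while some range still has a lone replica.
func waitForFullReplication(db *sql.DB) error {
	for {
		var underReplicated int
		// crdb_internal.ranges exposes the replica IDs of each range as an array.
		if err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.ranges WHERE array_length(replicas, 1) < 3`,
		).Scan(&underReplicated); err != nil {
			return err
		}
		if underReplicated == 0 {
			return nil
		}
		time.Sleep(time.Second)
	}
}
```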
Release justification: testing change
Release note: None

Release justification: comment-only change
Release note: None
Thanks for tracking this down! Might want to get another set of #kv eyes on this in order to spread the knowledge.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
nice find! How far back do we need to backport this?
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica_raft.go, line 415 at r4 (raw file):
    err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
        numFlushed := r.mu.proposalBuf.Len()
        if err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup); err != nil {

I would lean towards returning the number of commands that were flushed from this function (i.e. the count used in `FlushLockedWithRaftGroup`). That avoids extra atomic accesses that seem a little racy, though I can't construct a scenario where anything actually goes wrong.
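For concreteness, the suggested shape would look roughly like this (a sketch against the snippet above; the real method signatures may differ):

```go
hadProposals := false
err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
	// Hypothetical signature: the flush itself reports how many proposals it
	// handed to the raft group, so no separate Len() read is needed.
	numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup)
	if err != nil {
		return false, err
	}
	hadProposals = numFlushed > 0
	// The boolean result asks withRaftGroupLocked to unquiesce and wake the
	// leader; do so whenever something was actually proposed.
	return hadProposals, nil
})
```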
pkg/kv/kvserver/replica_raft.go, line 415 at r4 (raw file):
    err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
        numFlushed := r.mu.proposalBuf.Len()
        if err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup); err != nil {

An alternative approach is to propagate `raft.ErrProposalDropped` errors up to here and unquiesce in that case as well. I'll defer to your preference.
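The alternative, sketched under the same caveat (a hypothetical shape, not the change that was ultimately committed):

```go
err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
	// Assumes FlushLockedWithRaftGroup is changed to return
	// raft.ErrProposalDropped rather than swallow it.
	if err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup); err != nil {
		if err == raft.ErrProposalDropped {
			// Raft dropped the proposal because no leader is known; unquiesce
			// and wake the group so an election can establish one and the
			// command can be reproposed.
			return true, nil
		}
		return false, err
	}
	return false, nil
})
```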
pkg/kv/kvserver/store_raft.go, line 230 at r4 (raw file):
    // quiescing if there's outstanding work.
    r.mu.Lock()
    status := r.raftStatusRLocked()

Would the `BasicStatus()` do?
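For reference, `RawNode.BasicStatus()` in etcd/raft returns the `HardState` and `SoftState` (so the leader ID and raft state) without copying the per-follower `Progress` map, which is the expensive part of `Status()`. Something like the following would suffice for a leader-known check (a sketch against the etcd/raft API, not the kvserver code):

```go
import "go.etcd.io/etcd/raft"

// leaderKnown checks leader knowledge without the allocation cost of Status().
func leaderKnown(raftGroup *raft.RawNode) bool {
	bs := raftGroup.BasicStatus() // no Progress map is built or copied
	return bs.Lead != raft.None   // Lead comes from the embedded SoftState
}
```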
Force-pushed from 903f1e5 to 71e5ab5.
There was a bug in range quiescence due to which commands would hang in raft for minutes before actually getting replicated. This would occur whenever a range was quiesced but a follower replica which didn't know the (Raft) leader would receive a request. This request would be evaluated and put into the Raft proposal buffer, and a ready check would be enqueued. However, no ready would be produced (since the proposal got dropped by raft; leader unknown) and so the replica would not unquiesce.

This commit prevents this by always waking up the group if the proposal buffer was initially nonempty, even if an empty Ready is produced. It goes further than that by trying to ensure that a leader is always known while quiesced. Previously, on an incoming request to quiesce, we did not verify that the raft group had learned the leader's identity.

One shortcoming here is that in the situation in which the proposal would originally hang "forever", it will now hang for one heartbeat timeout where ideally it would be proposed more reactively. Since this is so rare I didn't try to address this. Instead, refer to the ideas in cockroachdb#37906 (comment) and cockroachdb#21849 for future changes that could mitigate this.

Without this PR, the test would fail around 10% of the time. With this change, it passed 40 iterations in a row without a hitch, via:

    ./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9

Release justification: bug fix

Release note (bug fix): a rare case in which requests to a quiesced range could hang in the KV replication layer was fixed. This would manifest as a message saying "have been waiting ... for proposing" even though no loss of quorum occurred.
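In code terms, the first part of the fix (waking the group when the proposal buffer was nonempty) looks roughly like this. This is a sketch using the names from the review snippets above, not the literal diff:

```go
var hadProposals bool
err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
	hadProposals = r.mu.proposalBuf.Len() > 0
	if err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup); err != nil {
		return false, err
	}
	// Unquiesce whenever the buffer held proposals, even if raft dropped them
	// (leader unknown) and therefore produces no Ready this cycle. Waking the
	// group lets an election or heartbeat surface a leader, after which the
	// command is reproposed instead of hanging.
	return hadProposals, nil
})
```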
bors r=nvanbenschoten,petermattis
How far back do we need to backport this?
I don't see why this wasn't always a problem (though the roachtest didn't always fail...)
I'll definitely backport to 19.2 and there should be a similar simplified patch we can make to 19.1.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/replica_raft.go, line 415 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
An alternative approach is to propagate `raft.ErrProposalDropped` errors up to here and unquiesce in that case as well. I'll defer to your preference.
I'll leave as is.
bors pls

bors r=nvanbenschoten,petermattis
Build succeeded
Reviewed 4 of 4 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
Deflakes gossip/chaos/nodes=9, i.e.
Fixes #38829.
There was a bug in range quiescence due to which commands would hang in
raft for minutes before actually getting replicated. This would occur
whenever a range was quiesced but a follower replica which didn't know
the (Raft) leader would receive a request. This request would be
evaluated and put into the Raft proposal buffer, and a ready check would
be enqueued. However, no ready would be produced (since the proposal got
dropped by raft; leader unknown) and so the replica would not unquiesce.
This commit prevents this by always waking up the group if the proposal
buffer was initially nonempty, even if an empty Ready is produced.
It goes further than that by trying to ensure that a leader is always
known while quiesced. Previously, on an incoming request to quiesce, we
did not verify that the raft group had learned the leader's identity.
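That second safeguard can be pictured as a check like the one below (a hypothetical helper for illustration, not the actual kvserver function):

```go
import "go.etcd.io/etcd/raft"

// shouldQuiesce: only honor a quiesce request if the local raft group knows
// its leader; otherwise stay awake so a later proposal isn't silently dropped
// while the range sleeps.
func shouldQuiesce(status raft.BasicStatus) bool {
	return status.Lead != raft.None // Lead comes from the embedded SoftState
}
```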
One shortcoming here is that in the situation in which the proposal
would originally hang "forever", it will now hang for one heartbeat
or election timeout where ideally it would be proposed more reactively. Since
this is so rare I didn't try to address this. Instead, refer to
the ideas in
#37906 (comment)
and
#21849
for future changes that could mitigate this.
Without this PR, the test would fail around 10% of the time. With this
change, it passed 40 iterations in a row without a hitch, via:

    ./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9
Release justification: bug fix
Release note (bug fix): a rare case in which requests to a quiesced
range could hang in the KV replication layer was fixed. This would
manifest as a message saying "have been waiting ... for proposing" even
though no loss of quorum occurred.