release-22.2: kvserver: don't quiesce with follower in StateProbe #103877
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport 1/1 commits from #103827.
/cc @cockroachdb/release
Release justification: partially avoids a bug that could lead to lease transfers being rejected.
shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
rawNode.ReportUnreachable
when an outgoing message gets dropped).Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.
Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in #103826.
However, this is still not enough! If the range quiesces successfully, and
then
ReportUnreachable
is called, we still end up in the same state; this is now tracked in #103828.I ran into the above issue on
#99191, which adds persistent
circuit breakers, when stressing
TestStoreMetrics
. That test happens torestart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).
But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.
Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"