-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: don't quiesce with follower in StateProbe #103827
Conversation
shouldReplicaQuiesce checks that all followers are fully caught up, but it's still possible to have a follower in StateProbe (because we call `rawNode.ReportUnreachable` when an outgoing message gets dropped). Persistent StateProbe is problematic: we check for it before lease transfers, so despite the follower being fully caught up, we'd refuse the transfer. Unfortunately, this commit itself is not enough: even if the range is not quiesced, it needs to replicate a new entry to rectify the situation (i.e. switch follower back to StateReplicate). This is because at the time of writing, receipt of a heartbeat response from the follower is not enough to move it back to StateReplicate. This was fixed upstream, in cockroachdb#103826. However, this is still not enough! If the range quiesces successfully, and *then* `ReportUnreachable` is called, we still end up in the same state. TODO file issue about this. I ran into the above issue on cockroachdb#99191, which adds persistent circuit breakers, when stressing `TestStoreMetrics`. That test happens to restart n2 when it's fully caught up and due to the persistence of the circuit breakers when it comes up the leader will move it into StateProbe (since we can end up dropping the first heartbeat sent to it as it comes up, since the breaker hasn't untripped yet). But, I believe that this bug is real even without this breaker re-work, just harder to trigger. Epic: none Release note (bug fix): fixed a problem that could lead to erroneously refused lease transfers (error message: "refusing to transfer lease to [...] because target may need a Raft snapshot: replica in StateProbe"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! Backportable?
TFTR! Backports should be fine, I'll give it a try. Test failures in acceptance/version-upgrade are unrelated:
|
Build failed: |
I am stumped by how I've gotten this mixed version error three times in a row. The test has been flaky for a while, but I seem to be getting exceptionally unlucky (slack thread1). Footnotes |
bors r=erikgrinaker Passes on gceworker. |
This PR was included in a batch that was canceled, it will be automatically retried |
bors r=erikgrinaker |
Already running a review |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 0e608b0 to blathers/backport-release-22.2-103827: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
rawNode.ReportUnreachable
when an outgoing message gets dropped).Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.
Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in #103826.
However, this is still not enough! If the range quiesces successfully, and
then
ReportUnreachable
is called, we still end up in the same state; this is now tracked in #103828.I ran into the above issue on
#99191, which adds persistent
circuit breakers, when stressing
TestStoreMetrics
. That test happens torestart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).
But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.
Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"