-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: calls to ReportUnreachable must unquiesce raft group #103828
Labels
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Comments
cc @cockroachdb/replication |
tbg
added
the
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
label
May 24, 2023
craig bot
pushed a commit
that referenced
this issue
May 24, 2023
103738: copy: fix nil pointer in COPY telemetry logging r=rafiss a=rafiss fixes #102494 Release note (bug fix): Fixed a panic that could occur while a COPY statement is logged for telemetry purposes. 103827: kvserver: don't quiesce with follower in StateProbe r=erikgrinaker a=tbg shouldReplicaQuiesce checks that all followers are fully caught up, but it's still possible to have a follower in StateProbe (because we call `rawNode.ReportUnreachable` when an outgoing message gets dropped). Persistent StateProbe is problematic: we check for it before lease transfers, so despite the follower being fully caught up, we'd refuse the transfer. Unfortunately, this commit itself is not enough: even if the range is not quiesced, it needs to replicate a new entry to rectify the situation (i.e. switch follower back to StateReplicate). This is because at the time of writing, receipt of a heartbeat response from the follower is not enough to move it back to StateReplicate. This was fixed upstream, in #103826. However, this is still not enough! If the range quiesces successfully, and *then* `ReportUnreachable` is called, we still end up in the same state; this is now tracked in #103828. I ran into the above issue on #99191, which adds persistent circuit breakers, when stressing `TestStoreMetrics`. That test happens to restart n2 when it's fully caught up and due to the persistence of the circuit breakers when it comes up the leader will move it into StateProbe (since we can end up dropping the first heartbeat sent to it as it comes up, since the breaker hasn't untripped yet). But, I believe that this bug is real even without this breaker re-work, just harder to trigger. Epic: none Release note (bug fix): fixed a problem that could lead to erroneously refused lease transfers (error message: "refusing to transfer lease to [...] because target may need a Raft snapshot: replica in StateProbe" Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Tobias Grieger <[email protected]>
Maybe a simpler thing to do is to turn |
This was referenced Jul 6, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Describe the problem
See #103827.
When the lease is quiesced, and
ReportUnreachable
is called for a follower,the follower will transition to
StateProbe
but the leader is quiesced, so hasno chance to "reset" the state should the follower prove to actually be around.
This can lead to various problems such as lease transfers being rejected1,
which would only self-resolve if the range got unquiesced for some other reason.
To Reproduce
Without #103827 and #103826 but with #99191, stressing
TestStoreMetrics
reproduces the more straightforward version of the race where
ReportUnreachable
is called before the group quiesces. I'm not aware of arepro for the case described in this issue, which I derived as being possible
after having seen the
TestStoreMetrics
failure.Jira issue: CRDB-28223
Footnotes
https://github.com/cockroachdb/cockroach/blob/fc277ea056b049e225e162748ef483a91cacd1ec/pkg/kv/kvserver/replica_range_lease.go#L951-L958 ↩
The text was updated successfully, but these errors were encountered: