Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: calls to ReportUnreachable must unquiesce raft group #103828

Closed
tbg opened this issue May 24, 2023 · 2 comments · Fixed by #104212
Closed

kvserver: calls to ReportUnreachable must unquiesce raft group #103828

tbg opened this issue May 24, 2023 · 2 comments · Fixed by #104212
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented May 24, 2023

Describe the problem

See #103827.

When the lease is quiesced, and ReportUnreachable is called for a follower,
the follower will transition to StateProbe but the leader is quiesced, so has
no chance to "reset" the state should the follower prove to actually be around.

This can lead to various problems such as lease transfers being rejected1,
which would only self-resolve if the range got unquiesced for some other reason.

To Reproduce

Without #103827 and #103826 but with #99191, stressing TestStoreMetrics
reproduces the more straightforward version of the race where
ReportUnreachable is called before the group quiesces. I'm not aware of a
repro for the case described in this issue, which I derived as being possible
after having seen the TestStoreMetrics failure.

Jira issue: CRDB-28223

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/fc277ea056b049e225e162748ef483a91cacd1ec/pkg/kv/kvserver/replica_range_lease.go#L951-L958

@blathers-crl
Copy link

blathers-crl bot commented May 24, 2023

cc @cockroachdb/replication

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 24, 2023
@cockroachdb cockroachdb deleted a comment from blathers-crl bot May 24, 2023
craig bot pushed a commit that referenced this issue May 24, 2023
103738: copy: fix nil pointer in COPY telemetry logging r=rafiss a=rafiss

fixes #102494

Release note (bug fix): Fixed a panic that could occur while a COPY statement is logged for telemetry purposes.

103827: kvserver: don't quiesce with follower in StateProbe r=erikgrinaker a=tbg

shouldReplicaQuiesce checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
`rawNode.ReportUnreachable` when an outgoing message gets dropped).

Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.

Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch follower back to StateReplicate). This is because at the time of writing,
receipt of a heartbeat response from the follower is not enough to move it back
to StateReplicate.
This was fixed upstream, in #103826.

However, this is still not enough! If the range quiesces successfully, and
*then* `ReportUnreachable` is called, we still end up in the same state; this is now tracked in #103828.

I ran into the above issue on
#99191, which adds persistent
circuit breakers, when stressing `TestStoreMetrics`. That test happens to
restart n2 when it's fully caught up and due to the persistence of the circuit
breakers when it comes up the leader will move it into StateProbe (since we can
end up dropping the first heartbeat sent to it as it comes up, since the breaker
hasn't untripped yet).

But, I believe that this bug is real even without this breaker re-work, just
harder to trigger.

Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe"


Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@tbg
Copy link
Member Author

tbg commented May 25, 2023

Maybe a simpler thing to do is to turn ReportUnreachable on a quiesced follower into a no-op, i.e. not actually call it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants