kvserver: don't quiesce with follower in StateProbe #103827

Merged (1 commit) May 24, 2023

Conversation

@tbg (Member) commented May 24, 2023

`shouldReplicaQuiesce` checks that all followers are fully caught up, but it's
still possible to have a follower in StateProbe (because we call
`rawNode.ReportUnreachable` when an outgoing message gets dropped).

Persistent StateProbe is problematic: we check for it before lease transfers, so
despite the follower being fully caught up, we'd refuse the transfer.
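
As a rough illustration (not the actual CockroachDB code; `canQuiesce`, `lastIndex`, and the direct use of the etcd raft `Status`/`tracker` types are assumptions for the sketch), the kind of check this commit adds to the quiescence decision looks roughly like this: a follower that is fully caught up on the log but still in StateProbe should block quiescence.

```go
package main

import (
	"fmt"

	raft "go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/tracker"
)

// canQuiesce is a hypothetical stand-in for the quiescence check: it reports
// whether a leader with the given raft status may quiesce, requiring every
// follower to be caught up to lastIndex *and* out of StateProbe.
func canQuiesce(st raft.Status, lastIndex uint64) bool {
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // skip the leader itself
		}
		if pr.Match < lastIndex {
			return false // follower is missing log entries
		}
		if pr.State == tracker.StateProbe {
			// A caught-up follower stuck in StateProbe still blocks quiescence:
			// a quiesced range would never replicate the new entry needed to
			// move it back to StateReplicate, and lease transfers to it would
			// be refused.
			return false
		}
	}
	return true
}

func main() {
	st := raft.Status{
		Progress: map[uint64]tracker.Progress{
			1: {Match: 10, State: tracker.StateReplicate},
			2: {Match: 10, State: tracker.StateProbe}, // caught up, but probing
		},
	}
	st.ID = 1
	fmt.Println(canQuiesce(st, 10)) // false: replica 2 is in StateProbe
}
```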

Unfortunately, this commit itself is not enough: even if the range is not
quiesced, it needs to replicate a new entry to rectify the situation (i.e.
switch the follower back to StateReplicate). This is because, at the time of
writing, receipt of a heartbeat response from the follower is not enough to
move it back to StateReplicate.
This was fixed upstream, in #103826.
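
To make the previous paragraph concrete: in etcd/raft a caught-up follower only leaves StateProbe when the leader processes a MsgAppResp for a fresh append, not when it processes a heartbeat response, so even an unquiesced range stays stuck until something new is appended. A hedged sketch of forcing that round trip by proposing an empty entry (the `nudgeProber` helper is hypothetical; #103826 instead fixes this inside raft so that a heartbeat response suffices):

```go
package example

import raft "go.etcd.io/raft/v3"

// nudgeProber proposes an empty entry on the leader's RawNode. The resulting
// MsgApp/MsgAppResp exchange is what lets the leader move a
// probing-but-caught-up follower back to StateReplicate; heartbeat responses
// alone don't do that (at the time of writing, before the upstream fix).
func nudgeProber(rn *raft.RawNode) error {
	return rn.Propose(nil)
}
```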

However, this is still not enough! If the range quiesces successfully, and
then ReportUnreachable is called, we still end up in the same state; this is now tracked in #103828.

I ran into the above issue on
#99191, which adds persistent
circuit breakers, while stressing `TestStoreMetrics`. That test happens to
restart n2 while it's fully caught up, and because the circuit breakers now
persist across restarts, the leader moves n2 into StateProbe when it comes back
up (the first heartbeat sent to it can be dropped because the breaker hasn't
untripped yet).

But I believe this bug is real even without the breaker re-work, just
harder to trigger.

Epic: none
Release note (bug fix): fixed a problem that could lead to erroneously refused
lease transfers (error message: "refusing to transfer lease to [...] because
target may need a Raft snapshot: replica in StateProbe").

@cockroach-teamcity (Member):

This change is Reviewable

@tbg tbg requested a review from erikgrinaker May 24, 2023 10:26
@tbg tbg marked this pull request as ready for review May 24, 2023 10:39
@tbg tbg requested a review from a team May 24, 2023 10:39
@tbg tbg requested a review from a team as a code owner May 24, 2023 10:39
@erikgrinaker (Contributor) left a comment

LGTM, thanks! Backportable?

@tbg tbg added the backport-22.2.x and backport-23.1.x labels May 24, 2023
@tbg (Member, Author) commented May 24, 2023

TFTR!
bors r=erikgrinaker

Backports should be fine, I'll give it a try.

Test failures in acceptance/version-upgrade are unrelated:

Error: ***UNEXPECTED ERROR; Received an unexpected execution error.: ERROR: index element ?c😀ol6_9 of type jsonb is not indexable in a non-inverted index (SQLSTATE 42P16)

Error: ***UNEXPECTED ERROR; Received an unexpected execution error.: ERROR: index element "c%vol9"_10 of type jsonb is not indexable in a non-inverted index (SQLSTATE 42P16)

@craig (bot) commented May 24, 2023

Build failed:

@tbg (Member, Author) commented May 24, 2023

I am stumped by how I've gotten this mixed-version error three times in a row. The test has been flaky for a while, but I seem to be getting exceptionally unlucky (slack thread [1]).
Going to run it a few more times locally just to check there can't possibly be anything in this PR triggering this (not sure how that would even be possible).

Footnotes

  1. https://cockroachlabs.slack.com/archives/C016CAD2HQ8/p1684931703555999

@tbg (Member, Author) commented May 24, 2023

bors r=erikgrinaker

Passes on gceworker.

@craig (bot) commented May 24, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@tbg (Member, Author) commented May 24, 2023

bors r=erikgrinaker

@craig (bot) commented May 24, 2023

Already running a review

@craig (bot) commented May 24, 2023

Build succeeded:

@craig craig bot merged commit c83a0bb into cockroachdb:master May 24, 2023
@blathers-crl (bot) commented May 24, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0e608b0 to blathers/backport-release-22.2-103827: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
