Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: kv: retry liveness heartbeat on race with insufficient expiration #130623

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Sep 12, 2024

Backport 1/1 commits from #124885.

Fixes #130492.

/cc @cockroachdb/release

Release justification: bug fix


Fixes #124693.
Fixes #125287.

This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in #123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough.

We opt to allow this race and retry across it instead of detecting it and returning an error from NodeLiveness.Heartbeat because:

  1. returning an error would have a larger blast radius and could cause other issues.
  2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one.

Before this commit, TestLeaseQueueProactiveEnqueueOnPreferences would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs.

This makes #123442 even uglier. The nicer solution is #125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix.

This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior.

Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible.

Fixes cockroachdb#124693.
Fixes cockroachdb#125287.

This commit adds logic to retry the synchronous liveness heartbeat which
ensures that the liveness record has a later expiration than the prior
lease by the time the lease promotion goes into effect, which was added
in cockroachdb#123442. This heartbeat may need to be retried because it may have
raced with another liveness heartbeat which did not extend the liveness
expiration far enough.

We opt to allow this race and retry across it instead of detecting it
and returning an error from `NodeLiveness.Heartbeat` because:
1. returning an error would have a larger blast radius and could cause
   other issues.
2. returning an error wouldn't actually fix the tests that are failing,
   because they would still get an error, just a different one.

Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would
hit this case fail under deadlock and stress every ~150 iterations.
After this commit, the test passes continuously under deadlock stress
for over 2000 runs.

This makes cockroachdb#123442 even uglier. The nicer solution is cockroachdb#125235, but that
is not backportable. Still, we're planning to address that issue as part
of the leader leases work, so this is a temporary fix.

This also removes a TODO added in 1dc18df. As mentioned above, we don't
address it, but instead document the behavior.

Release note (bug fix): resolves a concerning log message that says
"expiration of liveness record ... is not greater than expiration of the
previous lease ... after liveness heartbeat". This message is no longer
possible.
Copy link

blathers-crl bot commented Sep 12, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Sep 12, 2024
Copy link

blathers-crl bot commented Sep 12, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten merged commit 4124ac5 into cockroachdb:release-23.1 Sep 13, 2024
6 checks passed
@nvanbenschoten nvanbenschoten deleted the backport23.1-124885 branch September 19, 2024 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants