Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: limit concurrent consistency checks #77433

Merged
merged 1 commit into from
Mar 10, 2022

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Mar 7, 2022

Consistency checks run asynchronously below Raft on all replicas using a
background context, but sharing a common per-store rate limit. It was
possible for many concurrent checks to build up over time, which would
reduce the rate available to each check, exacerbating the problem and
preventing any of them from completing in a reasonable time.

This patch adds two new limits for asynchronous consistency checks:

  • consistencyCheckAsyncConcurrency: limits the number of concurrent
    asynchronous consistency checks to 7 per store. Additional consistency
    checks will error.

  • consistencyCheckAsyncTimeout: sets an upper timeout of 1 hour for
    each consistency check, to prevent them from running forever.

CHECK_STATS checks are exempt from the limit, as these are cheap and
the DistSender will run these in parallel across ranges, e.g. via
crdb_internal.check_consistency().

There are likely better solutions to this problem, but this is a simple
backportable stopgap.

Touches #77432.
Resolves #77080.

Release justification: fixes for high-priority or high-severity bugs in
existing functionality.

Release note (bug fix): Added a limit of 7 concurrent asynchronous
consistency checks per store, with an upper timeout of 1 hour. This
prevents abandoned consistency checks from building up in some
circumstances, which could lead to increasing disk usage as they held
onto Pebble snapshots.

@erikgrinaker erikgrinaker requested review from tbg and aliher1911 March 7, 2022 14:25
@erikgrinaker erikgrinaker self-assigned this Mar 7, 2022
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 7, 2022 14:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Mar 9, 2022

I had to increase the limit here from 4 to 7, since I realized that all nodes schedule consistency checks on other follower nodes. Consider e.g. a 7-node cluster with a replication factor of 7: this is expected to run 7 concurrent consistency checks on every node. Beyond 10 these checks are always going to fail (because of the timeout and rate limit), so 7 seems like a reasonable and conservative choice.

@nicktrav
Copy link
Collaborator

nicktrav commented Mar 9, 2022

Fwiw - this probably isn't unexpected, but this appears to solve (or at least mitigate) an issue we've been seeing with the clearrange roachtests (#77080). Thanks for the fix!

@erikgrinaker
Copy link
Contributor Author

Fwiw - this probably isn't unexpected, but this appears to solve (or at least mitigate) an issue we've been seeing with the clearrange roachtests (#77080). Thanks for the fix!

Great to hear! I have some test failures to figure out, but will get this merged asap.

Consistency checks run asynchronously below Raft on all replicas using a
background context, but sharing a common per-store rate limit. It was
possible for many concurrent checks to build up over time, which would
reduce the rate available to each check, exacerbating the problem and
preventing any of them from completing in a reasonable time.

This patch adds two new limits for asynchronous consistency checks:

* `consistencyCheckAsyncConcurrency`: limits the number of concurrent
  asynchronous consistency checks to 7 per store. Additional consistency
  checks will error.

* `consistencyCheckAsyncTimeout`: sets an upper timeout of 1 hour for
  each consistency check, to prevent them from running forever.

`CHECK_STATS` checks are exempt from the limit, as these are cheap and
the DistSender will run these in parallel across ranges, e.g. via
`crdb_internal.check_consistency()`.

There are likely better solutions to this problem, but this is a simple
backportable stopgap.

Release justification: fixes for high-priority or high-severity bugs in
existing functionality.

Release note (bug fix): Added a limit of 7 concurrent asynchronous
consistency checks per store, with an upper timeout of 1 hour. This
prevents abandoned consistency checks from building up in some
circumstances, which could lead to increasing disk usage as they held
onto Pebble snapshots.
@erikgrinaker
Copy link
Contributor Author

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Mar 10, 2022

Build succeeded:

@craig craig bot merged commit c985a38 into cockroachdb:master Mar 10, 2022
@blathers-crl
Copy link

blathers-crl bot commented Mar 10, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating backport branch refs/heads/blathers/backport-release-21.1-77433: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []

Backport to branch 21.1.x failed. See errors above.


error creating backport branch refs/heads/blathers/backport-release-21.2-77433: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: clearrange/checks=true failed
4 participants