kvserver: limit concurrent consistency checks #77433
Thanks!
I had to increase the limit here from 4 to 7, since I realized that all nodes schedule consistency checks on other follower nodes. Consider e.g. a 7-node cluster with a replication factor of 7: this is expected to run 7 concurrent consistency checks on every node. Beyond 10 these checks are always going to fail (because of the timeout and rate limit), so 7 seems like a reasonable and conservative choice.
Fwiw - this probably isn't unexpected, but this appears to solve (or at least mitigate) an issue we've been seeing with the clearrange roachtests (#77080). Thanks for the fix!
Great to hear! I have some test failures to figure out, but will get this merged asap.
TFTR! bors r=tbg
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating backport branch refs/heads/blathers/backport-release-21.1-77433: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 21.1.x failed. See errors above.

error creating backport branch refs/heads/blathers/backport-release-21.2-77433: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Consistency checks run asynchronously below Raft on all replicas using a
background context, but sharing a common per-store rate limit. It was
possible for many concurrent checks to build up over time, which would
reduce the rate available to each check, exacerbating the problem and
preventing any of them from completing in a reasonable time.
This patch adds two new limits for asynchronous consistency checks:

* `consistencyCheckAsyncConcurrency`: limits the number of concurrent
  asynchronous consistency checks to 7 per store. Additional consistency
  checks will error.
* `consistencyCheckAsyncTimeout`: sets an upper timeout of 1 hour for
  each consistency check, to prevent them from running forever.

`CHECK_STATS` checks are exempt from the limit, as these are cheap and
the DistSender will run these in parallel across ranges, e.g. via
`crdb_internal.check_consistency()`.

There are likely better solutions to this problem, but this is a simple
backportable stopgap.
Touches #77432.
Resolves #77080.
Release justification: fixes for high-priority or high-severity bugs in
existing functionality.
Release note (bug fix): Added a limit of 7 concurrent asynchronous
consistency checks per store, with an upper timeout of 1 hour. This
prevents abandoned consistency checks from building up in some
circumstances, which could lead to increasing disk usage as they held
onto Pebble snapshots.