kvserver: consider making server.consistency_check.max_rate private #83190

Comments
@mwang1026 there is potentially a larger docs task we could take on here, e.g. adding a page called "how to run REALLY big clusters", but tbh I would recommend against it, since I suspect the cost/benefit is not really worth it. Pareto says most users (90%? 99%?) don't need to run such large clusters, it would be a BEAR to keep updated, and it would require help from the engineering team basically every release. And it probably wouldn't help much anyway, since every large user probably has pretty custom needs based on their setup. My random $0.02
Submitted #83332 to make it public, since it's occasionally useful -- in particular, to disable the consistency checker if it ends up misbehaving (see e.g. #77432).
cc @cockroachdb/replication
@erikgrinaker, can you or someone else in the know please open a docs issue for us to add some guidance around when and why to use these settings? Ideally, any setting we make public and expose in the generated docs should have some guidance. Perhaps it's a documented known limitation with a workaround? Or perhaps it's a troubleshooting doc that starts with the symptom caused by the "starvation" you mention in the PR.
@jseldess I filed cockroachdb/docs#14456 to capture the fact that some more docs work is needed here. Assigned to myself to follow up.
There's still some uncertainty here around whether we want to make these settings private or public. My preference is to make them public, but deferring to @mwang1026.
@erikgrinaker if we do make them public, what is our recommendation for when you might change them? And what safe values might be?
Depends on how much spare disk bandwidth you have, how much data you have, and how often you want consistency checks to make a full pass across all of the data.

For example, if you have 12 TB of logical data on 4 nodes, and you want to make sure the consistency check gets through all of it in 24 hours, then you'll need to set the setting to:

    12 TB / 4 nodes / 86400 s ≈ 34.7 MB/s

Of course, this assumes that you actually have 34.7 MB/s of spare disk bandwidth not needed by the workload. If you leave it at the default 8 MB/s setting, then the consistency checks will take this long to get through all of the data:

    12 TB / 4 nodes / 8 MB/s = 375,000 s ≈ 4.3 days

This means that, in this particular scenario, it may take over 4 days to discover silent disk corruption (or a CRDB data corruption bug). The longer this period is, the higher the chance that the corruption spreads from a leaseholder to new follower replicas.

That said, we're planning on removing this limit outright and letting admission control consume as much of the system's spare resources as possible (but no more).
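The sizing arithmetic above can be sketched as a small script. This is an illustrative calculation only, not a CockroachDB API; the function names and constants are made up for the example:

```python
# Illustrative sizing math for server.consistency_check.max_rate.
# Names here (required_rate, pass_duration) are hypothetical helpers,
# not part of CockroachDB.

TB = 1e12  # bytes, decimal units as used in the comment above
MB = 1e6


def required_rate(logical_bytes: float, nodes: int, pass_seconds: float) -> float:
    """Per-node scan rate (bytes/s) needed to finish a full pass in time."""
    return logical_bytes / nodes / pass_seconds


def pass_duration(logical_bytes: float, nodes: int, rate: float) -> float:
    """Seconds a full consistency-check pass takes at a given per-node rate."""
    return logical_bytes / nodes / rate


# 12 TB of logical data over 4 nodes, full pass within 24 hours:
rate = required_rate(12 * TB, 4, 86_400)
print(f"required rate: {rate / MB:.1f} MB/s")  # ~34.7 MB/s

# At the default 8 MB/s, the same pass takes much longer:
secs = pass_duration(12 * TB, 4, 8 * MB)
print(f"pass duration: {secs:.0f} s = {secs / 86_400:.1f} days")  # 375000 s, ~4.3 days
```

Plugging in your own data volume, node count, and target pass time gives the per-node rate to set (assuming the workload can actually spare that much disk bandwidth).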
As of today (2022-06-22) our cluster settings docs have an entry for server.consistency_check.max_rate that reads [...]

The other setting it refers to, server.consistency_check.interval, is private. However, it appears you need to use them both together?

The commit message where this setting was added says [...]
As usual, this creates a tradeoff. We are getting some questions like "what does this setting mean" and "why is the other setting it refers to not listed in the docs", but this seems like a setting that most users should probably not be touching unless they are quite expert at running large CockroachDB clusters and/or working closely with our support team. (Guessing this number of users can probably be counted on one's fingers?)
An (easier, faster, but not necessarily better) alternative to hiding the setting is to expose the other setting it references, server.consistency_check.interval, which would at least give us consistency (no pun intended) within the cluster settings page. And it would be fast to do. :-)

Jira issue: CRDB-16913
gz#20417
Epic CRDB-39898