
kvserver: consider making server.consistency_check.max_rate private #83190

Open
rmloveland opened this issue Jun 22, 2022 · 8 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv KV Team

Comments

@rmloveland
Collaborator

rmloveland commented Jun 22, 2022

As of today (2022-06-22) our cluster settings docs have an entry for server.consistency_check.max_rate that reads

the rate limit (bytes/sec) to use for consistency checks; used in conjunction with server.consistency_check.interval to control the frequency of consistency checks. Note that setting this too high can negatively impact performance.

The other setting it refers to, server.consistency_check.interval, is private. However, it appears you need to use them both together?
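For reference, here's a rough sketch of how the two would be tuned together today (Go via database/sql against a CockroachDB node; the connection string, 24h interval, and 32 MiB rate are illustrative values, not recommendations):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Illustrative connection string; point it at your own cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The two settings are tuned together: the interval says how often a full
	// consistency pass should happen, and max_rate bounds the disk bandwidth
	// spent getting there. Example values only, not recommendations.
	for _, stmt := range []string{
		`SET CLUSTER SETTING server.consistency_check.interval = '24h'`,
		`SET CLUSTER SETTING server.consistency_check.max_rate = '32MiB'`,
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}
```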

The commit message where this setting was added says

This control is necessary to ensure smooth performance on a cluster with large node sizes (i.e. in the 10TB+ range)

As usual, this creates a tradeoff. We are getting some questions along the lines of "what does this setting mean?" and "why is the other setting it refers to not listed in the docs?", but this seems like a setting that most users should probably not be touching unless they are quite expert at running large CockroachDB clusters and/or working closely with our support team. (Guessing this number of users can probably be counted on one's fingers?)

An (easier, faster, but not necessarily better) alternative to hiding the setting is to expose the other setting it references, server.consistency_check.interval, which would at least give us consistency (no pun intended) within the cluster settings page. And it would be fast to do. :-)

Jira issue: CRDB-16913

gz#20417

Epic CRDB-39898

@rmloveland rmloveland added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 22, 2022
@rmloveland
Collaborator Author

@mwang1026 there is potentially a larger docs task we could take on here, e.g. adding a page called "how to run REALLY big clusters", but tbh I would recommend against it since I suspect the cost/benefit is not really worth it. Pareto says most users (90%? 99%?) don't need to run such large clusters, it would be a BEAR to keep updated, and it would require help from the engineering team just about every release. And it probably wouldn't help much anyway, since every large user probably has pretty custom needs based on their setup. My random $0.02

@erikgrinaker
Contributor

An (easier, faster, but not necessarily better) alternative to hiding the setting is to expose the other setting it references, server.consistency_check.interval

Submitted #83332 to make it public, since it's occasionally useful -- in particular, to disable the consistency checker if it ends up misbehaving (see e.g. #77432).

@erikgrinaker erikgrinaker added the A-kv-replication Relating to Raft, consensus, and coordination. label Jun 24, 2022
@blathers-crl

blathers-crl bot commented Jun 24, 2022

cc @cockroachdb/replication

@erikgrinaker erikgrinaker added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Jun 24, 2022
@erikgrinaker erikgrinaker changed the title Consider hiding server.consistency_check.max_rate from docs kvserver: consider making server.consistency_check.max_rate private Jun 24, 2022
@jseldess
Contributor

jseldess commented Jun 24, 2022

@erikgrinaker, can you or someone else in the know please open a docs issue for us to add some guidance around when and why to use these settings? Ideally, any setting we make public and expose in the generated docs should have some guidance. Perhaps it's a documented known limitation with a workaround? Or perhaps it's a troubleshooting doc that starts with the symptom noticed, which is caused by the "starvation" you mention in the PR.

@rmloveland
Collaborator Author

@jseldess I filed cockroachdb/docs#14456 to capture the fact that some more docs work is needed here. Assigned it to myself to follow up.

@erikgrinaker
Contributor

There's still some uncertainty here around whether we want to make these settings private or public. My preference is to make them public, but deferring to @mwang1026.

@mwang1026

@erikgrinaker if we do make them public, what is our recommendation for when you might change them? And what might safe values be?

@erikgrinaker
Contributor

what is our recommendation for when you might change them? And what might safe values be?

Depends on how much spare disk bandwidth you have, how much data you have, and how often you want consistency checks to make a full pass across all of the data.

For example, if you have 12 TB of logical data on 4 nodes, and you want to make sure the consistency check gets through all of it in 24 hours, then you'll need to set server.consistency_check.max_rate to:

12 TB / 4 nodes / 86400 s = 34.7 MB/s

Of course, this assumes that you actually have 34.7 MB/s of spare disk bandwidth not needed by the workload. If you leave it at the default 8 MB/s setting, then the consistency checks will take this long to get through all of the data:

12 TB / 4 nodes / 8 MB/s = 375000 s = 4.3 days

This means that, in this particular scenario, it may take over 4 days to discover silent disk corruption (or a CRDB data corruption bug). The longer this period is, the higher the chance that the corruption might spread from a leaseholder to new follower replicas.
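For anyone who wants to plug in their own numbers, here's a quick back-of-the-envelope sketch of that arithmetic in Go (the 12 TB / 4 nodes / 24 h / 8 MB/s figures are just the example above):

```go
package main

import "fmt"

// A minimal sketch of the arithmetic in this comment. The figures
// (12 TB of logical data, 4 nodes, a 24-hour target, the 8 MB/s default)
// come from the example above; adjust them for your own cluster.
func main() {
	const (
		logicalBytes   = 12e12       // 12 TB of logical data across the cluster
		nodes          = 4.0         // number of nodes
		targetSeconds  = 24 * 3600.0 // one full consistency pass per day
		defaultRateBps = 8e6         // default max_rate of 8 MB/s
	)

	perNodeBytes := logicalBytes / nodes

	// Rate needed per node to finish a full pass within the target window.
	requiredBps := perNodeBytes / targetSeconds
	fmt.Printf("required rate: %.1f MB/s\n", requiredBps/1e6) // ~34.7 MB/s

	// How long a full pass takes at the default rate instead.
	passSeconds := perNodeBytes / defaultRateBps
	fmt.Printf("full pass at default rate: %.0f s (~%.1f days)\n",
		passSeconds, passSeconds/86400) // 375000 s, ~4.3 days
}
```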

That said, we're planning on removing this limit outright and instead using admission control to consume as much of the system's spare capacity as possible (but no more).

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024