kvserver: roll out replica pausing #86775
Comments
cc @cockroachdb/replication
(Morally) reverts cockroachdb#86147. We'll roll this setting out over time. See: cockroachdb#86775

Release justification: more conservative choice for a new cluster setting

Release note (ops change): the admission.kv.pause_replication_io_threshold cluster setting now defaults to zero (off). This supersedes an earlier release note about this setting.
86625: roachtest: compile cockroach-short with --crdb_test and use in sqlsmith r=yuzefovich a=yuzefovich

This commit adds another step to the nightly roachtest invocation to compile the `cockroach-short` binary with the `--crdb_test` build tag. It also adds plumbing throughout the roachtest infrastructure to expose that newly compiled binary, which is now used in 50% of cases in the sqlsmith roachtest.

Fixes: #83186.

Release justification: testing-only change.

Release note: None

86675: builtins: fix json_build_object for enums and void in the key r=yuzefovich a=yuzefovich

Previously, we would run into an internal error with the `json_build_object` builtin if an enum or void was passed for the key part, and this is now fixed. This commit also sorts all datums lexicographically to make it easier to see whether all scalar datums are mentioned.

Fixes: #84368.

Release justification: bug fix.

Release note (bug fix): Previously, CockroachDB would return an internal error when evaluating the `json_build_object` builtin when an enum or void datum was passed as the first argument, and this is now fixed.

86759: storage: implement `String()` for `MVCCRangeKeyStack` r=jbowens a=erikgrinaker

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

86776: kvserver: disable replica pausing by default r=erikgrinaker a=tbg

(Morally) reverts #86147. We'll roll this setting out over time. See: #86775

Release justification: more conservative choice for a new cluster setting

Release note (ops change): the admission.kv.pause_replication_io_threshold cluster setting now defaults to zero (off). This supersedes an earlier release note about this setting.

Co-authored-by: Yahor Yuzefovich
Co-authored-by: Erik Grinaker
Co-authored-by: Tobias Grieger
Met today with @joshimhoff @jason-crl @mwang1026 @erikgrinaker (raw notes). My suggestions following that meeting:
The value of 2.2 reflects the L0FileCountThreshold of 2000 in this alerting config; we want pausing to kick in only when replication traffic is the primary driver of overload and admission control is unable to keep the file count reasonable. Btw, #87424 tracks introducing the proper IOThreshold metric, so that we could instead alert on that metric with a cutoff of 2.0 (admission control activates at 1.0).

By activating pausing only when admission control can't prevent LSM inversion, we are conservatively limiting pausing to a subset of the cases in which it could be beneficial. However, this simplifies operations: otherwise we might see pausing without the user having a degraded experience, and it would be unclear what steps the SRE should take to resolve the problem.
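To make the relationship between the thresholds concrete, here is a minimal sketch of the arithmetic. It assumes the I/O score is the larger of two ratios (L0 file count and L0 sublevel count, each divided by its overload threshold) and that those thresholds are 1000 files and 20 sublevels; neither assumption is stated in this thread, and all names below are hypothetical. Only the 1.0 admission control cutoff, the 2000-file alerting threshold, and the 2.2 pausing value come from the discussion above.

```python
from dataclasses import dataclass

# Assumed overload thresholds (not stated in this thread).
L0_FILE_COUNT_OVERLOAD_THRESHOLD = 1000
L0_SUBLEVEL_COUNT_OVERLOAD_THRESHOLD = 20


@dataclass(frozen=True)
class L0State:
    """Hypothetical snapshot of a store's L0 health."""
    num_files: int
    num_sublevels: int


def io_score(state: L0State,
             file_threshold: int = L0_FILE_COUNT_OVERLOAD_THRESHOLD,
             sublevel_threshold: int = L0_SUBLEVEL_COUNT_OVERLOAD_THRESHOLD) -> float:
    """Return the assumed I/O score: the worse of the two normalized L0 ratios."""
    return max(state.num_files / file_threshold,
               state.num_sublevels / sublevel_threshold)


def admission_control_active(score: float) -> bool:
    """Admission control activates at a score of 1.0 (from the thread)."""
    return score >= 1.0


def pausing_eligible(score: float, pause_threshold: float = 2.2) -> bool:
    """Pausing is off when the threshold is zero, else needs score above it."""
    if pause_threshold <= 0:
        return False
    return score > pause_threshold


# A store at the alerting config's 2000 files scores 2.0 under these
# assumptions: admission control is active, but 2.2 is not yet exceeded.
store = L0State(num_files=2000, num_sublevels=10)
print(io_score(store), admission_control_active(io_score(store)),
      pausing_eligible(io_score(store)))
```

Under these assumptions, a store only becomes eligible for pausing once it is past the point at which the file-count alert would already fire, which matches the intent of reserving pausing for cases where admission control has not kept the LSM healthy.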
SRE can reach out liberally to the Replication team to highlight inverted LSMs. SRE should proactively reach out to KV L2 if the pausing feature is suspected of causing problems, for example due to more Raft snapshots. The cluster setting can be set to zero to disable pausing immediately.
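As a small aid for runbooks, the sketch below renders the SQL statement that changes the setting. The helper name is hypothetical; the setting name comes from the release note above, and `SET CLUSTER SETTING` is CockroachDB's standard statement for changing cluster settings. The sketch only builds a string and does not connect to a cluster.

```python
PAUSE_SETTING = "admission.kv.pause_replication_io_threshold"


def pause_setting_statement(threshold: float) -> str:
    """Render the statement that sets the pausing threshold (0 disables pausing)."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    # ':g' drops a trailing '.0', so 0 and 0.0 both render as '0'.
    return f"SET CLUSTER SETTING {PAUSE_SETTING} = {threshold:g};"


# Disable pausing immediately, as described above.
print(pause_setting_statement(0))
# Opt a cluster in at the value suggested in this thread.
print(pause_setting_statement(2.2))
```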
https://github.com/cockroachlabs/support/issues/1823 is an interesting incident where pausing might have triggered. See also the related Slack thread.
Here's a CentMon view for IOThreshold; it might be helpful for finding interesting clusters on CC.
TODO(sumeer): close this after creating an issue for replication admission control rollout for regular traffic, where follower pausing and allocator will have to play a role |
In #86147 we enabled a mechanism that alleviates raft append traffic's impact on I/O overload by letting raft leaders selectively pause replication streams to followers. Since we are unable to perform as much production-grade testing of this feature during the stability period as planned, we now prefer an incremental roll-out:
Jira issue: CRDB-18925