kvserver: roll out replica pausing #86775

Open
2 of 5 tasks
tbg opened this issue Aug 24, 2022 · 6 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Member

tbg commented Aug 24, 2022

In #86147 we enabled a mechanism that alleviates raft append traffic's impact on I/O overload by letting raft leaders selectively pause replication streams to followers. Since we are unable to perform as much production-grade testing of this feature during the stability period as planned, we now prefer an incremental roll-out:

Jira issue: CRDB-18925

@tbg tbg self-assigned this Aug 24, 2022
@blathers-crl

blathers-crl bot commented Aug 24, 2022

cc @cockroachdb/replication

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 24, 2022
tbg added a commit to tbg/cockroach that referenced this issue Aug 24, 2022
(Morally) reverts cockroachdb#86147. We'll roll this setting out over time.

See: cockroachdb#86775

Release justification: more conservative choice for a new cluster setting
Release note (ops change): the
admission.kv.pause_replication_io_threshold cluster setting now defaults
to zero (off). This supersedes an earlier release note about this
setting.
craig bot pushed a commit that referenced this issue Aug 25, 2022
86625: roachtest: compile cockroach-short with --crdb_test and use in sqlsmith r=yuzefovich a=yuzefovich

This commit adds another step to the nightly roachtest invocation to
compile `cockroach-short` binary with `--crdb_test` build tag. Then it
also adds plumbing throughout the roachtest infrastructure to expose
that newly-compiled binary which is now used in 50% cases in the sqlsmith
roachtest.

Fixes: #83186.

Release justification: testing-only change.

Release note: None

86675: builtins: fix json_build_object for enums and void in the key r=yuzefovich a=yuzefovich

Previously, we would run into an internal error with `json_build_object`
builtin if an enum or void was passed for the key part, and this is now
fixed. This commit also sorts all datums lexicographically to make it
easier to see whether all scalar datums are mentioned.

Fixes: #84368.

Release justification: bug fix.

Release note (bug fix): Previously, CockroachDB would return an internal
error when evaluating the `json_build_object` builtin when an enum or void
datum was passed as the first argument, and this is now fixed.

86759: storage: implement `String()` for `MVCCRangeKeyStack` r=jbowens a=erikgrinaker

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

86776: kvserver: disable replica pausing by default r=erikgrinaker a=tbg

(Morally) reverts #86147. We'll roll this setting out over time.

See: #86775

Release justification: more conservative choice for a new cluster setting
Release note (ops change): the
admission.kv.pause_replication_io_threshold cluster setting now defaults
to zero (off). This supersedes an earlier release note about this
setting.


Co-authored-by: Yahor Yuzefovich
Co-authored-by: Erik Grinaker
Co-authored-by: Tobias Grieger
@tbg
Member Author

tbg commented Sep 6, 2022

Met today with @joshimhoff @jason-crl @mwang1026 @erikgrinaker (raw notes).

My suggestions following that meeting:

  • SRE to roll out SET CLUSTER SETTING admission.kv.pause_replication_io_threshold = 2.2 on Serverless first, and then over time on CC dedicated.

The value of 2.2 reflects the L0FileCountThreshold of 2000 in this alerting config; we want pausing to kick in only when replication traffic is the primary driver of overload and admission control is unable to keep the file count reasonable. Btw, #87424 tracks introducing the proper IOThreshold so we could alert instead on that metric with a cutoff of 2.0 (admission control activates at 1.0).

By activating pausing only when admission control can't prevent LSM inversion, we are being conservative: pausing is used in only a subset of the cases in which it could be beneficial. However, this simplifies operations. Otherwise we might see pausing without the user having a degraded experience, and it would be unclear what steps the SRE should take to resolve the problem.

  • SRE to fix alerting for the underreplicated ranges metric (apparently the alert is silenced due to problems in single-node clusters). This is related, since paused followers will typically be tracked as underreplicated as they fall behind, reflecting the higher risk of a node unavailability translating into range unavailability.

SRE can reach out liberally to the Replication team to highlight inverted LSMs. SRE should be proactive in reaching out to KV L2 should the probing feature be suspected to be causing problems, for example due to more Raft snapshots. The cluster setting can be set to zero to disable pausing immediately.

  • REPL to periodically review CentMon for active pausing incidents (perhaps driven by a Slack reminder; this also needs a CentMon dashboard).

  • to write a pausing runbook and socialize with TSEs.

@tbg
Member Author

tbg commented Oct 4, 2022

https://github.com/cockroachlabs/support/issues/1823 is an interesting incident where pausing might've triggered. Related Slack thread.

@tbg
Member Author

tbg commented Oct 31, 2022

Here's a CentMon view for IOThreshold; it might be helpful for finding interesting clusters on CC:

https://cortex.centralized-monitoring.cockroachlabs.cloud/grafana/d/G90Zee7Vz/inverted-lsm?orgId=1&from=now-1h&to=now

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Dec 8, 2022
@exalate-issue-sync exalate-issue-sync bot assigned irfansharif and unassigned tbg Dec 8, 2022
@irfansharif irfansharif removed their assignment Mar 10, 2023
@sumeerbhola
Collaborator

TODO(sumeer): close this after creating an issue for replication admission control rollout for regular traffic, where follower pausing and allocator will have to play a role
