
kvserver: default admission.kv.pause_replication_io_threshold to a nonzero value #83920

Closed
tbg opened this issue Jul 6, 2022 · 3 comments · Fixed by #86147
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Jul 6, 2022

Is your feature request related to a problem? Please describe.

The aforementioned cluster setting configures the functionality introduced in #83851. As is, the default disables pausing replication to followers.

Describe the solution you'd like

After additional verification, we should install a nonzero default value (strawman: 0.8) for this setting.

Describe alternatives you've considered

We could default the setting to zero in 22.2 and enable it only on-demand (i.e. in the presence of follower write overload).

Additional context

Jira issue: CRDB-17353

Epic CRDB-15069

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Jul 6, 2022
@blathers-crl

blathers-crl bot commented Jul 6, 2022

cc @cockroachdb/replication

@tbg
Member Author

tbg commented Jul 6, 2022

Of particular concern is the behavior when there are multiple overloaded followers. Despite attempts to stick to one follower to drop messages to, updates in the gossiped view could lead to switching these followers relatively quickly and thus interrupting both streams for some time. For example, if we have

s1
s2 (overload)
s3 (overload)

and we chose to drop to s3, but then s3 quickly recovers while s2 doesn't, we could end up dropping to s2 the moment we stop dropping to s3. s3 won't have had a chance to catch up yet!

I think we are actually smart enough about this to avoid it - we only consider an overload signal if the corresponding follower is in good health and if the quorum is too. In the situation above, s3 would be probing (since we report it as unreachable to raft), so we wouldn't consider it as being able to participate in quorum right away, and thus wouldn't allow dropping traffic to s2. Still, it's hard to have much confidence in that working as precisely as it would have to.

edit: I added code to #83851 to consider followers as useful for quorum only if they're tracked by the quota pool, which should address this concern. Even if we flip-flop around, with the default quota pool size of 8mb, I think we normally wouldn't see disruptions.
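
A minimal, hypothetical Go sketch (not the actual CockroachDB code) of the selection rule described above: an overloaded follower may only be paused if the leader plus the non-overloaded followers currently tracked by the quota pool still form a quorum. In the s1/s2/s3 example, a freshly recovered s3 that is not yet tracked does not count toward quorum, so s2 keeps receiving traffic.

```go
// Hypothetical, simplified sketch of pause-candidate selection. All types
// and fields are invented for illustration.
package main

import "fmt"

type follower struct {
	storeID            int
	overloaded         bool // gossiped IOThreshold score above the pause threshold
	trackedByQuotaPool bool // recently caught up, so it counts toward quorum
}

// pausableFollowers returns the store IDs we may stop replicating to while
// still leaving a quorum of live participants (the leader counts as one).
func pausableFollowers(followers []follower, quorum int) []int {
	live := 1 // the leader itself
	for _, f := range followers {
		if f.trackedByQuotaPool && !f.overloaded {
			live++
		}
	}
	var paused []int
	if live < quorum {
		return paused // cannot afford to pause anyone
	}
	for _, f := range followers {
		if f.overloaded {
			paused = append(paused, f.storeID)
		}
	}
	return paused
}

func main() {
	// s2 is overloaded; s3 just recovered but is not yet tracked by the
	// quota pool, so it does not count toward quorum and s2 stays unpaused.
	fs := []follower{
		{storeID: 2, overloaded: true, trackedByQuotaPool: true},
		{storeID: 3, overloaded: false, trackedByQuotaPool: false},
	}
	fmt.Println(pausableFollowers(fs, 2)) // []
}
```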

tbg added a commit to tbg/cockroach that referenced this issue Jul 12, 2022
This commit implements the stop-gap solution to raft traffic
contributing to I/O overload that was discussed[^1] in cockroachdb#79215.

The newly introduced `admission.kv.pause_replication_io_threshold`
cluster setting can be used to define an I/O overload score above which
raft leaders will attempt to pause replication to nonessential
followers, to give them a chance to tidy up their LSM.

A follower is "nonessential" if it is either a non-voter, or if we think
(and are fairly certain that) quorum can be reached without it. If there
are multiple overloaded followers, we pick as many as we can without
losing quorum, and we try to do so in a way that's a) random, i.e.
different raft leaders will pick a different set, to give each store a
share of the relief, and b) stable, i.e. each raft leader will pick the
same set at least for a little while, to avoid interrupting a quorum by
rapidly switching the set of active followers (but see here[^2]).

The implementation of this is as follows (a simplified sketch follows the list):
- on the Store, materialize (on each tick) a `storeOverloadMap` from gossip
- on each Replica, on tick, compute from this map the set of followers
  to pause, and store it in a per-Replica map.
- consult this latter map
    - when sending `MsgApp` from `handleRaftReady`
    - when releasing proposal quota.
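
A condensed, hypothetical Go sketch of this per-tick flow. The names `storeOverloadMap` and the paused-followers map mirror the commit message; all other types and fields are invented for illustration.

```go
// Hypothetical sketch of the per-tick flow, not the actual implementation.
package main

import "fmt"

type storeID int

// On the Store: rebuild storeOverloadMap from the latest gossiped IOThreshold
// scores. With the default threshold of zero the map stays empty, which
// disables pausing entirely.
func buildStoreOverloadMap(gossiped map[storeID]float64, pauseThreshold float64) map[storeID]bool {
	m := map[storeID]bool{}
	if pauseThreshold <= 0 {
		return m // feature disabled by default
	}
	for id, score := range gossiped {
		if score >= pauseThreshold {
			m[id] = true
		}
	}
	return m
}

// On each Replica: recompute the per-Replica paused set from the store-level
// map. The real code additionally checks that quorum survives without the
// paused followers.
func pausedFollowers(followers []storeID, overloaded map[storeID]bool) map[storeID]bool {
	paused := map[storeID]bool{}
	for _, f := range followers {
		if overloaded[f] {
			paused[f] = true
		}
	}
	return paused
}

func main() {
	gossiped := map[storeID]float64{1: 0.1, 2: 0.95, 3: 0.4}
	overloaded := buildStoreOverloadMap(gossiped, 0.8)
	// The paused map is consulted when sending MsgApp from handleRaftReady
	// and when releasing proposal quota.
	fmt.Println(pausedFollowers([]storeID{2, 3}, overloaded)) // map[2:true]
}
```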

This commit by default disables this new functionality by setting the
cluster setting to zero. This has the effect of an empty
`storeOverloadMap` and thus, after one round of gossip and subsequent
tick (i.e. within seconds), an empty per-Replica paused followers map.

Additionally, it's worth pointing out the mixed-version behavior: the
old nodes' stores will be observed as gossiping a zero IOThreshold,
which is considered not overloaded, i.e. replication streams to old
nodes will never be paused.

Fixes cockroachdb#79215.

[^1]: cockroachdb#79215 (comment)
[^2]: cockroachdb#83920 (comment)

Release note (ops change): the
`admission.kv.pause_replication_io_threshold` cluster setting can be set
to a nonzero value to reduce I/O throughput on followers that are driven
towards an inverted LSM by replication traffic. The functionality is
disabled by default. A suggested value is 0.8, meaning that replication
traffic to nonessential followers is paused before these followers begin
throttling their foreground traffic.
tbg added a commit to tbg/cockroach that referenced this issue Jul 15, 2022
craig bot pushed a commit that referenced this issue Jul 15, 2022
83851: kvserver: avoid replicating to followers that are I/O overloaded r=erikgrinaker a=tbg

This commit implements the stop-gap solution to raft traffic contributing to
I/O overload that was discussed[^1] in #79215.

The newly introduced `admission.kv.pause_replication_io_threshold` cluster
setting can be used to define an I/O overload score above which raft leaders
will attempt to pause replication to nonessential followers, to give them a
chance to tidy up their LSM.

A follower is "nonessential" if it is either a non-voter, or if we think (and
are fairly certain that) quorum can be reached without it. If there are
multiple overloaded followers, we pick as many as we can without losing quorum,
and we try to do so in a way that's a) random, i.e.  different raft leaders
will pick a different set, to give each store a share of the relief, and b)
stable, i.e. each raft leader will pick the same set at least for a little
while, to avoid interrupting a quorum by rapidly switching the set of active
followers (but see here[^2]).

The implementation of this is as follows:
- on the Store, materialize (on each tick) a `storeOverloadMap` from gossip
- on each Replica, on tick, compute from this map the set of followers to pause,
  and store it in a per-Replica map.
- consult this latter map
    - when sending `MsgApp` from `handleRaftReady`
    - when releasing proposal quota.

This commit by default disables this new functionality by setting the cluster
setting to zero. This has the effect of an empty `storeOverloadMap` and thus,
after one round of gossip and subsequent tick (i.e. within seconds), an empty
per-Replica paused followers map.

Additionally, it's worth pointing out the mixed-version behavior: the old
nodes' stores will be observed as gossiping a zero IOThreshold, which is
considered not overloaded, i.e. replication streams to old nodes will never be
paused.

Owing to [experiments] (with an IOThreshold cutoff score of 0.8), more commits
were added to improve on the impact of snapshots. With just the basic commit
described above, the overloaded store will - at least in the experiment -
manage to control the shape of the LSM, though barely. This is because once
followers are paused, some of the ranges will undergo non-cooperative log
truncations, meaning that the paused follower will be cut off from the log.
This then initiates a steady stream of (raft) snapshots to the follower, and
much of this snapshot data ingests into L0, driving up the L0 **file** count (and
thus penalizing foreground traffic to n3, which is not causing any of this), as
seen here (red line):

<img width="816" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611060-dfbe88ce-567e-46fd-92d4-1b037e984efb.png">

<img width="817" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611028-b36859b1-758b-4ea9-aeb3-d6d0081278f0.png">

<img width="820" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611147-142e34ce-15c2-4467-8b88-5449585391b7.png">

On the plus side, even with that, the foreground traffic is not impacted by the
ailing follower. The chart below shows the p50, p99, and p100 for the kv0 workload
whose leases are on the non-throttled nodes n1 and n2. The first annotation is when
n3 gets unthrottled, and the second annotation marks it crashing (out of disk).

<img width="1223" alt="image" src="https://user-images.githubusercontent.com/5076964/178736513-e2ab12e1-a4c9-4ccd-8152-e853f6c051fd.png">

The first additional commit attempts to alleviate this issue by blocking (raft)
snapshots that are about to be sent to a paused follower. With it, the L0 file
count is under noticeably tighter control:

<img width="815" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611379-8de5b6bb-9c33-4ade-a925-55f0b849d175.png">

and this is due to the snapshots flowing only when followers are not paused:

<img width="817" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611479-999813b2-7dae-4cf6-a9c0-f637967743fb.png">

<img width="820" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611537-ebbee74d-7cb0-4439-869c-7e58a10691de.png">

A now-removed additional commit raised the bar for non-cooperative log
truncation to the configured max range size while a paused follower is present.
In theory, this should significantly cut down on the number of replicas needing
snapshots, but in practice the experiments show that whenever the follower
de-loads for a brief moment (as it frequently does in this experiment since the
overload is entirely driven by raft traffic), a non-cooperative truncation
using the old threshold occurs and we lose the benefit of having delayed the
truncation in the first place. That attempt still helped a little bit (making
it somewhat less likely for the IOThreshold score to crack 1.0) but in this PR,
this approach was extended to pause truncation *before* the follower enters
the pausable regime, and to do so regardless of the pausing status for the
follower. This operates on the assumption that if a follower is permanently
overloaded, it will never fall too far below the pausable threshold, and so we
would be delaying snapshots as long as this could possibly make sense.
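
A hypothetical Go sketch of this truncation back-off. The per-replica log budget and max range size roughly echo numbers mentioned in this thread, and the 0.9 "near the threshold" factor is an invented illustration, not the real heuristic.

```go
// Hypothetical sketch: raise the raft log size at which non-cooperative
// truncation kicks in whenever any follower's IOThreshold score is near the
// pause threshold, so the follower can be caught up from the log rather than
// via a snapshot. Constants and the 0.9 factor are illustrative assumptions.
package main

import "fmt"

const (
	defaultLogBudget = 8 << 20   // ~8 MiB of raft log per replica (old threshold)
	maxRangeSize     = 512 << 20 // illustrative configured maximum range size
)

// truncationThreshold returns the raft log size above which we would
// truncate non-cooperatively.
func truncationThreshold(followerScores []float64, pauseThreshold float64) int64 {
	if pauseThreshold <= 0 {
		return defaultLogBudget // feature disabled
	}
	for _, score := range followerScores {
		if score >= 0.9*pauseThreshold {
			// A follower is close to (or inside) the pausable regime:
			// hold on to the log so we can catch it up without a snapshot.
			return maxRangeSize
		}
	}
	return defaultLogBudget
}

func main() {
	fmt.Println(truncationThreshold([]float64{0.2, 0.75}, 0.8)) // 536870912
	fmt.Println(truncationThreshold([]float64{0.2, 0.3}, 0.8))  // 8388608
}
```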

This has a few potential downsides: First, retaining the raft log for
extended periods of time causes higher LSM write amp. I think this is
not a big problem - the old threshold, 8mb of log per replica, is
already large enough to have most log entries flushed to L0, and so the
bulk of the write amplification will be incurred either way. Second,
many large raft logs might run the cluster out of disk; it would be
better to have an explicit trade-off between delaying truncations
and disk usage, but at present there is no such mechanism. Third, there
might be a performance penalty on the healthy nodes due to lower
cache efficiencies both in pebble and the raft entry cache. And
fourth, catching up on many very long raft logs at the same time
could trigger known weaknesses in the raft transport related to memory
usage with increased likelihood.

Fixes #79215.

[experiments]: #81516


[^1]: #79215 (comment)
[^2]: #83920 (comment)

Release note (ops change): the `admission.kv.pause_replication_io_threshold`
cluster setting can be set to a nonzero value to reduce I/O throughput on
followers that are driven towards an inverted LSM by replication traffic. The
functionality is disabled by default. A suggested value is 0.8, meaning that
replication traffic to nonessential followers is paused before these followers
begin throttling their foreground traffic.


Co-authored-by: Tobias Grieger <[email protected]>
@nvanbenschoten
Member

I think we'll also need to address #84884 before enabling this protection by default.

craig bot pushed a commit that referenced this issue Aug 15, 2022
86139: ui/cluster-ui: prevent returning new obj from session storage every time r=xinhaoz a=xinhaoz

When we retrieve from localSettings in db-console, we retrieve
in the following order: `localSettings ?? sessionStorage ?? defaultValue`.
The value from session storage is retrieved by parsing a JSON
string and returning a new object. In the active execution pages, we
update the URL search string every time the filters object changes to
reflect any filters selected. Due to the local settings behaviour,
we could end up in an infinite update loop if we continuously read from
session storage, since a new object was being returned each time
(e.g. on a hard refresh). To prevent this, when retrieving a stored
value, if it exists in session storage but not local storage, we can
set the local setting to the parsed object from session storage.

Release note (bug fix): active execution pages will no longer crash
if there are no filters set in local settings.

Release justification: bug fix

86147: kvserver: default admission.kv.pause_replication_io_threshold to 0.8 r=erikgrinaker a=tbg

Closes #83920.

Release note (ops change): The
`admission.kv.pause_replication_io_threshold` cluster setting is now set to
a default value of 0.8. On a fully migrated v22.2+ deployment, this will
allow the KV layer to pause replication streams to followers located on
stores that are close to activating their I/O admission control
subsystem (thereby protecting these followers from additional overload).
The cluster setting can be disabled by setting it to zero.

Release justification: enables new functionality


86152: docs: remove 'api change' from git commit scripts r=knz a=taroface

The recently published API Support Policy and updated [release note guidelines](https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/186548364/Release+notes) resulted in removing `api change` from the list of valid release note categories. This removes the syntax from the release note generation.

Release justification: changes to doc only
Release note: None

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ryan Kuo <[email protected]>
@craig craig bot closed this as completed in 1fa00d4 Aug 15, 2022