
kvserver: default admission.kv.pause_replication_io_threshold to a nonzero value #83920

Closed
tbg opened this issue Jul 6, 2022 · 3 comments · Fixed by #86147
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Jul 6, 2022

Is your feature request related to a problem? Please describe.

The aforementioned cluster setting configures the functionality introduced in #83851. As is, the default disables pausing replication to followers.

Describe the solution you'd like

After additional verification, we should install a nonzero default value (strawman: 0.8) for this setting.

Describe alternatives you've considered

We could default the setting to zero in 22.2 and enable it only on-demand (i.e. in the presence of follower write overload).

Additional context

Jira issue: CRDB-17353

Epic CRDB-15069

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Jul 6, 2022
@blathers-crl

blathers-crl bot commented Jul 6, 2022

cc @cockroachdb/replication

@tbg
Member Author

tbg commented Jul 6, 2022

Of particular concern is the behavior when there are multiple overloaded followers. Despite attempts to stick to one follower to drop messages to, updates in the gossiped view could lead to switching these followers relatively quickly and thus interrupting both streams for some time. For example, if we have

s1
s2 (overload)
s3 (overload)

and we chose to drop to s3, but then s3 quickly recovers while s2 doesn't, we could end up dropping to s2 the moment we stop dropping to s3. s3 won't have had a chance to catch up yet!

I think we are actually smart enough about this to avoid it - we only consider an overload signal if the corresponding follower is in good health and if the quorum is too. In the situation above, s3 would be probing (since we report it as unreachable to raft), so we wouldn't consider it as being able to participate in quorum right away, and thus wouldn't allow dropping traffic to s2. Still, it's hard to have much confidence in that working as precisely as it would have to.

edit: I added code to #83851 to consider followers as useful for quorum only if they're tracked by the quota pool, which should address this concern. Even if we flip-flop around, with the default quota pool size of 8mb, I think we normally wouldn't see disruptions.
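
A minimal, hypothetical Go sketch (not the actual CockroachDB code) of the selection rule described above: an overloaded follower may only be paused if the leader plus the non-overloaded followers currently tracked by the quota pool still form a quorum. In the s1/s2/s3 example, a freshly recovered s3 that is not yet tracked does not count toward quorum, so s2 keeps receiving traffic.

```go
// Hypothetical, simplified sketch of pause-candidate selection. All types
// and fields are invented for illustration.
package main

import "fmt"

type follower struct {
	storeID            int
	overloaded         bool // gossiped IOThreshold score above the pause threshold
	trackedByQuotaPool bool // recently caught up, so it counts toward quorum
}

// pausableFollowers returns the store IDs we may stop replicating to while
// still leaving a quorum of live participants (the leader counts as one).
func pausableFollowers(followers []follower, quorum int) []int {
	live := 1 // the leader itself
	for _, f := range followers {
		if f.trackedByQuotaPool && !f.overloaded {
			live++
		}
	}
	var paused []int
	if live < quorum {
		return paused // cannot afford to pause anyone
	}
	for _, f := range followers {
		if f.overloaded {
			paused = append(paused, f.storeID)
		}
	}
	return paused
}

func main() {
	// s2 is overloaded; s3 just recovered but is not yet tracked by the
	// quota pool, so it does not count toward quorum and s2 stays unpaused.
	fs := []follower{
		{storeID: 2, overloaded: true, trackedByQuotaPool: true},
		{storeID: 3, overloaded: false, trackedByQuotaPool: false},
	}
	fmt.Println(pausableFollowers(fs, 2)) // []
}
```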

tbg added a commit to tbg/cockroach that referenced this issue Jul 12, 2022
This commit implements the stop-gap solution to raft traffic
contributing to I/O overload that was discussed[^1] in cockroachdb#79215.

The newly introduced `admission.kv.pause_replication_io_threshold`
cluster setting can be used to define an I/O overload score above which
raft leaders will attempt to pause replication to nonessential
followers, to give them a chance to tidy up their LSM.

A follower is "nonessential" if it is either a non-voter, or if we think
(and are fairly certain that) quorum can be reached without it. If there
are multiple overloaded followers, we pick as many as we can without
losing quorum, and we try to do so in a way that's a) random, i.e.
different raft leaders will pick a different set, to give each store a
share of the relief, and b) stable, i.e. each raft leader will pick the
same set at least for a little while, to avoid interrupting a quorum by
rapidly switching the set of active followers (but see here[^2]).

The implementation of this is as follows (a simplified sketch follows the list):
- on the Store, materialize (on each tick) a `storeOverloadMap` from gossip
- on each Replica, on tick, compute from this map the set of followers
  to pause, and store it in a per-Replica map.
- consult this latter map
    - when sending `MsgApp` from `handleRaftReady`
    - when releasing proposal quota.
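
A condensed, hypothetical Go sketch of this per-tick flow. The names `storeOverloadMap` and the paused-followers map mirror the commit message; all other types and fields are invented for illustration.

```go
// Hypothetical sketch of the per-tick flow, not the actual implementation.
package main

import "fmt"

type storeID int

// On the Store: rebuild storeOverloadMap from the latest gossiped IOThreshold
// scores. With the default threshold of zero the map stays empty, which
// disables pausing entirely.
func buildStoreOverloadMap(gossiped map[storeID]float64, pauseThreshold float64) map[storeID]bool {
	m := map[storeID]bool{}
	if pauseThreshold <= 0 {
		return m // feature disabled by default
	}
	for id, score := range gossiped {
		if score >= pauseThreshold {
			m[id] = true
		}
	}
	return m
}

// On each Replica: recompute the per-Replica paused set from the store-level
// map. The real code additionally checks that quorum survives without the
// paused followers.
func pausedFollowers(followers []storeID, overloaded map[storeID]bool) map[storeID]bool {
	paused := map[storeID]bool{}
	for _, f := range followers {
		if overloaded[f] {
			paused[f] = true
		}
	}
	return paused
}

func main() {
	gossiped := map[storeID]float64{1: 0.1, 2: 0.95, 3: 0.4}
	overloaded := buildStoreOverloadMap(gossiped, 0.8)
	// The paused map is consulted when sending MsgApp from handleRaftReady
	// and when releasing proposal quota.
	fmt.Println(pausedFollowers([]storeID{2, 3}, overloaded)) // map[2:true]
}
```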

This commit by default disables this new functionality by setting the
cluster setting to zero. This has the effect of an empty
`storeOverloadMap` and thus, after one round of gossip and subsequent
tick (i.e. within seconds), an empty per-Replica paused followers map.

Additionally, it's worth pointing out the mixed-version behavior: the
old nodes' stores will be observed as gossiping a zero IOThreshold,
which is considered not overloaded, i.e. replication streams to old
nodes will never be paused.

Fixes cockroachdb#79215.

[^1]: cockroachdb#79215 (comment)
[^2]: cockroachdb#83920 (comment)

Release note (ops change): the
`admission.kv.pause_replication_io_threshold` cluster setting can be set
to a nonzero value to reduce I/O throughput on followers that are driven
towards an inverted LSM by replication traffic. The functionality is
disabled by default. A suggested value is 0.8, meaning that replication
traffic to nonessential followers is paused before these followers begin
throttling their foreground traffic.
tbg added a commit to tbg/cockroach that referenced this issue Jul 15, 2022
craig bot pushed a commit that referenced this issue Jul 15, 2022
83851: kvserver: avoid replicating to followers that are I/O overloaded r=erikgrinaker a=tbg

This commit implements the stop-gap solution to raft traffic contributing to
I/O overload that was discussed[^1] in #79215.

The newly introduced `admission.kv.pause_replication_io_threshold` cluster
setting can be used to define an I/O overload score above which raft leaders
will attempt to pause replication to nonessential followers, to give them a
chance to tidy up their LSM.

A follower is "nonessential" if it is either a non-voter, or if we think (and
are fairly certain that) quorum can be reached without it. If there are
multiple overloaded followers, we pick as many as we can without losing quorum,
and we try to do so in a way that's a) random, i.e.  different raft leaders
will pick a different set, to give each store a share of the relief, and b)
stable, i.e. each raft leader will pick the same set at least for a little
while, to avoid interrupting a quorum by rapidly switching the set of active
followers (but see here[^2]).

The implementation of this is as follows:
- on the Store, materialize (on each tick) a `storeOverloadMap` from gossip
- on each Replica, on tick, compute from this map the set of followers to pause,
  and store it in a per-Replica map.
- consult this latter map
    - when sending `MsgApp` from `handleRaftReady`
    - when releasing proposal quota.

This commit by default disables this new functionality by setting the cluster
setting to zero. This has the effect of an empty `storeOverloadMap` and thus,
after one round of gossip and subsequent tick (i.e. within seconds), an empty
per-Replica paused followers map.

Additionally, it's worth pointing out the mixed-version behavior: the old
nodes' stores will be observed as gossiping a zero IOThreshold, which is
considered not overloaded, i.e. replication streams to old nodes will never be
paused.

Owing to [experiments] (with an IOThreshold cutoff score of 0.8), more commits
were added to improve on the impact of snapshots. With just the basic commit
described above, the overloaded store will - at least in the experiment -
manage to control the shape of the LSM, though barely. This is because once
followers are paused, some of the ranges will undergo non-cooperative log
truncations, meaning that the paused follower will be cut off from the log.
This then initiates a steady stream of (raft) snapshots to the follower, and
much of this snapshot data ingests into L0, driving up the L0 **file** count (and
thus penalizing foreground traffic to n3, which is not causing any of this), as
seen here (red line):

<img width="816" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611060-dfbe88ce-567e-46fd-92d4-1b037e984efb.png">

<img width="817" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611028-b36859b1-758b-4ea9-aeb3-d6d0081278f0.png">

<img width="820" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611147-142e34ce-15c2-4467-8b88-5449585391b7.png">

On the plus side, even with that, the foreground traffic is not impacted by the
ailing follower. The chart below shows the p50, p99, and p100 for the kv0 workload
whose leases are on the non-throttled nodes n1 and n2. The first annotation is when
n3 gets unthrottled, and the second annotation marks it crashing (out of disk).

<img width="1223" alt="image" src="https://user-images.githubusercontent.com/5076964/178736513-e2ab12e1-a4c9-4ccd-8152-e853f6c051fd.png">

The first additional commit attempts to alleviate this issue by blocking (raft)
snapshots that are about to be sent to a paused follower. With it, the L0 file
count is under noticeably tighter control:

<img width="815" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611379-8de5b6bb-9c33-4ade-a925-55f0b849d175.png">

and this is due to the snapshots flowing only when followers are not paused:

<img width="817" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611479-999813b2-7dae-4cf6-a9c0-f637967743fb.png">

<img width="820" alt="image"
src="https://user-images.githubusercontent.com/5076964/178611537-ebbee74d-7cb0-4439-869c-7e58a10691de.png">

A now-removed additional commit raised the bar for non-cooperative log
truncation to the configured max range size while a paused follower is present.
In theory, this should significantly cut down on the number of replicas needing
snapshots, but in practice the experiments show that whenever the follower
de-loads for a brief moment (as it frequently does in this experiment since the
overload is entirely driven by raft traffic), a non-cooperative truncation
using the old threshold occurs and we lose the benefit of having delayed the
truncation in the first place. That attempt still helped a little bit (making
it somewhat less likely for the IOThreshold score to crack 1.0) but in this PR,
this approach was extended to pause truncation *before* the follower enters
the pausable regime, and to do so regardless of the pausing status for the
follower. This operates on the assumption that if a follower is permanently
overloaded, it will never fall too far below the pausable threshold, and so we
would be delaying snapshots as long as this could possibly make sense.
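
A hypothetical Go sketch of this truncation back-off. The per-replica log budget and max range size roughly echo numbers mentioned in this thread, and the 0.9 "near the threshold" factor is an invented illustration, not the real heuristic.

```go
// Hypothetical sketch: raise the raft log size at which non-cooperative
// truncation kicks in whenever any follower's IOThreshold score is near the
// pause threshold, so the follower can be caught up from the log rather than
// via a snapshot. Constants and the 0.9 factor are illustrative assumptions.
package main

import "fmt"

const (
	defaultLogBudget = 8 << 20   // ~8 MiB of raft log per replica (old threshold)
	maxRangeSize     = 512 << 20 // illustrative configured maximum range size
)

// truncationThreshold returns the raft log size above which we would
// truncate non-cooperatively.
func truncationThreshold(followerScores []float64, pauseThreshold float64) int64 {
	if pauseThreshold <= 0 {
		return defaultLogBudget // feature disabled
	}
	for _, score := range followerScores {
		if score >= 0.9*pauseThreshold {
			// A follower is close to (or inside) the pausable regime:
			// hold on to the log so we can catch it up without a snapshot.
			return maxRangeSize
		}
	}
	return defaultLogBudget
}

func main() {
	fmt.Println(truncationThreshold([]float64{0.2, 0.75}, 0.8)) // 536870912
	fmt.Println(truncationThreshold([]float64{0.2, 0.3}, 0.8))  // 8388608
}
```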

This has a few potential downsides: First, retaining the raft log for
extended periods of time causes higher LSM write amp. I think this is
not a big problem - the old threshold, 8mb of log per replica, is
already large enough to have most log entries flushed to L0, and so the
bulk of the write amplification will be incurred either way. Second,
many large raft logs might run the cluster out of disk; it would be
better to have an explicit trade-off between delaying truncations
and disk usage, but at present there is no such mechanism. Third, there
might be a performance penalty on the healthy nodes due to lower
cache efficiencies both in pebble and the raft entry cache. And
fourth, catching up on many very long raft logs at the same time
could trigger known weaknesses in the raft transport related to memory
usage with increased likelihood.

Fixes #79215.

[experiments]: #81516


[^1]: #79215 (comment)
[^2]: #83920 (comment)

Release note (ops change): the `admission.kv.pause_replication_io_threshold`
cluster setting can be set to a nonzero value to reduce I/O throughput on
followers that are driven towards an inverted LSM by replication traffic. The
functionality is disabled by default. A suggested value is 0.8, meaning that
replication traffic to nonessential followers is paused before these followers
begin throttling their foreground traffic.


Co-authored-by: Tobias Grieger <[email protected]>
@nvanbenschoten
Member

I think we'll also need to address #84884 before enabling this protection by default.

craig bot pushed a commit that referenced this issue Aug 15, 2022
86139: ui/cluster-ui: prevent returning new obj from session storage every time r=xinhaoz a=xinhaoz

When we retrieve from localSettings in db-console, we retrieve
in the following order: `localSettings ?? sessionStorage ?? defaultValue`.
The value from session storage is retrieved by parsing a JSON
string and returning a new object. In the active execution pages, we
update the URL search string every time the filters object changes to
reflect any filters selected. Due to the local settings behaviour,
we could end up in an infinite update loop if we continuously read from
session storage, since a new object was being returned each time
(e.g. on a hard refresh). To prevent this, when retrieving a stored
value, if it exists in session storage but not local storage, we can
set the local setting to the parsed object from session storage.

Release note (bug fix): active execution pages will no longer crash
if there are no filters set in local settings.

Release justification: bug fix

86147: kvserver: default admission.kv.pause_replication_io_threshold to 0.8 r=erikgrinaker a=tbg

Closes #83920.

Release note (ops change): The
`admission.kv.pause_replication_io_threshold` cluster setting is now set to
a default value of 0.8. On a fully migrated v22.2+ deployment, this will
allow the KV layer to pause replication streams to followers located on
stores that are close to activating their I/O admission control
subsystem (thereby protecting these followers from additional overload).
The cluster setting can be disabled by setting it to zero.

Release justification: enables new functionality


86152: docs: remove 'api change' from git commit scripts r=knz a=taroface

The recently published API Support Policy and updated [release note guidelines](https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/186548364/Release+notes) resulted in removing `api change` from the list of valid release note categories. This removes the syntax from the release note generation.

Release justification: changes to doc only
Release note: None

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ryan Kuo <[email protected]>
@craig craig bot closed this as completed in 1fa00d4 Aug 15, 2022