admission: experiment with quota pool awareness of IO overload on remote stores #82132
(adding what I mentioned in our synchronous discussion)

I ran a first experiment here: upon sending a MsgApp, check whether the recipient store has L0 overload, and if so, just drop the MsgApp:

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index f80f91c4cd..c954375490 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -1496,6 +1496,10 @@ func (r *Replica) sendRaftMessagesRaftMuLocked(ctx context.Context, messages []r
}
}
+var errOverloaded = errors.New("follower is under IO overload; dropping MsgApp")
+
+var toErrEveryN = log.Every(time.Second)
+
// sendRaftMessageRaftMuLocked sends a Raft message.
func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Message) {
r.mu.RLock()
@@ -1518,6 +1522,10 @@ func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Me
startKey = r.descRLocked().StartKey
}
})
+ if storeDesc, ok := r.store.allocator.StorePool.GetStoreDescriptor(toReplica.StoreID); ok && storeDesc.Capacity.L0Sublevels > 20 {
+ // NB: this isn't good enough for prod. If multiple followers are overloaded, we must still send to a quorum.
+ toErr = errOverloaded
+ }
}
r.mu.RUnlock()
@@ -1527,8 +1535,10 @@ func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Me
return
}
if toErr != nil {
- log.Warningf(ctx, "failed to look up recipient replica %d in r%d while sending %s: %s",
- msg.To, r.RangeID, msg.Type, toErr)
+ if toErrEveryN.ShouldLog() {
+ log.Warningf(ctx, "failed to look up recipient replica %d in r%d while sending %s: %s",
+ msg.To, r.RangeID, msg.Type, toErr)
+ }
return
}

I then ran #82968, invoked via

After the cluster had been running for about an hour, I ssh'ed into n3 and stopped
which I removed only a few hours later (second annotation). This experiment validated a few things I had expected, namely:

The LSM on the follower was "protected". File count and sublevel count periodically breached the admission control limit, but would drop below it again as well (as other nodes would start dropping MsgApp):

The quota pools on the remote nodes filled up occasionally, stalling the entire workload (bad), but when this wasn't happening, p99s for kv0 were essentially unaffected (~30ms p99; good). Now if we had added the "quota pool awareness" to the experiment, we would expect not to see the foreground workload stalls (at least for the workload that has its leases on n1 and n2, both unaffected by overload). So this continues to be a good next thing to try, and a possible band-aid for 22.2.

As expected, the workload that had its leases on n3 degraded, since admission control would periodically turn on and do what it could to protect the node (throttle foreground traffic):

Interestingly, I also managed to ad-hoc reproduce #82116 when I removed the rate limit. This is because as n3 fell behind, it started to lose access to the log, resulting in snapshots being sent to it:

You can see that when the rate limit was lifted, throughput for these snapshots increased, and this correlates with an instance of L0 overload; the likely explanation is that much of these snapshots went into L0:
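As the NB in the diff above notes, unconditionally dropping MsgApp to an overloaded follower is not safe: if too many followers are overloaded at once, the range stalls. Below is a minimal sketch of what a quorum-aware version of that check might look like; all identifiers here (dropTargets, storeOverloaded) are hypothetical and not part of the codebase.

```go
// Sketch only: a quorum-aware filter for deciding which overloaded followers
// can safely have their MsgApps dropped. All identifiers are hypothetical.
package overload

// dropTargets returns the follower store IDs whose appends can be withheld
// while still leaving a quorum of replicas (counting the leader) that keeps
// receiving MsgApps.
func dropTargets(followerStoreIDs []int64, storeOverloaded func(int64) bool) map[int64]bool {
	total := len(followerStoreIDs) + 1 // followers plus the leader
	quorum := total/2 + 1
	healthy := 1 // the leader always has its own log
	for _, s := range followerStoreIDs {
		if !storeOverloaded(s) {
			healthy++
		}
	}
	drop := make(map[int64]bool)
	if healthy < quorum {
		// Not enough healthy replicas: keep appending to everyone, including
		// overloaded stores, so the range can still make progress.
		return drop
	}
	for _, s := range followerStoreIDs {
		if storeOverloaded(s) {
			drop[s] = true
		}
	}
	return drop
}
```

A finer-grained policy could keep appending to just enough overloaded followers to round out a quorum rather than falling back to appending to all of them, but that requires ranking overloaded stores against each other, which is where this starts to shade into real flow control.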
I re-ran the experiment today and, without the config hiccups, it performed as expected. n3 would go overloaded, and we'd stop appending to it (all the while avoiding quota pool build-up). This cycle repeated a few times, at which point (I believe) most of the replicas on n3 had been cut off from replication traffic permanently due to log truncations, and snapshots would only trickle through due to the bandwidth limits.

The kv0 workload directed at n1 and n2 performed consistently throughout (i.e. it didn't get bogged down by n3's demise); when the throttle was removed, n3 caught up and now I have a healthy cluster.

The one "unrealistic" thing here is that we "ignore" n3 for 10 minutes, whereas L0 really drops sub-critical within seconds. This is an artifact of using the windowed L0 signal that is part of the store gossip; I should repeat this with a more up-to-date signal. But all in all, this is about as good as CRDB can perform under these circumstances.

^-- NB: no idea about the latency spike at ~1:30pm; it doesn't correlate with anything I can see in the other graphs and I don't think it is related to the experiment. These are p50/p90/p100, btw.
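One way to replace the ten-minute gossip window with a fresher signal, sketched below purely as an illustration: apply hysteresis to a frequently refreshed L0-sublevel reading, so a follower is paused quickly when its LSM inverts and un-paused within seconds of recovering. The types and thresholds here (followerPauser, pauseAbove=20, resumeBelow=10) are made up; only the >20 sublevel trigger mirrors the limit used in the experiment above.

```go
// Sketch only: hysteresis around a (hypothetical) frequently refreshed
// IO-overload reading, so a follower is paused quickly when its LSM inverts
// and resumed quickly once it recovers, instead of being ignored for the
// entire gossip window.
package overload

import "time"

type ioOverloadSignal struct {
	l0Sublevels int       // latest observed L0 sublevel count
	updatedAt   time.Time // when the reading was taken
}

type followerPauser struct {
	pauseAbove  int           // e.g. 20 sublevels: start withholding MsgApps
	resumeBelow int           // e.g. 10 sublevels: resume appends
	maxStale    time.Duration // readings older than this are ignored
	paused      bool
}

// shouldPause reports whether appends to the follower should currently be
// withheld, updating the hysteresis state as a side effect.
func (p *followerPauser) shouldPause(sig ioOverloadSignal, now time.Time) bool {
	if now.Sub(sig.updatedAt) > p.maxStale {
		p.paused = false // stale signal: fail open and keep appending
		return false
	}
	switch {
	case p.paused && sig.l0Sublevels < p.resumeBelow:
		p.paused = false
	case !p.paused && sig.l0Sublevels > p.pauseAbove:
		p.paused = true
	}
	return p.paused
}
```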
Is your feature request related to a problem? Please describe.
See #79215. At the time of writing, MsgApp is not subject to admission control. Direct throttling at the receiver would lead to quota pool fullness and thus throttling of the user. While this may be desirable in some settings (e.g. a marginally slower follower that, without such throttling, would fall further and further behind), in practice a big problem is posed by stores that are so slow that they have an unacceptable effect on latency if taken into account for throttling.
Describe the solution you'd like
We should consider such stores as "non-live" for the purposes of replication. Concretely, they ought to be treated like followers in StateProbing (the state surfaced in (*RawNode).Status().Progress[replicaID].State); raft will only append to them ~once per second until they acknowledge. By artificially treating followers as probed as long as their (gossiped, in this experiment) store stats indicate that admission control on them is throttling, we can cheaply emulate the behavior of a "real" flow control mechanism (#79755), which would necessarily have to be more complex (multi-tenancy, priorities, etc).

Essentially, the intuition behind this approach is that a follower that isn't "live" for the purpose of quickly acking MsgApps should be treated as absent. In the experiment we substitute "has no inverted LSM as determined by admission control" for "live", which is a sensible first approximation. A more complete solution (out of scope here) might be based on the store's recent measured write throughput, which leaseholders could use to shape traffic to the node. However, this verges into distributed rate limiting (a slow store will have leaders on many other stores; how do they coordinate who gets to go first?) and is out of scope here. (I'll point out that we already do a form of distributed rate limiting for tenant request units, so some reuse might be possible.)
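For reference, the probing state mentioned above can be read off a leader's raft status roughly as follows. This assumes the etcd-io raft packages (go.etcd.io/etcd/raft/v3 and its tracker subpackage), where the state is exposed as tracker.StateProbe; the helper name is made up for illustration.

```go
// Sketch only: listing the followers that raft is merely probing (at most one
// outstanding MsgApp until they respond). probingFollowers is a made-up helper.
package overload

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// probingFollowers returns the raft IDs of all peers the leader is probing
// rather than replicating to. The idea in this issue is to force overloaded
// followers into (or treat them as if they were in) this state, so that
// MsgApp traffic to them all but stops until they recover.
func probingFollowers(st raft.Status) []uint64 {
	var ids []uint64
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // the leader itself
		}
		if pr.State == tracker.StateProbe {
			ids = append(ids, id)
		}
	}
	return ids
}
```

Note that Progress is only populated on the leader, which is also where any MsgApp-dropping decision would have to be made.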
Describe alternatives you've considered
There are possibly other alternatives, which should be linked into the table in #79215.
Jira issue: CRDB-16308
Epic CRDB-15069