admission: experiment with quota pool awareness of IO overload on remote stores #82132
(adding what I mentioned in our synchronous discussion)

I ran a first experiment here: upon sending a MsgApp, check whether the recipient store has L0 overload, and if so, just drop the MsgApp:

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index f80f91c4cd..c954375490 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -1496,6 +1496,10 @@ func (r *Replica) sendRaftMessagesRaftMuLocked(ctx context.Context, messages []r
}
}
+var errOverloaded = errors.New("follower is under IO overload; dropping MsgApp")
+
+var toErrEveryN = log.Every(time.Second)
+
// sendRaftMessageRaftMuLocked sends a Raft message.
func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Message) {
r.mu.RLock()
@@ -1518,6 +1522,10 @@ func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Me
startKey = r.descRLocked().StartKey
}
})
+ if storeDesc, ok := r.store.allocator.StorePool.GetStoreDescriptor(toReplica.StoreID); ok && storeDesc.Capacity.L0Sublevels > 20 {
+ // NB: this isn't good enough for prod. If multiple followers are overloaded, we must still send to a quorum.
+ toErr = errOverloaded
+ }
}
r.mu.RUnlock()
@@ -1527,8 +1535,10 @@ func (r *Replica) sendRaftMessageRaftMuLocked(ctx context.Context, msg raftpb.Me
return
}
if toErr != nil {
- log.Warningf(ctx, "failed to look up recipient replica %d in r%d while sending %s: %s",
- msg.To, r.RangeID, msg.Type, toErr)
+ if toErrEveryN.ShouldLog() {
+ log.Warningf(ctx, "failed to look up recipient replica %d in r%d while sending %s: %s",
+ msg.To, r.RangeID, msg.Type, toErr)
+ }
return
}

I then ran #82968, invoked via

After the cluster had been running for about an hour, I ssh'ed into n3 and stopped
which I removed only a few hours later (second annotation). This experiment validated a few things I had expected, namely:

The LSM on the follower was "protected". File count and sublevel count periodically breached the admission control limit, but would drop below it again as well (as other nodes would start dropping MsgApp):

The quota pools on the remote nodes filled up occasionally, stalling the entire workload (bad), but when this wasn't happening, p99s for kv0 were essentially unaffected (~30ms p99; good). Now if we had added the "quota pool awareness" to the experiment, we would expect not to see the foreground workload stalls (at least for the workload that has its leases on n1 and n2, both unaffected by overload). So this continues to be a good next thing to try, and a possible band-aid for 22.2.

As expected, the workload that had its leases on n3 degraded, since admission control would periodically turn on and do what it could to protect the node (throttle foreground traffic):

Interestingly, I also managed to ad-hoc reproduce #82116 when I removed the rate limit. This is because as n3 fell behind, it started to lose access to the log, resulting in snapshots being sent to it:

You can see that when the rate limit was lifted, throughput for these snapshots increased, and this correlates with an instance of L0 overload; the likely explanation is that much of these snapshots went into L0:
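As the NB in the diff above notes, unconditionally dropping MsgApp to an overloaded follower is not safe: if too many followers are overloaded at once, the range stalls. Below is a minimal sketch of what a quorum-aware version of that check might look like; all identifiers here (dropTargets, storeOverloaded) are hypothetical and not part of the codebase.

```go
// Sketch only: a quorum-aware filter for deciding which overloaded followers
// can safely have their MsgApps dropped. All identifiers are hypothetical.
package overload

// dropTargets returns the follower store IDs whose appends can be withheld
// while still leaving a quorum of replicas (counting the leader) that keeps
// receiving MsgApps.
func dropTargets(followerStoreIDs []int64, storeOverloaded func(int64) bool) map[int64]bool {
	total := len(followerStoreIDs) + 1 // followers plus the leader
	quorum := total/2 + 1
	healthy := 1 // the leader always has its own log
	for _, s := range followerStoreIDs {
		if !storeOverloaded(s) {
			healthy++
		}
	}
	drop := make(map[int64]bool)
	if healthy < quorum {
		// Not enough healthy replicas: keep appending to everyone, including
		// overloaded stores, so the range can still make progress.
		return drop
	}
	for _, s := range followerStoreIDs {
		if storeOverloaded(s) {
			drop[s] = true
		}
	}
	return drop
}
```

A finer-grained policy could keep appending to just enough overloaded followers to round out a quorum rather than falling back to appending to all of them, but that requires ranking overloaded stores against each other, which is where this starts to shade into real flow control.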
I re-ran the experiment today and, without the config hiccups, it performed as expected. n3 would go overloaded, and we'd stop appending to it (all the while avoiding quota pool build-up). This cycle repeated a few times, at which point (I believe) most of the replicas on n3 had been cut off from replication traffic permanently due to log truncations, and snapshots would only trickle through due to the bandwidth limits.

The kv0 workload directed at n1 and n2 performed consistently throughout (i.e. it didn't get bogged down by n3's demise); when the throttle was removed, n3 caught up and now I have a healthy cluster.

The one "unrealistic" thing here is that we "ignore" n3 for 10 minutes, whereas L0 really drops sub-critical within seconds. This is an artifact of using the windowed L0 signal that is part of the store gossip; I should repeat this with a more up-to-date signal. But all in all, this is about as good as CRDB can perform under these circumstances.

^-- NB: no idea about the latency spike at ~1:30pm; it doesn't correlate with anything I can see in the other graphs and I don't think it is related to the experiment. These are p50/p90/p100, btw.
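One way to replace the ten-minute gossip window with a fresher signal, sketched below purely as an illustration: apply hysteresis to a frequently refreshed L0-sublevel reading, so a follower is paused quickly when its LSM inverts and un-paused within seconds of recovering. The types and thresholds here (followerPauser, pauseAbove=20, resumeBelow=10) are made up; only the >20 sublevel trigger mirrors the limit used in the experiment above.

```go
// Sketch only: hysteresis around a (hypothetical) frequently refreshed
// IO-overload reading, so a follower is paused quickly when its LSM inverts
// and resumed quickly once it recovers, instead of being ignored for the
// entire gossip window.
package overload

import "time"

type ioOverloadSignal struct {
	l0Sublevels int       // latest observed L0 sublevel count
	updatedAt   time.Time // when the reading was taken
}

type followerPauser struct {
	pauseAbove  int           // e.g. 20 sublevels: start withholding MsgApps
	resumeBelow int           // e.g. 10 sublevels: resume appends
	maxStale    time.Duration // readings older than this are ignored
	paused      bool
}

// shouldPause reports whether appends to the follower should currently be
// withheld, updating the hysteresis state as a side effect.
func (p *followerPauser) shouldPause(sig ioOverloadSignal, now time.Time) bool {
	if now.Sub(sig.updatedAt) > p.maxStale {
		p.paused = false // stale signal: fail open and keep appending
		return false
	}
	switch {
	case p.paused && sig.l0Sublevels < p.resumeBelow:
		p.paused = false
	case !p.paused && sig.l0Sublevels > p.pauseAbove:
		p.paused = true
	}
	return p.paused
}
```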
Is your feature request related to a problem? Please describe.
See #79215. At the time of writing, MsgApp is not subject to admission control. Direct throttling at the receiver would lead to quota pool fullness and thus throttling of the user. While this may be desirable in some settings (e.g. a marginally slower follower that, without such throttling, would fall further and further behind), in practice a big problem is posed by stores that are so slow that they have an unacceptable effect on latency if taken into account for throttling.
Describe the solution you'd like
We should consider such stores as "non-live" for the purposes of replication. Concretely, they ought to be treated like followers in StateProbing (the state surfaced in (*RawNode).Status().Progress[replicaID].State); raft will only append to them ~once per second until they acknowledge. By artificially treating followers as probed as long as their (gossiped, in this experiment) store stats indicate that admission control on them is throttling, we can cheaply emulate the behavior of a "real" flow control mechanism (#79755), which would necessarily have to be more complex (multi-tenancy, priorities, etc).

Essentially, the intuition behind this approach is that a follower that isn't "live" for the purpose of quickly acking MsgApps should be treated as absent. In the experiment we substitute "has no inverted LSM as determined by admission control" for "live", which is a sensible first approximation. A more complete solution (out of scope here) might be based on the store's recent measured write throughput, which leaseholders could use to shape traffic to the node. However, this verges into distributed rate limiting (a slow store will have leaders on many other stores; how do they coordinate who gets to go first?) and is out of scope here. (I'll point out that we already do a form of distributed rate limiting for tenant request units, so some reuse might be possible.)
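For reference, the probing state mentioned above can be read off a leader's raft status roughly as follows. This assumes the etcd-io raft packages (go.etcd.io/etcd/raft/v3 and its tracker subpackage), where the state is exposed as tracker.StateProbe; the helper name is made up for illustration.

```go
// Sketch only: listing the followers that raft is merely probing (at most one
// outstanding MsgApp until they respond). probingFollowers is a made-up helper.
package overload

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// probingFollowers returns the raft IDs of all peers the leader is probing
// rather than replicating to. The idea in this issue is to force overloaded
// followers into (or treat them as if they were in) this state, so that
// MsgApp traffic to them all but stops until they recover.
func probingFollowers(st raft.Status) []uint64 {
	var ids []uint64
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // the leader itself
		}
		if pr.State == tracker.StateProbe {
			ids = append(ids, id)
		}
	}
	return ids
}
```

Note that Progress is only populated on the leader, which is also where any MsgApp-dropping decision would have to be made.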
Describe alternatives you've considered
There are possibly other alternatives, which should be linked into the table in #79215.
Jira issue: CRDB-16308
Epic CRDB-15069