kvserver: permanent starvation of foreground traffic in admission/follower-overload/presplit-with-leases #82713
Comments
cc @cockroachdb/replication
To see how this cluster would fare without the overload, I disabled fio. The snapshots cleared. Looking at the graphs, I also saw that for some reason kv0 (against n1/n2 leases) was doing quite poorly intermittently, and had been doing poorly for some time before I turned off fio on n3. This didn't make sense: progress on kv0 is made via n1 and n2, and those nodes were both "healthy" as far as I could tell. I then found that n2 had experienced a write stall that matched up (the graph is in UTC+2, so it does match up). #82440 might have helped here.

As for root causes of the flush problems, we see that read throughput really picks up (and starves write throughput) during this incident. I suspected a stats job based on a conversation with @nicktrav, but I couldn't line them up. However, it did line up with an earlier blip, so the stats job definitely gets us into trouble, too. What's interesting is that kv0 continued to do poorly; it wasn't an isolated hit; kv0 has been doing poorly ever since, though of course I went ahead and disabled the auto stats job anyway, just in case. When kv0 got slow the next time, I was able to get a statement bundle and see that we had been held up by CPU admission control.

I figured I was chasing diminishing returns at this point. Looking at the disk throughput, all nodes were still maxing out, despite "only" 1mb/s of goodput through the cluster. This time around, though, the LSMs on n1 and n2 were also deteriorating. I'm going to stop looking at this cluster and refocus on the case where only the follower is overloaded. My take-away here is again that running on stock low-throughput gp3 disks is quite brittle. The first course of action when problems occur would always be to bump the provisioned throughput, but this takes a long time to kick in, if it kicks in at all (#82109 (comment)).
@sumeerbhola appreciate your thoughts on this bit:
(See towards the end of the first comment for more details). cc @irfansharif
Describe the problem
I am running `admission/follower-overload/presplit-with-leases`. The main load driver is kv0 at 2mb/s, with leases on n1 and n2. So n3 receives all IO load via raft appends and has no way to push back (see #79215). n3 has fio running, which starves CRDB for disk throughput.
There is a workload with leases only on n3 (kv50 with ~100qps and small writes) to observe the effects on foreground traffic aimed directly at n3. I expect this workload to do very poorly: admission control on n3 is permanently trying to reduce incoming IO, but raft traffic is not within its purview, so the only thing admission control on n3 can push back on is the foreground kv50 workload.
It is doing poorly, but a bit more poorly than I imagined: it hasn't gotten a single request through in hours.
TL;DR: admission control is not letting any of the kv50 writes on n3 through; all goroutines have been stuck for 426 minutes in KV store admission wait. Not much is being added to L0 (though a little bit is), and 1000 HeartbeatTxn/sec are executing. I don't understand why that leads to indefinite starvation, since these requests should be bypassing KV store admission control completely.
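For clarity on what I mean by "bypass", here is a toy sketch of the mechanism as I picture it - this is not the real admission package, and all names are invented: work that bypasses admission never waits (and is never charged), while gated work waits for grants that only materialize when the store is considered healthy enough.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// toyStoreQueue is a deliberately simplified stand-in for store admission:
// gated writes wait for a grant, bypassing work does not.
type toyStoreQueue struct {
	grants chan struct{} // fed by a toy "granter" whenever IO tokens are available
}

// Admit lets bypassing work (txn heartbeats, in this toy model) through
// immediately; everything else waits for a grant or the context deadline.
func (q *toyStoreQueue) Admit(ctx context.Context, bypass bool) error {
	if bypass {
		return nil // no waiting, and no token accounting either
	}
	select {
	case <-q.grants:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	q := &toyStoreQueue{grants: make(chan struct{})} // granter never fires: nothing leaves L0
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// A gated kv50 write is stuck until the deadline here - or, on n3, seemingly forever.
	fmt.Println("kv50 write:", q.Admit(ctx, false))
	// A heartbeat sails through, yet its bytes still land in the memtable and,
	// on flush, in L0, which keeps the granter starved of tokens.
	fmt.Println("heartbeat: ", q.Admit(ctx, true))
}
```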
Longer notes from my investigations follow.
Miraculously, the LSM on n3 isn't falling behind indefinitely, because n1 and n2 are also running at their limit (thanks AWS, apparently 2mb/s of goodput is also too much). To the best of my knowledge, the quota pools are not full.
A snapshot of the dashboard is here: https://snapshots.raintank.io/dashboard/snapshot/vLe0E27ulOyi4r5BWV9LCF5jYeg0SC27
The admission control logs are here. We seem to be in a pattern in which we compact nothing out of L0 for a few 15s periods (as expected, since IO is severely contended), and we basically let nothing in except writes that get to bypass admission control. What's curious is that actual L0 growth is similarly absent - that is, no incoming MsgApp are processed by n3. But the kv0 workload is making progress, so there really ought to be a 2mb/s stream of traffic arriving at n3 via raft.

The explanation for that is in the `raftlog_behind` metric: n1 and n2 report that they have followers that are "way" behind, and these followers are likely just the replicas on n3. So, in some way that I don't understand as I type this, the prototype idea #82132 is playing out "naturally" (the follower isn't dropping any incoming MsgApps; I checked via `raft_rcvd_dropped_bytes`).

So we're not accepting much, since there aren't many compactions. Every now and then L0 grows by ~22mb; this happens every third interval on average, so roughly every ~45s with a little variation (around 0.5mb/s amortized). I first thought these must be spurts of raft appends, but then I went and checked the snapshot graphs: n3 is receiving a steady 2mb/s (more or less) of snapshot traffic, and the raft snapshot queue shows a steady stream of failures. Sure enough, I checked a few ranges, and they all have their n3 follower cut off from the log, which explains why we're not seeing replication traffic to n3 at all.
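As a reminder of why "cut off from the log" means snapshots rather than MsgApps - this is generic raft behavior, sketched below with invented names rather than the actual etcd/raft or kvserver code: once the leader has truncated its log past a follower's next index, log catch-up is impossible and only a snapshot can help.

```go
package raftsketch

// followerProgress is a toy view of the leader's bookkeeping for one follower.
type followerProgress struct {
	Next uint64 // next log index the leader would send to this follower
}

// leaderLog is a toy view of the leader's own log.
type leaderLog struct {
	FirstIndex uint64 // everything below this index has been truncated away
}

// catchUpPlan returns "msgapp" while the needed entries still exist in the
// leader's log, and "snapshot" once truncation has cut the follower off -
// which is the state all of n3's replicas appear to be in here.
func catchUpPlan(p followerProgress, l leaderLog) string {
	if p.Next < l.FirstIndex {
		return "snapshot" // entries [Next, FirstIndex) are gone; log catch-up impossible
	}
	return "msgapp"
}
```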
We aren't managing to get any of these snapshots through, it seems. Maybe some of this is due to delegated snapshots:
This looks as though we are giving ourselves a 1h timeout, but somehow there is an effective 2m30s timeout under the hood. I am also seeing more descriptive failures (these are the only two failures I saw while glancing through the logs, and they occur in roughly equal proportion; possibly the snapshot error on the delegated sender translates to the other error on the "delegator").
E220610 09:32:17.471803 10273322 kv/kvserver/queue.go:1096 ⋮ [n1,raftsnapshot,s1,r109/1:‹/Table/106/1/2{64829…-73991…}›] 10489 ‹rpc error: code = DeadlineExceeded desc = giving up during snapshot reservation due to "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded›1
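I haven't dug into what `kv.snapshot_receiver.reservation_queue_timeout_fraction` actually does; the name suggests something like the pattern below (a guess with invented names, not the real code - and it doesn't obviously explain how a 1h budget turns into 2m30s):

```go
package snapsketch

import (
	"context"
	"time"
)

// reservationCtx sketches one plausible shape for a fraction-based cutoff:
// cap the time spent waiting for a snapshot reservation at a fraction of
// whatever deadline the caller handed us. Purely illustrative.
func reservationCtx(parent context.Context, frac float64) (context.Context, context.CancelFunc) {
	deadline, ok := parent.Deadline()
	if !ok {
		return context.WithCancel(parent) // no deadline to take a fraction of
	}
	budget := time.Until(deadline)
	return context.WithTimeout(parent, time.Duration(float64(budget)*frac))
}
```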
I actually don't think this caused any of the L0 writes (since we would need to actually ingest a snapshot for this to be the case), but now we understand why kv0 isn't adding any writes to L0 on n3 - n1 and n2 basically can't append to any of their n3 followers, so they don't.
So how's kv50 doing? Why isn't it making any progress?
The kv50 workload has only one range, so I took a look at it. I found it looking healthy, but trying to insert into it hung forever (at least a minute; still hanging now). Refreshing the range status a few times, I noticed that MVCCStats remained the same, but the log index grew quickly, roughly matching the "Average Keys Written Per Second" reading of ~900. Digging in with logspy (grep for `r146`) I found that they're all `HeartbeatTxn` requests, which is interesting since we saw these cause problems in another recent outage. Notably, HeartbeatTxn bypasses admission control, and I think this effectively starves the kv50 workload.

Looking at n3's goroutines, we can see this here. This shows the first block (14 goroutines with kv50 read stacks) and the second block (986 kv50 write goroutines) all stuck, and stuck for a very long time (7.5h to date), trying to get IO tokens. Moreover, the 986 matches very well with the ops/sec we've seen on the kv50 range, which were observed to all be HeartbeatTxn (and the workload, despite a rate limit of 100, runs with a concurrency of 1000 to make it more open-loopy).

I thought that explained everything - that somehow the HeartbeatTxn starved these goroutines - but looking at cockroach/pkg/kv/kvserver/store.go, lines 3790 to 3821 (at eeb7236), the bypassing requests don't even go through the StoreWorkQueue. Looking at the granter logs, this is corroborated: we don't see it accept ~1k requests per second, but more like 20. However, the heartbeats do contribute to L0 growth - but only when the memtable is flushed, which maybe, sort of, explains the L0 growth pattern we've been seeing (where nothing happens for ~45s, and then ~20mb get added to L0). As a result, very little in IO tokens is given out (I've seen as low as 9.1kb, but it's usually a couple of MB), and we also attach a high L0 penalty to each admission (like 1-2mb).

I can't quite understand how that leads to permanent starvation, though. Certainly, if some higher-priority operation consumed from the store work queue, I could see how the kv50 writes could be left dangling forever. But in the absence of that (can I check somehow?), shouldn't some upsert ultimately get the go-ahead? That write might and will fail (run into the timestamp cache), sure, but the goroutines above show that nobody ever got to go ahead (986+14=1000, so we're seeing everyone).
To Reproduce
See above. Running the roachtest would hopefully reproduce this.
Expected behavior
Severely degraded performance of the kv50 workload, but not a standstill for 7h+.
Environment:
Add any other context about the problem here.
Jira issue: CRDB-16626
Footnotes
TODO(tbg): file an issue about this. ↩