kvserver: permanent starvation of foreground traffic in admission/follower-overload/presplit-with-leases #82713
Comments
cc @cockroachdb/replication
To see how this cluster would fare without the overload, I disabled fio. The snapshots cleared. Looking at the graphs, I also saw that for some reason kv0 (against n1/n2 leases) was doing quite poorly intermittently, and had been doing poorly for some time before I turned off fio on n3. This didn't make sense: progress on kv0 is made via n1 and n2, and those nodes were both "healthy" as far as I could tell. I then found that n2 had experienced a write stall that matched up (the graph is in UTC+2, so it does match up). #82440 might have helped here.

As for root causes of the flush problems, we see that read throughput really picks up (and starves write throughput) during this incident. I suspected a stats job based on a conversation with @nicktrav, but I couldn't line them up. However, it did line up with an earlier blip, so the stats job definitely gets us into trouble, too. What's interesting is that kv0 continued to do poorly; it wasn't an isolated hit; kv0 has been doing poorly ever since, though of course I went ahead and disabled the auto stats job anyway, just in case. When kv0 got slow the next time, I was able to get a statement bundle and see that we had been held up by CPU admission control.

I figured I was chasing diminishing returns at this point. Looking at the disk throughput, all nodes were still maxing out, despite "only" 1mb/s of goodput through the cluster. This time around, though, the LSMs on n1 and n2 were also deteriorating. I'm going to stop looking at this cluster and refocus on the case where only the follower is overloaded. My take-away here is again that running on stock low-throughput gp3 disks is quite brittle. The first course of action when problems occur would always be to bump the provisioned throughput, but this takes a long time to kick in, if it kicks in at all (#82109 (comment)).
@sumeerbhola appreciate your thoughts on this bit:
(See towards the end of the first comment for more details). cc @irfansharif
Describe the problem
I am running `admission/follower-overload/presplit-with-leases`. The main load driver is kv0 at 2mb/s, with leases on n1 and n2. So n3 receives all IO load via raft appends and has no way to push back (see #79215). n3 has fio running, which starves CRDB for disk throughput.
There is a workload with leases only on n3 (kv50 with ~100qps and small writes) to observe the effects on foreground traffic aimed directly at n3. I expect this workload to do very poorly: admission control on n3 is permanently trying to reduce incoming IO, but raft traffic is not within its purview, so the only thing admission control on n3 can push back on is the foreground kv50 workload.
It is doing poorly, but a bit more poorly than I imagined: it hasn't gotten a single request through in hours.
TL;DR: admission control is not letting any of the kv50 writes on n3 through; all goroutines have been stuck for 426 minutes in KV store admission wait. Not much is being added to L0 (though a little bit is), and 1000 HeartbeatTxn/sec are executing. I don't understand why that leads to indefinite starvation, since these requests should be bypassing KV store admission control completely.
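For clarity on what I mean by "bypass", here is a toy sketch of the mechanism as I picture it - this is not the real admission package, and all names are invented: work that bypasses admission never waits (and is never charged), while gated work waits for grants that only materialize when the store is considered healthy enough.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// toyStoreQueue is a deliberately simplified stand-in for store admission:
// gated writes wait for a grant, bypassing work does not.
type toyStoreQueue struct {
	grants chan struct{} // fed by a toy "granter" whenever IO tokens are available
}

// Admit lets bypassing work (txn heartbeats, in this toy model) through
// immediately; everything else waits for a grant or the context deadline.
func (q *toyStoreQueue) Admit(ctx context.Context, bypass bool) error {
	if bypass {
		return nil // no waiting, and no token accounting either
	}
	select {
	case <-q.grants:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	q := &toyStoreQueue{grants: make(chan struct{})} // granter never fires: nothing leaves L0
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// A gated kv50 write is stuck until the deadline here - or, on n3, seemingly forever.
	fmt.Println("kv50 write:", q.Admit(ctx, false))
	// A heartbeat sails through, yet its bytes still land in the memtable and,
	// on flush, in L0, which keeps the granter starved of tokens.
	fmt.Println("heartbeat: ", q.Admit(ctx, true))
}
```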
Longer notes from my investigations follow.
Miraculously, the LSM on n3 isn't falling behind indefinitely, because n1 and n2 are also running at their limit (thanks AWS, apparently 2mb/s of goodput is also too much). To the best of my knowledge, the quota pools are not full.
A snapshot of the dashboard is here: https://snapshots.raintank.io/dashboard/snapshot/vLe0E27ulOyi4r5BWV9LCF5jYeg0SC27
The admission control logs are here. We seem to be in a pattern in which we compact nothing out of L0 for a few 15s periods (as expected, since IO is severely contended), and we basically let nothing in except writes that get to bypass admission control. What's curious is that actual L0 growth is similarly absent - that is, no incoming MsgApp are processed by n3. But the kv0 workload is making progress, so there really ought to be a 2mb/s stream of traffic arriving at n3 via raft.

The explanation for that is in the `raftlog_behind` metric: n1 and n2 report that they have followers that are "way" behind, and these followers are likely just the replicas on n3. So, in some way that I don't understand as I type this, the prototype idea #82132 is playing out "naturally" (the follower isn't dropping any incoming MsgApps; I checked via `raft_rcvd_dropped_bytes`).

So we're not accepting much, since there aren't many compactions. Every now and then L0 grows by ~22mb; this happens every third interval on average, so roughly every ~45s with a little variation (around 0.5mb/s amortized). I first thought these must be spurts of raft appends, but then I went and checked the snapshot graphs: n3 is receiving a steady 2mb/s (more or less) of snapshot traffic, and the raft snapshot queue shows a steady stream of failures. Sure enough, I checked a few ranges, and they all have their n3 follower cut off from the log, which explains why we're not seeing replication traffic to n3 at all.
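As a reminder of why "cut off from the log" means snapshots rather than MsgApps - this is generic raft behavior, sketched below with invented names rather than the actual etcd/raft or kvserver code: once the leader has truncated its log past a follower's next index, log catch-up is impossible and only a snapshot can help.

```go
package raftsketch

// followerProgress is a toy view of the leader's bookkeeping for one follower.
type followerProgress struct {
	Next uint64 // next log index the leader would send to this follower
}

// leaderLog is a toy view of the leader's own log.
type leaderLog struct {
	FirstIndex uint64 // everything below this index has been truncated away
}

// catchUpPlan returns "msgapp" while the needed entries still exist in the
// leader's log, and "snapshot" once truncation has cut the follower off -
// which is the state all of n3's replicas appear to be in here.
func catchUpPlan(p followerProgress, l leaderLog) string {
	if p.Next < l.FirstIndex {
		return "snapshot" // entries [Next, FirstIndex) are gone; log catch-up impossible
	}
	return "msgapp"
}
```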
We aren't managing to get any of these snapshots through, it seems. Maybe some of this is due to delegated snapshots:
This looks as though we are giving ourselves a 1h timeout, but somehow there is an effective 2m30s timeout under the hood. I am also seeing more descriptive failures (these are the only two failures I saw while glancing through the logs, and they occur in roughly equal proportion; possibly the snapshot error on the delegated sender translates to the other error on the "delegator").
E220610 09:32:17.471803 10273322 kv/kvserver/queue.go:1096 ⋮ [n1,raftsnapshot,s1,r109/1:‹/Table/106/1/2{64829…-73991…}›] 10489 ‹rpc error: code = DeadlineExceeded desc = giving up during snapshot reservation due to "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded›1
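I haven't dug into what `kv.snapshot_receiver.reservation_queue_timeout_fraction` actually does; the name suggests something like the pattern below (a guess with invented names, not the real code - and it doesn't obviously explain how a 1h budget turns into 2m30s):

```go
package snapsketch

import (
	"context"
	"time"
)

// reservationCtx sketches one plausible shape for a fraction-based cutoff:
// cap the time spent waiting for a snapshot reservation at a fraction of
// whatever deadline the caller handed us. Purely illustrative.
func reservationCtx(parent context.Context, frac float64) (context.Context, context.CancelFunc) {
	deadline, ok := parent.Deadline()
	if !ok {
		return context.WithCancel(parent) // no deadline to take a fraction of
	}
	budget := time.Until(deadline)
	return context.WithTimeout(parent, time.Duration(float64(budget)*frac))
}
```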
I actually don't think this caused any of the L0 writes (since we would need to actually ingest a snapshot for this to be the case), but now we understand why kv0 isn't adding any writes to L0 on n3 - n1 and n2 basically can't append to any of their n3 followers, so they don't.
So how's kv50 doing? Why isn't it making any progress?
The kv50 workload has only one range, so I took a look at it. I found it looking healthy, but trying to insert into it hung forever (at least a minute; still hanging now). Refreshing the range status a few times, I noticed that MVCCStats remained the same, but the log index grew quickly, roughly matching the "Average Keys Written Per Second" reading of ~900. Digging in with logspy (grep for `r146`) I found that they're all `HeartbeatTxn` requests, which is interesting since we saw these cause problems in another recent outage. Notably, HeartbeatTxn bypasses admission control, and I think this effectively starves the kv50 workload.

Looking at n3's goroutines, we can see this here. This shows the first block (14 goroutines with kv50 read stacks) and the second block (986 kv50 write goroutines) all stuck, and stuck for a very long time (7.5h to date), trying to get IO tokens. Moreover, the 986 matches very well with the ops/sec we've seen on the kv50 range, which were observed to all be HeartbeatTxn (and the workload, despite a rate limit of 100, runs with a concurrency of 1000 to make it more open-loopy).

I thought that explained everything - that somehow the HeartbeatTxn starved these goroutines - but looking at cockroach/pkg/kv/kvserver/store.go, lines 3790 to 3821 (at eeb7236), the bypassing requests don't even go through the StoreWorkQueue. Looking at the granter logs, this is corroborated: we don't see it accept ~1k requests per second, but more like 20. However, the heartbeats do contribute to L0 growth - but only when the memtable is flushed, which maybe, sort of, explains the L0 growth pattern we've been seeing (where nothing happens for ~45s, and then ~20mb get added to L0). As a result, very little in IO tokens is given out (I've seen as low as 9.1kb, but it's usually a couple of MB), and we also attach a high L0 penalty to each admission (like 1-2mb).

I can't quite understand how that leads to permanent starvation, though. Certainly, if some higher-priority operation consumed from the store work queue, I could see how the kv50 writes could be left dangling forever. But in the absence of that (can I check somehow?), shouldn't some upsert ultimately get the go-ahead? That write might and will fail (run into the timestamp cache), sure, but the goroutines above show that nobody ever got to go ahead (986+14=1000, so we're seeing everyone).
To Reproduce
See above. Running the roachtest would hopefully reproduce this.
Expected behavior
Severely degraded performance of the kv50 workload, but not a standstill for 7h+.
Environment:
Add any other context about the problem here.
Jira issue: CRDB-16626
Footnotes
TODO(tbg): file an issue about this. ↩