kvserver: raft receive queue may OOM under overload #71805
cc @cockroachdb/bulkio
n5 is OOM-killed. Prior to that, it was complaining about several ranges with >10s handle raft readies, and claiming it has some unavailable ranges. Also seeing:
Curious, what's this? Agree that this looks like an instance of #71050 (which was accidentally closed - the backport is still open).
@shermanCRL extended blathers so it can be configured to explicitly @-mention a GitHub team when that team's T- label is applied, so that members of that team receive notifications. |
roachtest.restore2TB/nodes=6/cpus=8/pd-volume=2500GB failed with artifacts on release-21.2 @ 9b06fffa0e5ffed8aa92cd0d860380dafe993bd2:
roachtest.restore2TB/nodes=6/cpus=8/pd-volume=2500GB failed with artifacts on release-21.2 @ a5a88a4db163f0915ae65649aa0264c7a913fbfb:
Doing a deep sleuth on the tsdump.

[graph: disk read MB/s and IOPS]

[graph: n3 memory] Yellow and red are Go alloc and Go total, respectively, so that roughly doubled from 3.5 to 7 GB.

A big part of the problem might be that n3 seems to be ruining its L0: [graph] Other nodes don't show anything like this. There's maybe a transient spike to like 6-9, but n3 is way above that towards the end of the graph. So one way to read this is:

Btw, all of the graphs that I've looked at indicate there isn't a load imbalance in this cluster. CPU, RAM, number of replicas and leaseholders, etc. are all pretty balanced before things go sideways on n3.
Concretely, n3 is slow. It's receiving lots of AddSSTable log entries from other nodes (addressing lots of different replicas on n3). They all get put into the scheduler here: `cockroach/pkg/kv/kvserver/store_raft.go`, lines 148 to 151 @ 98a66b5.
Note that there's no backpressure. Even if the scheduler isn't making any progress, we're going to stuff the request into a slice: `cockroach/pkg/kv/kvserver/store_raft.go`, lines 182 to 185 @ 98a66b5.
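To make the pattern concrete, here is a minimal sketch of that enqueue path, using simplified stand-in names (`raftReceiveQueue`, `raftRequestInfo`) rather than the actual `store_raft.go` code:

```go
package example

import "sync"

// raftRequestInfo stands in for an incoming Raft message, e.g. an
// AddSSTable log entry that can be ~10 MB on its own.
type raftRequestInfo struct {
	payload []byte
}

// raftReceiveQueue is a simplified per-replica receive queue.
type raftReceiveQueue struct {
	mu    sync.Mutex
	infos []raftRequestInfo
}

// Append stuffs the request into the slice regardless of whether the
// scheduler is keeping up; nothing bounds the aggregate byte size, so a
// slow node simply keeps accumulating entries in memory.
func (q *raftReceiveQueue) Append(req raftRequestInfo) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.infos = append(q.infos, req)
}
```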
Now we have a dozen or so scheduler threads. If Pebble is slow, they're going to spend "lots of time" handling each request, so really we're just consuming from a firehose and pulling into memory until we explode. The quota pool isn't really going to help here. For one, the quota pool is per range, so we're still going to accumulate up to num_ranges*quota_pool_size in memory, which can easily be too much. Second, the quota pool isn't always active for all followers, for example one that is catching up. So there isn't good protection here.

We need to talk about backpressuring the incoming raft message stream. If the in-flight scheduler staged request size is too large, we need to stop consuming the firehose. The question is whether it's better to keep pulling and drop events, or whether we want to stop pulling.
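One way to realize the "keep pulling and drop events" option is to track the staged byte size and refuse messages once a budget is exceeded, relying on Raft's retransmission to re-deliver them later. A rough sketch with hypothetical names and an illustrative 32 MB budget, not the actual CockroachDB implementation:

```go
package example

import "sync"

// maxQueuedBytes is an illustrative per-queue budget.
const maxQueuedBytes = 32 << 20 // 32 MB

// boundedReceiveQueue is a hypothetical receive queue that enforces a byte
// budget on staged messages.
type boundedReceiveQueue struct {
	mu          sync.Mutex
	queuedBytes int64
	infos       [][]byte
}

// TryAppend admits the message only while the staged bytes stay under budget;
// otherwise it reports false and the caller drops the message, counting on
// Raft to retransmit it once the follower catches up.
func (q *boundedReceiveQueue) TryAppend(msg []byte) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queuedBytes+int64(len(msg)) > maxQueuedBytes {
		return false
	}
	q.queuedBytes += int64(len(msg))
	q.infos = append(q.infos, msg)
	return true
}

// Drain hands the staged messages to a scheduler worker and releases their
// byte accounting.
func (q *boundedReceiveQueue) Drain(process func(msg []byte)) {
	q.mu.Lock()
	staged := q.infos
	q.infos, q.queuedBytes = nil, 0
	q.mu.Unlock()
	for _, msg := range staged {
		process(msg)
	}
}
```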
101437: base: reduce `RaftMaxInflightBytes` to 32 MB r=erikgrinaker a=erikgrinaker

This patch reduces the default `RaftMaxInflightBytes` from 256 MB to 32 MB, to reduce the out-of-memory incidence during bulk operations like `RESTORE` on clusters with overloaded disks.

`RaftMaxInflightBytes` specifies the maximum aggregate byte size of Raft log entries that a leader will send to a follower without hearing responses. As such, it also bounds the amount of replication data buffered in memory on the receiver. Individual messages can still exceed this limit (consider the default command size limit at 64 MB).

Normally, `RaftMaxInflightMsgs` * `RaftMaxSizePerMsg` will bound this at 4 MB (128 messages at 32 KB each). However, individual messages are allowed to exceed the 32 KB limit, typically large AddSSTable commands that can be around 10 MB each. To prevent followers running out of memory, we place an additional total byte limit of 32 MB, which is 8 times more than normal.

A survey of CC clusters over the past 30 days showed that, excluding a single outlier cluster, the total outstanding `raft.rcvd.queued_bytes` of any individual node never exceeded 500 MB, and was roughly 0 across all clusters for the majority of time.

Touches #71805. Resolves #100341. Resolves #100804. Resolves #100983. Resolves #101426.

Epic: none

Release note (ops change): the amount of replication traffic in flight from a single Raft leader to a follower has been reduced from 256 MB to 32 MB, in order to reduce the chance of running out of memory during bulk write operations. This can be controlled via the environment variable `COCKROACH_RAFT_MAX_INFLIGHT_BYTES`.

Co-authored-by: Erik Grinaker <[email protected]>
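The arithmetic in that description can be checked directly. The constants below are taken from the PR text; the identifiers only loosely mirror the actual settings:

```go
package main

import "fmt"

func main() {
	const (
		raftMaxInflightMsgs  = 128       // default per the PR text
		raftMaxSizePerMsg    = 32 << 10  // 32 KB per message
		raftMaxInflightBytes = 32 << 20  // 32 MB (reduced from 256 MB)
	)
	// With well-behaved message sizes, in-flight data is capped at
	// 128 * 32 KB = 4 MB; the byte ceiling only matters when oversized
	// entries (e.g. ~10 MB AddSSTables) blow past the per-message limit.
	normalBound := raftMaxInflightMsgs * raftMaxSizePerMsg
	fmt.Printf("normal in-flight bound: %d MB\n", normalBound>>20)          // 4 MB
	fmt.Printf("hard in-flight ceiling: %d MB\n", raftMaxInflightBytes>>20) // 32 MB
	fmt.Printf("ceiling is %dx the normal bound\n",
		raftMaxInflightBytes/normalBound) // 8x, matching "8 times more than normal"
}
```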
roachtest.restore2TB/nodes=6/cpus=8/pd-volume=2500GB failed with artifacts on release-21.2 @ c5a6b266917ee3846dbd7ae1126c6a5d55cf439b:
Reproduce
See: roachtest README
Jira issue: CRDB-10779
Epic CRDB-39898