roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [sst raft oom] #71802
This seems related, but not the same. I downloaded this file and uploaded it to https://share.polarsignals.com/3cfb86e/. The big icicle in the middle is memory allocated when sending SSTables in sideloaded raft proposals to other nodes as part of catching them up on the raft log. There are various limits (per replica) at play here, all of which are bundled up in cockroach/pkg/kv/kvserver/store.go, lines 223 to 240 in 0e0082d.
In particular, each individual append should be limited to 32 KiB (though this is a "target size", i.e. it can be overshot; lines 132 to 135 in 0a040d6), and we allow at most 128 such messages in flight per follower (lines 143 to 147 in 0a040d6).
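To make the overshoot concrete, here is a minimal illustrative sketch (not the actual etcd/raft or CockroachDB code; all names are invented): a batcher that stops once a running total reaches a target still admits the entry that crosses it, so a single 32 MB SST entry sails far past a 32 KiB target.

```go
// Illustrative only: a "target size" batcher includes the entry that
// crosses the threshold, so one huge entry overshoots the target by
// roughly its own size.
package main

import "fmt"

type entry struct {
	payload []byte
}

// batchEntries collects entries until the running total reaches targetBytes.
// The entry that crosses the target is still included, which is why the
// target can be overshot by up to one entry's size.
func batchEntries(entries []entry, targetBytes int) []entry {
	var batch []entry
	var size int
	for _, e := range entries {
		batch = append(batch, e)
		size += len(e.payload)
		if size >= targetBytes {
			break
		}
	}
	return batch
}

func main() {
	// One 32 MB sideloaded SST entry versus a 32 KiB target.
	entries := []entry{{payload: make([]byte, 32<<20)}}
	batch := batchEntries(entries, 32<<10)
	fmt.Printf("batched %d bytes against a %d byte target\n",
		len(batch[0].payload), 32<<10)
}
```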
Naively, you would think this means that the most we can allocate on the leader for each follower is 128 × 32 KiB, i.e. 4 MiB. However, say every entry is a 32 MB SST: then in effect we can get up to 128 × 32 MB, i.e. 4 GB, which can certainly give us the problems we see here. (And don't forget, it could happen for multiple followers too, giving us another factor of the follower count.) The way messages are sent is that they are handed to an "outgoing queue", where they are put on a (large) buffered channel: cockroach/pkg/kv/kvserver/raft_transport.go, lines 566 to 572 in 54e004a,
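As context for why the buffered channel does not help, here is a hedged sketch (invented names, not the raft_transport.go implementation) of a queue bounded by message count rather than by bytes: 128 messages of 32 MB each fit easily into a buffer with thousands of slots, so the channel capacity provides no real memory limit.

```go
// Sketch only, with invented names; not the actual raft_transport.go code.
package outqueue

// raftMessage stands in for a raft message whose payload may embed an
// entire sideloaded SSTable.
type raftMessage struct {
	payload []byte
}

// outgoingQueue bounds the number of queued messages, but not their total
// byte size: 128 queued 32 MB payloads (4 GB) occupy only 128 slots.
type outgoingQueue struct {
	msgs chan raftMessage
}

func newOutgoingQueue(slots int) *outgoingQueue {
	return &outgoingQueue{msgs: make(chan raftMessage, slots)}
}

// enqueue hands the message to the sender goroutine, dropping it if the
// buffer is full. Nothing here accounts for bytes already buffered.
func (q *outgoingQueue) enqueue(m raftMessage) bool {
	select {
	case q.msgs <- m:
		return true
	default:
		return false
	}
}
```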
so it is possible that these 4 GB are in memory all at once. On the other end of this channel we indeed have the problem fixed by #71748: we may hold on to the SSTs for even longer. But even with that PR merged, and even if that PR avoids this particular crash, there is a problem here: pulling this much data into memory is an issue in itself. Ideally we would have a way to bound how much of it is held in memory at once; a sketch of what such a bound could look like follows below.

What I don't understand is why we "suddenly" have test coverage for these issues through backup/restore. @dt, do you have any idea why we're seeing these kinds of issues now? Perhaps some change in how the SSTs used by IMPORT/RESTORE are sized, or in the concurrency with which they are distributed, or some blocking that has been removed? |
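A minimal sketch of the kind of bound alluded to above (hypothetical; not an existing CockroachDB API): a byte budget that the sender acquires before handing an SST-carrying message to the outgoing queue and releases once the message has been written to the wire, capping the bytes held in memory regardless of how large individual entries are.

```go
// Hypothetical sketch of a byte budget for outgoing raft messages; this is
// not an existing CockroachDB API.
package bytebudget

import "sync"

// byteBudget caps the total payload bytes in flight. acquire blocks until
// enough budget is free; release returns budget once a message is sent.
// (A single request larger than the limit would block forever in this
// sketch; a real implementation would need to handle that case.)
type byteBudget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newByteBudget(limit int64) *byteBudget {
	b := &byteBudget{avail: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks the caller until n bytes of budget are available.
func (b *byteBudget) acquire(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.avail < n {
		b.cond.Wait()
	}
	b.avail -= n
}

// release returns n bytes to the budget and wakes any blocked senders.
func (b *byteBudget) release(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.avail += n
	b.cond.Broadcast()
}
```

With something like this, enqueueing a 32 MB sideloaded SST would call acquire(32 << 20) first, so at most `limit` bytes of SST data could sit in the outgoing queue at any one time.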
cc @adityamaru |
Not blocking rc3 any more, since #71748 merged. |
@dt we were just revisiting this and the question came up again of whether something changed on the Bulk I/O side in terms of usage of AddSSTable. Are there any changes you suspect of having changed the access pattern? |
I don't know of anything that changed in IMPORT or RESTORE SST sizes. 21.1 saw a large number of fixes to improve the work distribution during RESTORE, since we previously saw cases where we were blocking on splits or downloading and rewriting SSTables for periods during which we were not sending SSTs to KV, so there was a lot of work done to improve splitting throughput, pipeline downloads with sending, etc., basically all focused on keeping the sending of ingest SSTs better saturated. But most of that happened in 21.1 or early in 21.2, so I don't know if anything changed more recently. |
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ c754d101ccd7541b0f597dac2f37809c1a859bf2:
Same failure on other branches
|
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ 2c4bb88e5318fe187e1bf6cb134b31bd63f63528:
Same failure on other branches
|
Nothing new:
|
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ e9fd200d3567aa542da6cd1f255e4d2971cbdd9e:
Same failure on other branches
|
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ 6133ffd5459ae01d79e3dfd98528e557bb868eca:
Same failure on other branches
|
We have marked this test failure issue as stale because it has been |
Using #80155 as the main tracking issue. |
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ c5a6b266917ee3846dbd7ae1126c6a5d55cf439b:
Reproduce
See: roachtest README
This test on roachdash
Jira issue: CRDB-10778