roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [sst raft oom] #71802

Closed
cockroach-teamcity opened this issue Oct 21, 2021 · 14 comments
Labels
C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-roachtest O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 21, 2021

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ c5a6b266917ee3846dbd7ae1126c6a5d55cf439b:

		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1121
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:955
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 11: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3613572-1634797350-36-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 14592
		1: 14968
		2: 14159
		8: skipped
		12: skipped
		7: 12447
		5: 13234
		6: 13145
		11: dead (exit status 137)
		9: 12402
		10: 12239
		Error: UNCLASSIFIED_PROBLEM: 11: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 11: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-10778

@cockroach-teamcity cockroach-teamcity added branch-release-21.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 21, 2021
@AlexTalks
Contributor

It looks like node 11 (db node 9) ran out of memory while the test was waiting for full replication. @tbg looks like this might be similar to #71050 and #71805?

@AlexTalks AlexTalks added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 21, 2021
@tbg
Member

tbg commented Oct 25, 2021

This seems related but not the same. I downloaded this file:

image

and uploaded it to https://share.polarsignals.com/3cfb86e/ (an online version of go tool pprof):

image

The big icicle in the middle is memory allocated when sending SSTables in sideloaded raft proposals to other nodes as part of catching them up on the raft log. There are various limits (per replica) at play here, all of which are bundled up here:

func newRaftConfig(
	strg raft.Storage, id uint64, appliedIndex uint64, storeCfg StoreConfig, logger raft.Logger,
) *raft.Config {
	return &raft.Config{
		ID:                        id,
		Applied:                   appliedIndex,
		ElectionTick:              storeCfg.RaftElectionTimeoutTicks,
		HeartbeatTick:             storeCfg.RaftHeartbeatIntervalTicks,
		MaxUncommittedEntriesSize: storeCfg.RaftMaxUncommittedEntriesSize,
		MaxCommittedSizePerReady:  storeCfg.RaftMaxCommittedSizePerReady,
		MaxSizePerMsg:             storeCfg.RaftMaxSizePerMsg,
		MaxInflightMsgs:           storeCfg.RaftMaxInflightMsgs,
		Storage:                   strg,
		Logger:                    logger,
		PreVote:                   true,
	}
}

In particular, each individual append should be limited to 32KiB (though this is a "target size", i.e. can be overshot):

	// defaultRaftMaxSizePerMsg specifies the maximum aggregate byte size of Raft
	// log entries that a leader will send to followers in a single MsgApp.
	defaultRaftMaxSizePerMsg = envutil.EnvOrDefaultInt(
		"COCKROACH_RAFT_MAX_SIZE_PER_MSG", 32<<10 /* 32 KB */)

and we are sending at most

	// defaultRaftMaxInflightMsgs specifies how many "inflight" MsgApps a leader
	// will send to a given follower without hearing a response.
	defaultRaftMaxInflightMsgs = envutil.EnvOrDefaultInt(
		"COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 128)
)
to any given follower.
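
To make the "target size" caveat concrete, here is a minimal sketch of how a size-limited batch can still overshoot. It assumes the batching always keeps at least one entry per message, whatever that entry's size. The `entry` type and `limitSize` helper are illustrative stand-ins, not the actual raft or CockroachDB code.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a raft log entry; only its payload
// size matters for this sketch.
type entry struct {
	index uint64
	data  []byte
}

// limitSize returns the longest prefix of ents whose total payload size stays
// within maxSize, but it always returns at least one entry when ents is
// non-empty. That "at least one" rule is what lets a large entry overshoot.
func limitSize(ents []entry, maxSize int) []entry {
	if len(ents) == 0 {
		return ents
	}
	size := len(ents[0].data)
	n := 1
	for ; n < len(ents); n++ {
		size += len(ents[n].data)
		if size > maxSize {
			break
		}
	}
	return ents[:n]
}

func main() {
	const maxSizePerMsg = 32 << 10 // 32 KiB, the default quoted above

	// Four 10 KiB entries: only the first three fit under the target.
	small := []entry{
		{1, make([]byte, 10<<10)},
		{2, make([]byte, 10<<10)},
		{3, make([]byte, 10<<10)},
		{4, make([]byte, 10<<10)},
	}
	fmt.Println(len(limitSize(small, maxSizePerMsg)))

	// Two 32 MiB entries (an assumed SST size): the first one is sent alone,
	// even though it is far larger than the target.
	big := []entry{{1, make([]byte, 32<<20)}, {2, make([]byte, 32<<20)}}
	batch := limitSize(big, maxSizePerMsg)
	fmt.Println(len(batch), len(batch[0].data))
}
```

This prints `3` and then `1 33554432`: small entries are batched up to the target, while a single oversized entry goes out in a message of its own size.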

Naively, you would think that this means that the most we can allocate on the leader for each follower is 128 × 32 KiB, i.e. 4 MiB. However, let's say that all entries are 32 MB SSTs: then in effect we can get up to 128 × 32 MB, i.e. 4 GB, which can surely give us the problems we see here. (And don't forget, it could happen for multiple followers too, giving us another factor of replication factor minus one.)
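
The arithmetic can be checked with a short sketch. The two defaults come from the snippets quoted above; the 32 MiB entry size and the replication factor of 3 are assumptions for illustration, and `worstCaseBytes` is a hypothetical helper.

```go
package main

import "fmt"

// worstCaseBytes estimates how many bytes a leader may pin for in-flight
// appends: `inflight` messages per follower, each carrying msgBytes, for
// `followers` followers.
func worstCaseBytes(inflight, msgBytes, followers int64) int64 {
	return inflight * msgBytes * followers
}

func main() {
	const (
		maxInflightMsgs = 128      // COCKROACH_RAFT_MAX_INFLIGHT_MSGS default
		maxSizePerMsg   = 32 << 10 // COCKROACH_RAFT_MAX_SIZE_PER_MSG default (32 KiB)
		sstSize         = 32 << 20 // assumed: every entry is a 32 MiB SST
	)

	// Naive bound, in MiB, when every message respects the target size.
	fmt.Println(worstCaseBytes(maxInflightMsgs, maxSizePerMsg, 1) >> 20)
	// Bound in GiB when every message carries one full SST.
	fmt.Println(worstCaseBytes(maxInflightMsgs, sstSize, 1) >> 30)
	// Same, with two followers (assumed replication factor of 3).
	fmt.Println(worstCaseBytes(maxInflightMsgs, sstSize, 2) >> 30)
}
```

This prints `4`, `4`, and `8`: 4 MiB in the naive case, 4 GiB per follower with SST-sized entries, and 8 GiB across two followers.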

The way messages are sent is that they are handed to an "outgoing queue" where they are put on a (large) buffered channel:

	select {
	case ch <- req:
		l := int32(len(ch))
		if v := atomic.LoadInt32(&stats.queueMax); v < l {
			atomic.CompareAndSwapInt32(&stats.queueMax, v, l)
		}
		return true

so it is possible that these 4GB are in memory all at once.
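
As an illustration of why a buffered channel keeps payloads alive, here is a minimal, self-contained sketch. The `request` type, the `enqueue` and `queuedBytes` helpers, the buffer size of 128, and the drop-on-full `default` branch are all assumptions for the example, not the transport code quoted above.

```go
package main

import "fmt"

// request is a hypothetical stand-in for an outgoing raft message.
type request struct {
	payload []byte
}

// enqueue tries to hand req to the outgoing queue without blocking.
// It reports false when the channel buffer is full (assumed behavior).
func enqueue(ch chan<- *request, req *request) bool {
	select {
	case ch <- req:
		return true
	default:
		return false
	}
}

// queuedBytes drains ch and sums the payload bytes the buffer was holding.
func queuedBytes(ch chan *request) int {
	total := 0
	for {
		select {
		case req := <-ch:
			total += len(req.payload)
		default:
			return total
		}
	}
}

func main() {
	ch := make(chan *request, 128) // buffer size chosen for illustration
	accepted := 0
	for i := 0; i < 200; i++ {
		if enqueue(ch, &request{payload: make([]byte, 64<<10)}) {
			accepted++
		}
	}
	// Every accepted request stays reachable (and so in memory) until a
	// consumer receives it from the channel.
	fmt.Println(accepted, queuedBytes(ch)>>10)
}
```

This prints `128 8192`: with no consumer running, the queue holds references to all 128 accepted payloads (8192 KiB here), so the memory bound is the buffer length times the largest payload.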

On the other end of this channel we indeed have the problem fixed by #71748: we may hold on to the SSTs for even longer. But even with that PR merged, and even if that PR avoids this crash, there is a problem here: pulling that much data into memory is in itself an issue.

Ideally we would have a quotapool.IntPool (or BytesAllocator) that is consulted whenever raft entries are pulled into memory, and that would bound the memory consumption. In other words, memory accounting (#19721) for an important KV subsystem.
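
A rough sketch of the idea, using only the standard library. `byteBudget` is a hypothetical type written for this example; it is not `quotapool.IntPool` or `BytesAllocator`, and the 64 MiB cap and 32 MiB entry size are arbitrary.

```go
package main

import (
	"fmt"
	"sync"
)

// byteBudget is a hypothetical blocking byte quota (not quotapool.IntPool).
type byteBudget struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64
	capacity  int64
}

func newByteBudget(capacity int64) *byteBudget {
	b := &byteBudget{available: capacity, capacity: capacity}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks until n bytes are available and returns the amount taken.
// Requests larger than the whole budget are clamped to the capacity so that
// a single oversized entry can still make progress.
func (b *byteBudget) acquire(n int64) int64 {
	if n > b.capacity {
		n = b.capacity
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.available < n {
		b.cond.Wait()
	}
	b.available -= n
	return n
}

// release returns n bytes to the budget and wakes any waiters.
func (b *byteBudget) release(n int64) {
	b.mu.Lock()
	b.available += n
	b.mu.Unlock()
	b.cond.Broadcast()
}

func main() {
	const limit = 64 << 20 // 64 MiB cap, chosen for illustration
	budget := newByteBudget(limit)

	var mu sync.Mutex
	var inUse, peak int64
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			got := budget.acquire(32 << 20) // pretend to load a 32 MiB SST
			mu.Lock()
			inUse += got
			if inUse > peak {
				peak = inUse
			}
			mu.Unlock()
			// ... the entry would be sent to the follower here ...
			mu.Lock()
			inUse -= got
			mu.Unlock()
			budget.release(got)
		}()
	}
	wg.Wait()
	fmt.Println(peak <= limit)
}
```

This prints `true`: however many goroutines try to load entries at once, the bytes accounted as in use never exceed the cap, because loaders block in `acquire` until earlier entries are released.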

What I don't understand is why we "suddenly" have test coverage for these issues through backup/restore. @dt do you have any ideas why we're seeing these kinds of issues now? Perhaps some change in how the SSTs used by IMPORT/RESTORE are sized, or in the concurrency with which they are distributed, or any blocking that has been removed?

@tbg
Member

tbg commented Oct 25, 2021

cc @adityamaru

@tbg
Member

tbg commented Oct 26, 2021

Not blocking rc3 any more, since #71748 merged.

@tbg
Member

tbg commented Nov 30, 2021

@dt we were just revisiting this and the question came up again of whether something changed on the Bulk I/O side in terms of usage of AddSSTable. Are there any changes you suspect of having changed the access pattern?

@dt
Member

dt commented Nov 30, 2021

I don't know of anything that changed in IMPORT or RESTORE SST sizes. 21.1 saw a large number of fixes to improve the work distribution during RESTORE, since we previously saw cases where we were blocking on splits or on downloading and rewriting SSTables, for periods during which we were not sending SSTs to KV. So there was a lot of work done to improve splitting throughput, pipeline downloads with sending, etc., basically all focused on keeping the sending of ingest SSTs better saturated. But most of that happened in 21.1 or early in 21.2, so I don't know if anything changed more recently.

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ c754d101ccd7541b0f597dac2f37809c1a859bf2:

		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1121
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:955
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 7: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1296,context.go:89,cluster.go:1284,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3904114-1639466601-37-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 1: 13960
		3: 12469
		4: skipped
		8: skipped
		2: 12845
		11: 11430
		12: skipped
		5: 12307
		7: dead (exit status 137)
		6: 11689
		9: 11518
		10: 12381
		Error: UNCLASSIFIED_PROBLEM: 7: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/roachprod.Monitor
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:596
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:569
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:123
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1170
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 7: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ 2c4bb88e5318fe187e1bf6cb134b31bd63f63528:

		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1121
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:955
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 5: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1296,context.go:89,cluster.go:1284,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3999353-1640675754-37-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 13603
		1: 15021
		3: 13418
		8: skipped
		6: 12396
		7: 14902
		10: 12433
		12: skipped
		11: 12198
		5: dead (exit status 137)
		9: 12572
		Error: UNCLASSIFIED_PROBLEM: 5: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/roachprod.Monitor
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:596
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:569
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:123
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1170
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 5: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Jan 3, 2022

Nothing new:

[ 1579.784106] Memory cgroup out of memory: Killed process 14028 (cockroach) total-vm:17849028kB, anon-rss:14540364kB, file-rss:86428kB, shmem-rss:0kB, UID:1000 pgtables:32792kB oom_score_adj:0

https://share.polarsignals.com/71b25ff/

image

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ e9fd200d3567aa542da6cd1f255e4d2971cbdd9e:

		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1121
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:955
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 6: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1296,context.go:89,cluster.go:1284,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-4263813-1643699670-35-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 13067
		3: 13047
		1: 13428
		8: skipped
		10: 11567
		12: skipped
		6: dead (exit status 137)
		5: 13005
		7: 12913
		11: 11446
		9: 11618
		Error: UNCLASSIFIED_PROBLEM: 6: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/roachprod.Monitor
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:596
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:569
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:123
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1170
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 6: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@tbg tbg added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Feb 1, 2022
@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ 6133ffd5459ae01d79e3dfd98528e557bb868eca:

		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1121
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:955
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1296,context.go:89,cluster.go:1284,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-4407736-1645168481-35-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 1: 15782
		3: 13744
		4: skipped
		2: dead (exit status 137)
		5: 12195
		6: 12365
		7: 12979
		8: skipped
		10: 12214
		9: 13015
		12: skipped
		11: 12275
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/roachprod.Monitor
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:596
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:569
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:123
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1170
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 2: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@AlexTalks AlexTalks added the T-kv KV Team label Feb 18, 2022
@tbg tbg changed the title roachtest: tpccbench/nodes=9/cpu=4/multi-region failed roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [sst raft oom] Mar 23, 2022
@github-actions

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

@tbg
Member

tbg commented Apr 27, 2022

Using #80155 as the main tracking issue.
