roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #75071

Closed
cockroach-teamcity opened this issue Jan 18, 2022 · 7 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Member

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 365b4da8bd02c06ee59d2130a56dec74ffc9ce21:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	monitor.go:127,tpcc.go:1074,tpcc.go:908,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1074
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:908
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 2: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 18, 2022
@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	monitor.go:127,tpcc.go:1074,tpcc.go:908,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1074
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:908
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 6: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

@tbg
Member

tbg commented Jan 20, 2022

Looking at the second failure.

[ 2831.089902] Memory cgroup out of memory: Killed process 12658 (cockroach) total-vm:24752312kB, anon-rss:13682724kB, file-rss:23196kB, shmem-rss:0kB, UID:1000 pgtables:43188kB oom_score_adj:0

The latest heap profile isn't that useful. It shows around 2GB allocated. Heap profiles regularly undercount, but the RSS at the time of the OOM kill was around 13GB, so it's likely that the profile just didn't catch whatever was consuming all this memory.

The process was killed at

cockroach exited with code 137: Wed Jan 19 14:27:13 UTC 2022

The profile is from 13_56_18, so a solid 30 minutes prior.

Looking at the graphs for n6, we see memory use grow linearly until OOM, starting sometime after 14:10. It's not surprising that things would change around that time, because that's when we hit the cluster with this:

14:08:28 cluster.go:2075: > ./cockroach workload run tpcc --warehouses=5000 --workers=5000 --max-rate=613 --wait=false --ramp=5m0s --duration=15m0s --scatter --tolerate-errors {pgurl:1-6}

This isn't supposed to kill nodes; it's the "warmup" load that we apply during rebalancing. That memory usage is definitely not intentional, though. We'll need to get a heap dump at the right time to figure this out, I think. Will look at the first occurrence next.

image

@tbg
Member

tbg commented Jan 20, 2022

First occurrence:

Same thing, a node (n2) OOMs during warm-up:

15:38:07 cluster.go:2075: > ./cockroach workload run tpcc --warehouses=5000 --workers=5000 --max-rate=613 --wait=false --ramp=5m0s --duration=15m0s --scatter --tolerate-errors {pgurl:1-6}
15:51:18 test_impl.go:323: test failure: monitor.go:127,tpcc.go:1074,tpcc.go:908,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)

https://share.polarsignals.com/822d7d3 (15_26_53)
The crash was at 15:51:18, so the profile again predates it by tens of minutes. The profile looks identical and innocuous.

I think we should run these tests with COCKROACH_MEMPROF_INTERVAL=15s, so that we'll get a heap dump every 15s. This should be plenty to detect what's causing the linear memory growth. Will send a PR.

tbg added a commit to tbg/cockroach that referenced this issue Jan 20, 2022
Can hopefully help understand
cockroachdb#75071.

Release note: None
@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ dc07599dc9db1acd5afa3a6537297815f25c1fca:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	monitor.go:127,tpcc.go:1074,tpcc.go:908,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1074
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:908
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 6: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

@tbg
Member

tbg commented Jan 24, 2022

#75204 is now merging, so future failures (perhaps not the next one, since it may not have picked up the PR yet) should have plenty of heap profiles.

craig bot pushed a commit that referenced this issue Jan 24, 2022
75056: sql: avoid CREATE INDEX failure on retry r=postamar a=stevendanna

Previously, if a transaction including a CREATE INDEX statement that
used expressions in the list of included columns encountered a
TransactionRetryWithProtoRefreshError, the retry would fail with an
error such as

```
(42703) column "crdb_internal_idx_expr_6" does not exist
column_resolver.go:196: in NewUndefinedColumnError()
```

This was the result of makeIndexDescriptor substituting the
expressions with the names of the newly added columns on the
CreateIndexNode itself. When the transaction is retried, the
generated column names do not yet exist.

Here, we resolve this issue by only modifying a copy of the
IndexElemList when generating an index descriptor.

Release note (bug fix): CREATE INDEX statements using expressions
previously failed in some cases if they encountered an internal retry.

75204: roachtest: heap profile tpccbench every 15s r=erikgrinaker a=tbg

Can hopefully help understand
#75071.

Release note: None


Co-authored-by: Steven Danna
Co-authored-by: Tobias Grieger
gtr pushed a commit to gtr/cockroach that referenced this issue Jan 24, 2022
Can hopefully help understand
cockroachdb#75071.

Release note: None
@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ c4c5ca2fdd5a641433a85a28d4dfd3bd4443015d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	monitor.go:127,tpcc.go:1077,tpcc.go:911,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1077
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:911
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 1: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

@tbg tbg added S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting S-1 High impact: many users impacted, serious risk of high unavailability or data loss and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Feb 1, 2022
@tbg
Member

tbg commented Feb 2, 2022

The node died at 16:36:06. We have lots of heap dumps; the most recent is from 16_35_51: https://share.polarsignals.com/fc00722

This looks fine honestly:

image

The memory blowup, at least as captured in the timeseries, isn't abrupt, but there is some goroutine blowup:

image

There is a goroutine dump. It's partial (the node must have died while writing it), but after some searching through less interesting things, it shows 531 goroutines belonging to replica checksum computations.

Fixed by #75448.

@tbg tbg closed this as completed Feb 2, 2022
@tbg tbg removed the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Feb 2, 2022