roachtest: import/tpch/nodes=8 failed #93894

Closed
cockroach-teamcity opened this issue Dec 19, 2022 · 4 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Dec 19, 2022

roachtest.import/tpch/nodes=8 failed with artifacts on release-22.1 @ d84033d9ee0dc4f35901bcc2e2311b6288e84569:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1
	monitor.go:127,import.go:312,test_runner.go:883: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 7)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCH.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:312
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 6: dead (exit status 7)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

Jira issue: CRDB-22588

Epic CRDB-20293

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 19, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.1 milestone Dec 19, 2022
@msbutler
Collaborator

Node 6 died from a disk stall. Handing over to storage for further investigation. From the unredacted cockroach.log on node 6:

F221219 06:22:11.020890 88 storage/pebble.go:915 ⋮ [n6] 225  disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/COCKROACHDB_REGISTRY_000001› in 21.50 seconds

@msbutler msbutler added T-storage Storage Team and removed T-disaster-recovery labels Dec 19, 2022
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Dec 19, 2022
@nicktrav
Collaborator

This one looks like an infra flake. Closing.

@nicktrav nicktrav added the X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue label Dec 20, 2022
@exalate-issue-sync exalate-issue-sync bot added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue and removed X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue labels Dec 20, 2022
@msbutler
Collaborator

In the future, what's the best way to identify disk stalls that are infra flakes? Or should we always hand these to storage?

@nicktrav
Collaborator

The log line you already posted is generally a decent indication of something wonky at the device level.

disk stall detected: pebble unable to write to ...

The Pebble logs should show the full picture. The signal is strongest when the stall happens across multiple files and ultimately times out (the limit is currently 20s), which causes the process to panic.

In such cases there's not much we can do.
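
For anyone triaging a similar failure, here is a rough sketch of how the check described above could be scripted. It is not an official tool: the regular expression is modeled only on the single log line quoted in this thread, the function names `find_stalls` and `summarize` are made up for illustration, and the 20-second default is taken from the limit mentioned above, which may differ on other versions.

```python
import re

# Pattern modeled on the one log line quoted in this thread. Other
# CockroachDB versions may word the message differently (assumption).
# The optional ‹ › characters are the redaction markers seen in that line.
STALL_RE = re.compile(
    r"disk stall detected: pebble unable to write to "
    r"‹?(?P<path>[^›]+?)›? in (?P<secs>\d+(?:\.\d+)?) seconds"
)


def find_stalls(lines):
    """Return a (path, seconds) tuple for every disk-stall message found."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            stalls.append((match.group("path"), float(match.group("secs"))))
    return stalls


def summarize(stalls, limit_secs=20.0):
    """Summarize the two signals mentioned in the comment above.

    limit_secs defaults to the 20s limit quoted in this thread.
    This only reports the signals; it does not decide whether the
    failure is an infra flake.
    """
    files = sorted({path for path, _ in stalls})
    return {
        "files": files,
        "multiple_files": len(files) > 1,
        "exceeded_limit": any(secs >= limit_secs for _, secs in stalls),
    }


if __name__ == "__main__":
    sample = [
        "F221219 06:22:11.020890 88 storage/pebble.go:915 ⋮ [n6] 225  "
        "disk stall detected: pebble unable to write to "
        "‹/mnt/data1/cockroach/COCKROACHDB_REGISTRY_000001› in 21.50 seconds",
    ]
    print(summarize(find_stalls(sample)))
```

In practice the input would be the lines of the dead node's cockroach.log and Pebble logs from the test artifacts. A stall that shows up on several files and exceeds the limit points toward the device rather than the product, but the summary is only a starting point for a human read of the logs.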
