roachtest: import/tpcc/warehouses=4000/geo failed (job session ID missing) #85310

Closed
cockroach-teamcity opened this issue Jul 29, 2022 · 9 comments
Labels
A-disaster-recovery A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 29, 2022

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ 1129fbc650fe3a037b03aea1e5f1d8078618cb1c:

		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (2) output in run_102950.598422633_n1_cockroach_workload_fixtures_import_tpcc
		Wraps: (3) ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081' returned
		  | stderr:
		  | I220729 10:29:52.480591 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
		  | I220729 10:29:54.287002 102 ccl/workloadccl/fixture.go:481  [-] 2  imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 1.543459645s, 0.13 MiB/s)
		  | I220729 10:29:54.287352 103 ccl/workloadccl/fixture.go:481  [-] 3  imported 3.9 MiB in district table (40000 rows, 0 index entries, took 1.543742349s, 2.55 MiB/s)
		  | I220729 10:29:54.504497 108 ccl/workloadccl/fixture.go:481  [-] 4  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 1.760760909s, 4.47 MiB/s)
		  | I220729 10:31:45.316682 107 ccl/workloadccl/fixture.go:481  [-] 5  imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 1m52.572980871s, 4.85 MiB/s)
		  | I220729 10:41:09.656506 105 ccl/workloadccl/fixture.go:481  [-] 6  imported 8.6 GiB in history table (120000000 rows, 0 index entries, took 11m16.912815734s, 13.02 MiB/s)
		  | I220729 10:42:34.548314 106 ccl/workloadccl/fixture.go:481  [-] 7  imported 6.5 GiB in order table (120000000 rows, 120000000 index entries, took 12m41.804603025s, 8.76 MiB/s)
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | UNCLASSIFIED_PROBLEM: context canceled
		  | (1) UNCLASSIFIED_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
		  |   | ``````
		  | Wraps: (3) context canceled
		  | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,import.go:154,import.go:181,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 6: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-18177

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 29, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 29, 2022
@adityamaru
Contributor

node 6 was OOM killed according to 6.dmesg.txt
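
For reference, the `exit status 137` in the monitor failure follows the usual shell convention of 128 + signal number, so 137 means signal 9 (SIGKILL), which is consistent with an OOM kill. Below is a minimal sketch of both checks, assuming the dmesg artifact is plain kernel log text. The function names and the matched phrases are illustrative assumptions and not part of roachtest:

```python
import re
import signal
from typing import List


def decode_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (values above 128 mean 'killed by signal')."""
    if status > 128:
        sig = status - 128
        try:
            name = signal.Signals(sig).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {sig}"
        return f"killed by {name}"
    return f"exited with code {status}"


# Phrases the Linux kernel typically logs around an OOM kill (assumed, not exhaustive).
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE
)


def oom_lines(dmesg_text: str) -> List[str]:
    """Return the lines of a dmesg dump that look like OOM-killer activity."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]
```

Running `oom_lines` over the contents of each `*.dmesg.txt` artifact and getting a non-empty result for node 6 only would match what is described above.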

@adityamaru
Contributor

[Screenshot: Screen Shot 2022-07-29 at 10 45 04 AM]

This looks like #73376. cc: @tbg in case the artifacts help further the investigation.

@adityamaru adityamaru added the A-kv Anything in KV that doesn't belong in a more specific category. label Aug 3, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 3, 2022
@irfansharif irfansharif changed the title roachtest: import/tpcc/warehouses=4000/geo failed roachtest: import/tpcc/warehouses=4000/geo failed [raft sideload oom] Aug 18, 2022
@erikgrinaker
Contributor

Removing the release-blocker label here, since this is a known issue that pre-dates 22.1.

@erikgrinaker erikgrinaker added T-kv-replication and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 29, 2022
@blathers-crl

blathers-crl bot commented Aug 29, 2022

cc @cockroachdb/replication

@erikgrinaker erikgrinaker removed the T-kv KV Team label Aug 29, 2022
@cockroach-teamcity
Member Author

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ a82711442c65cf14489c55041b45b11a1e38415b:

		Wraps: (2) output in run_100123.166589701_n1_cockroach_workload_fixtures_import_tpcc
		Wraps: (3) ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081' returned
		  | stderr:
		  | I220909 10:01:25.224016 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
		  | I220909 10:01:32.075472 79 ccl/workloadccl/fixture.go:481  [-] 2  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 6.147239254s, 1.28 MiB/s)
		  | I220909 10:01:32.449515 74 ccl/workloadccl/fixture.go:481  [-] 3  imported 3.9 MiB in district table (40000 rows, 0 index entries, took 6.521467844s, 0.60 MiB/s)
		  | I220909 10:01:33.339839 73 ccl/workloadccl/fixture.go:481  [-] 4  imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 7.411826759s, 0.03 MiB/s)
		  | I220909 10:02:29.302066 78 ccl/workloadccl/fixture.go:481  [-] 5  imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 1m3.37385474s, 8.62 MiB/s)
		  | Error: importing fixture: importing table history: pq: job 795106626557018113: could not mark as reverting: job 795106626557018113: with status running: expected session "aef9d1829fda40ec8aed76104bc9c51d" but found NULL
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,import.go:154,import.go:181,test_runner.go:906: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:906
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6340
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:233
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@msbutler msbutler added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 9, 2022
@msbutler
Collaborator

msbutler commented Sep 9, 2022

This seems to be a different failure mode than a raft sideload OOM. It's really unfortunate that TC linked this new failure mode to the raft failure issue (I'll follow up with test eng on this). What I see so far:

  • the top line error msg is
    importing fixture: importing table history: pq: job 795106626557018113: could not mark as reverting: job 795106626557018113: with status running: expected session "aef9d1829fda40ec8aed76104bc9c51d" but found NULL
  • No OOM lines are present in *.dmesg.txt
  • Mysteriously, this job ID does not show up in crdb_internal.jobs.txt, but does show up in system.jobs.txt

I'll let current L2 further investigate.
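
For anyone repeating these checks on the artifacts, here is a minimal sketch of pulling the job ID and session out of the top-line error and listing which dump files mention the job. It assumes the artifact files have already been read into memory as text. The helper names and the regular expression are illustrative and are not part of CockroachDB or roachtest:

```python
import re
from typing import Dict, List, Optional

# Matches the tail of the error above: job ID, job status, expected session, found value.
SESSION_ERR = re.compile(
    r'job (?P<job_id>\d+): with status (?P<status>\w+): '
    r'expected session "(?P<expected>[0-9a-f]+)" but found (?P<found>\S+)'
)


def parse_session_mismatch(msg: str) -> Optional[Dict[str, str]]:
    """Extract the fields of a 'expected session ... but found ...' error, if present."""
    m = SESSION_ERR.search(msg)
    return m.groupdict() if m else None


def files_mentioning(job_id: str, files: Dict[str, str]) -> List[str]:
    """Return the names of the files whose text contains the job ID (plain substring match)."""
    return sorted(name for name, text in files.items() if job_id in text)
```

With the error above, `parse_session_mismatch` yields job ID `795106626557018113`, status `running`, and found value `NULL`. Passing that ID to `files_mentioning` along with the contents of `crdb_internal.jobs.txt` and `system.jobs.txt` would reproduce the third observation.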

@tbg
Member

tbg commented Sep 9, 2022

@tbg
Member

tbg commented Sep 9, 2022

Btw, another way to avoid roachtest reuse of this issue is to remove the O-roachtest label (but of course that is a lie: this issue did originate with roachtest).

@dt dt removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 19, 2022
@dt
Member

dt commented Sep 19, 2022

If this is now tracking the most recent posted failure on it, the "job ID is missing" one, then I'm removing "release-blocker" from this since that smells like some jobs vs testing flake and we haven't seen it again.

@dt dt changed the title roachtest: import/tpcc/warehouses=4000/geo failed [raft sideload oom] roachtest: import/tpcc/warehouses=4000/geo failed (job ID missing) Sep 20, 2022
@dt dt changed the title roachtest: import/tpcc/warehouses=4000/geo failed (job ID missing) roachtest: import/tpcc/warehouses=4000/geo failed (job session ID missing) Sep 20, 2022
@dt dt closed this as completed Sep 20, 2022

7 participants