roachtest: import/tpcc/warehouses=4000/geo failed #81186

Closed
cockroach-teamcity opened this issue May 11, 2022 · 4 comments
Labels: A-disaster-recovery, branch-master, C-test-failure, O-roachtest, O-robot, T-disaster-recovery, T-kv

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 11, 2022

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ 7f3c06f5f2c26bc84705430a3622f92ec1444e9d:

		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_091325.711673564_n1_cockroach_workload_fixtures_import_tpcc
		Wraps: (3) ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081' returned
		  | stderr:
		  | I220511 09:13:27.504376 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
		  | I220511 09:13:32.432927 58 ccl/workloadccl/fixture.go:483  [-] 2  imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 4.667557324s, 0.04 MiB/s)
		  | I220511 09:13:32.455464 64 ccl/workloadccl/fixture.go:483  [-] 3  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 4.689862997s, 1.68 MiB/s)
		  | I220511 09:13:32.662187 59 ccl/workloadccl/fixture.go:483  [-] 4  imported 3.9 MiB in district table (40000 rows, 0 index entries, took 4.896772658s, 0.80 MiB/s)
		  | I220511 09:14:27.137575 63 ccl/workloadccl/fixture.go:483  [-] 5  imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 59.371961149s, 9.20 MiB/s)
		  | I220511 09:23:38.276887 61 ccl/workloadccl/fixture.go:483  [-] 6  imported 8.6 GiB in history table (120000000 rows, 0 index entries, took 10m10.511380183s, 14.43 MiB/s)
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | UNCLASSIFIED_PROBLEM: context canceled
		  | (1) UNCLASSIFIED_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
		  |   | ``````
		  | Wraps: (3) context canceled
		  | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,import.go:154,import.go:181,test_runner.go:876: monitor failure: monitor task failed: read tcp 172.17.0.3:32966 -> 35.247.99.179:26257: read: connection reset by peer
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor task failed
		Wraps: (5) read tcp 172.17.0.3:32966 -> 35.247.99.179:26257
		Wraps: (6) read
		Wraps: (7) connection reset by peer
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *net.OpError (6) *os.SyscallError (7) syscall.Errno
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-15309

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels May 11, 2022
@stevendanna
Collaborator

stevendanna commented May 17, 2022

The proximate cause appears to be that the health-checker process started by the test failed. The health check failed with:

health.log
232:health: 09:47:31 restore.go:96: health check terminated with read tcp 172.17.0.3:32966->35.247.99.179:26257: read: connection reset by peer

35.247.99.179 is node 7. The cockroach logs for node 7 end at 09:41:12.511308

However, from the journalctl logs, we can see that oom-killer was invoked at 09:47:32:

May 11 09:47:32 teamcity-5143117-1652246337-52-n8cpu16-geo-0007 kernel: sshd invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=1, oom_score_adj=-1000

and it resulted in cockroach being killed:

May 11 09:47:33 teamcity-5143117-1652246337-52-n8cpu16-geo-0007 kernel: Out of memory: Killed process 11426 (cockroach) total-vm:20827596kB, anon-rss:13917476kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:38136kB oom_score_adj:0
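To put the `anon-rss` figure in more readable units, here is a small sketch that pulls the PID, process name, and anonymous RSS out of the kernel line above. The regular expression and the names `parseOOMKill` and `sampleOOMLine` are assumptions made for illustration, and the sketch assumes the kernel's `kB` means KiB (1024 bytes).

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sampleOOMLine is the kernel message quoted above, verbatim.
const sampleOOMLine = `May 11 09:47:33 teamcity-5143117-1652246337-52-n8cpu16-geo-0007 kernel: Out of memory: Killed process 11426 (cockroach) total-vm:20827596kB, anon-rss:13917476kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:38136kB oom_score_adj:0`

// oomRE captures the PID, the process name, and the anon-rss value in kB.
var oomRE = regexp.MustCompile(`Killed process (\d+) \(([^)]+)\).*?anon-rss:(\d+)kB`)

// oomKill holds the fields extracted from one "Killed process" line.
type oomKill struct {
	PID       int
	Name      string
	AnonRSSKB int64
}

// parseOOMKill returns the parsed fields, or false if the line does not match.
func parseOOMKill(line string) (oomKill, bool) {
	m := oomRE.FindStringSubmatch(line)
	if m == nil {
		return oomKill{}, false
	}
	pid, err := strconv.Atoi(m[1])
	if err != nil {
		return oomKill{}, false
	}
	rss, err := strconv.ParseInt(m[3], 10, 64)
	if err != nil {
		return oomKill{}, false
	}
	return oomKill{PID: pid, Name: m[2], AnonRSSKB: rss}, true
}

// AnonRSSGiB converts the kB figure to GiB, treating kB as 1024 bytes.
func (k oomKill) AnonRSSGiB() float64 {
	return float64(k.AnonRSSKB) / (1024 * 1024)
}

func main() {
	if k, ok := parseOOMKill(sampleOOMLine); ok {
		fmt.Printf("pid=%d name=%s anon-rss=%.1f GiB\n", k.PID, k.Name, k.AnonRSSGiB())
	}
}
```

Under that assumption the cockroach process held roughly 13.3 GiB of anonymous memory when it was killed.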

From the tsdump we see a rise in memory usage and a sharp increase in disk IO right before this.

Screenshot 2022-05-17 at 10 14 04

Screenshot 2022-05-17 at 10 14 13

The last available memory profile looks to me like it may be similar to the profiles seen in the various issues linked to #73376.

Screenshot 2022-05-17 at 10 23 26

Other Notes

On node 7, in the 20 minutes leading up to the OOM, I see a long stream of messages like the following:

I220511 09:20:55.772177 27497 kv/kvserver/pkg/kv/kvserver/store_snapshot.go:731 ⋮ [n7,s7] 1852  cannot accept snapshot: ‹snapshot intersects existing range; initiated GC:› [n7,s7,r658/3:‹/Table/109/1/{1738/"…-2000/"…}›] (incoming ‹/Table/109/1/{1938/"|\v\x970\xc1\xf8D\x00\x80\x00\x00\x00\x03w=\xaf"-2000/"\x80\x00\x00\x00\x00\x00@\x00\x80\x00\x00\x00\x03\x93\x87\x00"}›)

We also see a good number of slow AddSSTable RPCs across all nodes:

2358:W220511 09:37:15.829128 2956 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n2,f‹fa3ddd07›,job=760840218481524737] 2246  slow range RPC: have been waiting 962.70s (1 attempts) for RPC AddSSTable [‹/Table/113/1/2550/85041/0›,‹/Table/113/1/2551/66500/0/NULL›) to r751:‹/Table/113/1/{2550/85041-3000/1}› [(n8,s8):1LEARNER, (n6,s6):2, (n2,s2):3, (n7,s7):5, next=6, gen=57, sticky=1652261467.101102481,0]; resp: ‹(err: <nil>), *roachpb.AddSSTableResponse›

stevendanna removed the release-blocker label May 17, 2022
@stevendanna
Collaborator

stevendanna commented May 17, 2022

I'm going to remove the release-blocker label here. I've assigned kv in case there is anything useful here to their ongoing work on this, but I imagine we can just close this one since we have a few issues related to this already.

@mwang1026

@tbg this looks like what you've been investigating. Mind making a call on whether we keep this open?

@tbg
Member

tbg commented May 23, 2022

We can close this. This is likely a follower-writes issue, which we are tracking in #79215.

@tbg tbg closed this as completed May 23, 2022