
roachtest: import/tpch/nodes=8 failed #90021

Closed
cockroach-teamcity opened this issue Oct 15, 2022 · 10 comments
Labels: branch-release-22.2, C-test-failure, O-roachtest, O-robot, T-disaster-recovery
Milestone: 22.2

@cockroach-teamcity (Member) commented Oct 15, 2022

roachtest.import/tpch/nodes=8 failed with artifacts on release-22.2 @ cffe9bc440988894abe9a598ea6b2f15e1b7df93:

test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1
	monitor.go:127,import.go:313,test_runner.go:930: monitor failure: monitor command failure: unexpected node event: 7: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCH.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:313
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 7: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

Jira issue: CRDB-20549

cockroach-teamcity added the branch-release-22.2, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Oct 15, 2022
cockroach-teamcity added this to the 22.2 milestone on Oct 15, 2022
@stevendanna (Collaborator)

OOM on node 7:

[ 6807.137756] oom_reaper: reaped process 13287 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@stevendanna (Collaborator)

This node was also seeing very slow AddSSTable requests:

W221015 07:06:21.416362 224936 kv/kvclient/kvcoord/dist_sender.go:1627 ⋮ [n7,f‹1e503701›,job=805244066514862081] 7763  slow range RPC: have been waiting 200.27s (1 attempts) for RPC AddSSTable [/Table/106/3/‹2551627›/‹437052644›/‹6›/‹0›,/Table/106/3/‹2624592›/‹434035300›/‹5›/‹0›/‹NULL›) to r290:‹/Table/106/3/2{551627/95460224/6-624594/4588165/2}› [(n5,s5):4, (n7,s7):2, (n6,s6):3, next=5, gen=33, sticky=1665813266.348297089,0]; resp: ‹(err: <nil>), *roachpb.AddSSTableResponse›

@stevendanna (Collaborator) commented Oct 18, 2022

Here is what I have so far.

Node 7 was killed by the OOM-killer at 7:13:

cockroach exited with code 137: Sat Oct 15 07:13:47 UTC 2022
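As a quick sanity check on that exit code (this is the general Unix convention, not anything CockroachDB-specific): a shell reports 128 plus the signal number when a process dies from a signal, so 137 corresponds to signal 9, SIGKILL, which is what the OOM killer sends. A throwaway Go snippet to confirm the arithmetic:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	const exitCode = 137
	// Shell convention: 128 + signal number when the process was signal-killed.
	sig := syscall.Signal(exitCode - 128)
	fmt.Printf("exit %d => signal %d (%v)\n", exitCode, int(sig), sig)
	// Prints: exit 137 => signal 9 (killed)
}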

The memory profile closest to the failure shows only memory related to AddSSTable requests:

[screenshot: heap profile dominated by AddSSTable-related allocations]

But this only accounts for about 3GB of the 16GB on that node.

Right before the failure, we see the number of replica leases on n7 increase. These appear to be load-based transfers from n1 and n5:

[screenshot: replica lease counts per node, with n7 increasing]

We can see that average queries per second goes up on n1 and n5:

[screenshot: average queries per node, elevated on n1 and n5]

and the transfers all seem to have log entries that look like:

I221015 07:09:59.171561 251 13@kv/kvserver/store_rebalancer.go:464 ⋮ [n1,s1,store-rebalancer] 7142  transferring lease for r267 (qps=1.05) to store s7 (qps=51.56) from local store s1 (qps=241.86)
I221015 07:09:59.171592 251 13@kv/kvserver/replicate_queue.go:1862 ⋮ [n1,s1,store-rebalancer] 7143  transferring lease to s7

Throughout the import, we see very slow AddSSTable requests, with a large amount of time being spent waiting on the concurrent sstable limiter (purple line with the highest delay is n7).

[screenshot: AddSSTable delay by node, with n7 (purple) highest]

As to why we see a spike in queries per second on these two nodes: after a period of inactivity, we start sending AddSSTables again (note that the time of interest here is before 7:13, when the node died; I haven't yet looked into the spike after the node died). Nodes 1 and 5 see both a higher number of AddSSTable requests and more bytes ingested, according to Pebble.

[screenshot: AddSSTable request counts and Pebble bytes ingested per node]

Given the relatively small number of requests being made, it isn't clear whether that imbalance is just chance or something more fundamental.

It also isn't clear to me yet why AddSSTable requests are so slow here. It is possible we are just artificially slow because of the concurrent request limiter.
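For intuition on that last point, here is a minimal Go sketch (not the actual CockroachDB limiter; the concurrency limit of 1 and the sleep are assumptions for illustration) showing how time spent waiting for a slot in a concurrency limiter gets billed to a request's end-to-end latency, which is what the slow range RPC warnings above measure:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const limit = 1                     // assumed per-store concurrency limit
	slots := make(chan struct{}, limit) // counting semaphore standing in for the limiter

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			start := time.Now()
			slots <- struct{}{} // block here until a slot frees up
			waited := time.Since(start)

			time.Sleep(200 * time.Millisecond) // stand-in for the actual ingestion work
			<-slots                            // release the slot

			fmt.Printf("request %d: waited %v in the limiter, %v end to end\n",
				id, waited.Round(time.Millisecond), time.Since(start).Round(time.Millisecond))
		}(i)
	}
	wg.Wait()
}

With a limit of 1, the last request spends roughly three times as long waiting as working; with minutes-long AddSSTable evaluations, the same queueing effect could plausibly account for much of the 200s wait in the warning above.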

@stevendanna (Collaborator)

@erikgrinaker I wonder if you (or someone on KV) might provide a second set of eyes here (happy to look into this synchronously with someone if they have the time). While there is a lot of poor behaviour here, it isn't clear to me that this needs to be a release blocker.

@kvoli (Collaborator) commented Oct 18, 2022

This looks as though the AddSSTable requests are causing thrashing due to the period between their ingestion being large.

We added a multiplier for AddSST requests in terms of their QPS: #76252

Since this QPS increase is not sustained (it is just one big hit to QPS each time an AddSST request comes in), it seems to cause lease shedding from the light green/blue store when its QPS spikes, then movement back in between requests.
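To make that concrete, a toy Go sketch of the dynamic (the per-AddSSTable cost, threshold, and tick structure are all made up for illustration and are not the real multiplier or allocator code): measured QPS spikes when a batch of SSTs lands and collapses in between, so a rebalancer comparing instantaneous QPS against a fixed threshold flips between shedding leases and pulling them back:

package main

import "fmt"

func main() {
	const (
		addSSTCost        = 50.0  // assumed QPS cost charged per AddSSTable request
		transferThreshold = 100.0 // assumed QPS above which the store sheds leases
	)

	for tick := 0; tick < 12; tick++ {
		qps := 1.0 // quiet background traffic
		if tick%4 == 0 {
			qps += 3 * addSSTCost // a batch of three SSTs lands on this tick
		}
		if qps > transferThreshold {
			fmt.Printf("tick %2d: qps=%6.1f -> shed leases\n", tick, qps)
		} else {
			fmt.Printf("tick %2d: qps=%6.1f -> looks idle, pull leases back\n", tick, qps)
		}
	}
}

Smoothing the QPS signal over a longer window (or a longer rebalancing interval, as suggested in the next comment) keeps the spikes from crossing the threshold in the first place.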

@erikgrinaker (Contributor) commented Oct 18, 2022

Yeah, that checks out. We could consider increasing kv.allocator.load_based_rebalancing_interval to smooth out the QPS spikes a bit before moving leases around.
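A minimal sketch of applying that suggestion from a Go client via database/sql, assuming a locally reachable insecure node and an illustrative 2m value (only the setting name comes from the sentence above; the connection string and interval are assumptions, not a tuning recommendation):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumption: any Postgres-wire driver works; lib/pq used here for brevity
)

func main() {
	// Assumed connection string for an insecure local node; adjust for the real cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lengthen the interval between load-based rebalancing decisions so that
	// short AddSSTable-driven QPS spikes are less likely to trigger lease transfers.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.allocator.load_based_rebalancing_interval = '2m'",
	); err != nil {
		log.Fatal(err)
	}
}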

Most of the memory usage in that profile seemed to be the SST generation on the import client side, rather than the SST ingestion itself. We could consider tweaking the client-side settings to reduce the size of built SSTs too.

That said, we have seen OOM situations when ingesting large SSTs into overloaded nodes, since we don't have any memory budgeting for the Raft receive queue (#73376, #71805). It seems plausible that that's what happened here, possibly as a consequence of the lease transfers, without it being reflected in the memory profile (it may have been taken too early).

I don't think this necessarily needs to be a release blocker, unless we see repeat events, since the Raft SST OOMs are a known problem that exists in previous versions as well.

@stevendanna (Collaborator)

Thanks for taking a look.

> This looks as though the AddSSTable requests are causing thrashing due to the period between their ingestion being large.

👍. That is definitely consistent with everything I've seen. It is a bit of a bummer that n7 was selected as the target for the transfer, since it seems to have already been in some distress.

> Most of the memory usage in that profile seemed to be the SST generation on the import client side, rather than the SST ingestion itself. We could consider tweaking the client-side settings to reduce the size of built SSTs too.

Yeah, I have a feeling the profile was just taken a bit too early. The profile looks consistent with the SST construction's memory monitoring. The last profile we have is from 07:13:07, which looks about 30-40 seconds too early.

stevendanna removed the release-blocker label on Oct 18, 2022
@irfansharif (Contributor)

Nice sleuthing @kvoli.

> Yeah, I have a feeling the profile was just taken a bit too early. The profile looks consistent with the SST construction's memory monitoring. The last profile we have is from 07:13:07, which looks about 30-40 seconds too early.

How would we improve profile capture here to make this less speculative/more responsive? If you file an issue this'll get done.
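One possible shape for that, as a rough sketch rather than anything cockroach's heap profiler actually does (the names, thresholds, and one-second interval are all assumptions): sample heap usage frequently and write a profile the moment usage jumps past a new high-water mark, so there is always a profile from close to when memory started climbing rather than only from a fixed schedule:

package main

import (
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchHeap dumps a heap profile whenever HeapAlloc exceeds both a floor and
// the previous high-water mark. ReadMemStats briefly stops the world, so a
// real implementation would likely sample less aggressively.
func watchHeap(dir string, floor uint64) {
	var highWater uint64
	var ms runtime.MemStats
	for range time.Tick(time.Second) {
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc < floor || ms.HeapAlloc <= highWater {
			continue
		}
		highWater = ms.HeapAlloc
		f, err := os.Create(fmt.Sprintf("%s/heap_%d.pprof", dir, time.Now().Unix()))
		if err != nil {
			log.Printf("heap watcher: %v", err)
			continue
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Printf("heap watcher: %v", err)
		}
		f.Close()
	}
}

func main() {
	go watchHeap(os.TempDir(), 4<<30) // start dumping once the heap passes ~4 GiB
	select {}                         // stand-in for the lifetime of the real server process
}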

@cockroach-teamcity (Member, Author)

roachtest.import/tpch/nodes=8 failed with artifacts on release-22.2 @ 00ed5143845ec05797d16e6ab61d179cf51775f2:

test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1
(test_impl.go:286).Fatal: monitor failure: monitor task failed: dial tcp 34.23.159.25:26257: connect: connection timed out

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

@stevendanna (Collaborator)

Closing, as it looks like we got to the cause in the initial investigation and the follow-up failure is now too old to investigate.
