
roachtest: import/tpch/nodes=8 failed #90021

Closed
cockroach-teamcity opened this issue Oct 15, 2022 · 10 comments
Labels: branch-release-22.2, C-test-failure, O-roachtest, O-robot, T-disaster-recovery
Milestone: 22.2

@cockroach-teamcity (Member) commented Oct 15, 2022

roachtest.import/tpch/nodes=8 failed with artifacts on release-22.2 @ cffe9bc440988894abe9a598ea6b2f15e1b7df93:

test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1
	monitor.go:127,import.go:313,test_runner.go:930: monitor failure: monitor command failure: unexpected node event: 7: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCH.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:313
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 7: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

Jira issue: CRDB-20549

cockroach-teamcity added the branch-release-22.2, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Oct 15, 2022
cockroach-teamcity added this to the 22.2 milestone on Oct 15, 2022
@stevendanna (Collaborator)

OOM on node 7:

[ 6807.137756] oom_reaper: reaped process 13287 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@stevendanna (Collaborator)

This node was also seeing very slow AddSSTable requests:

W221015 07:06:21.416362 224936 kv/kvclient/kvcoord/dist_sender.go:1627 ⋮ [n7,f‹1e503701›,job=805244066514862081] 7763  slow range RPC: have been waiting 200.27s (1 attempts) for RPC AddSSTable [/Table/106/3/‹2551627›/‹437052644›/‹6›/‹0›,/Table/106/3/‹2624592›/‹434035300›/‹5›/‹0›/‹NULL›) to r290:‹/Table/106/3/2{551627/95460224/6-624594/4588165/2}› [(n5,s5):4, (n7,s7):2, (n6,s6):3, next=5, gen=33, sticky=1665813266.348297089,0]; resp: ‹(err: <nil>), *roachpb.AddSSTableResponse›

@stevendanna (Collaborator) commented Oct 18, 2022

Here is what I have so far.

Node 7 was killed by the OOM-killer at 7:13:

cockroach exited with code 137: Sat Oct 15 07:13:47 UTC 2022
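As a quick sanity check on that exit code (this is the general Unix convention, not anything CockroachDB-specific): a shell reports 128 plus the signal number when a process dies from a signal, so 137 corresponds to signal 9, SIGKILL, which is what the OOM killer sends. A throwaway Go snippet to confirm the arithmetic:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	const exitCode = 137
	// Shell convention: 128 + signal number when the process was signal-killed.
	sig := syscall.Signal(exitCode - 128)
	fmt.Printf("exit %d => signal %d (%v)\n", exitCode, int(sig), sig)
	// Prints: exit 137 => signal 9 (killed)
}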

The memory profile closest to the failure shows only memory related to AddSSTable requests:

[screenshot: heap profile dominated by AddSSTable-related allocations]

But this only accounts for about 3GB of the 16GB on that node.

Right before the failure, we see the number of replica leases on n7 increase. These appear to be load-based transfers from n1 and n5:

[screenshot: replica lease counts per node, with n7 increasing]

We can see that average queries per second goes up on n1 and n5:

[screenshot: average queries per node, elevated on n1 and n5]

and the transfers all seem to have log entries that look like:

I221015 07:09:59.171561 251 13@kv/kvserver/store_rebalancer.go:464 ⋮ [n1,s1,store-rebalancer] 7142  transferring lease for r267 (qps=1.05) to store s7 (qps=51.56) from local store s1 (qps=241.86)
I221015 07:09:59.171592 251 13@kv/kvserver/replicate_queue.go:1862 ⋮ [n1,s1,store-rebalancer] 7143  transferring lease to s7

Throughout the import, we see very slow AddSSTable requests, with a large amount of time being spent waiting on the concurrent sstable limiter (purple line with the highest delay is n7).

[screenshot: AddSSTable delay by node, with n7 (purple) highest]

As to why we see a spike in queries per second on these two nodes: after a period of inactivity, we start sending AddSSTables again (note that the time of interest here is before 7:13, when the node died; I haven't yet looked into the spike after the node died). Nodes 1 and 5 see both a higher number of AddSSTable requests and more bytes ingested, according to Pebble.

[screenshot: AddSSTable request counts and Pebble bytes ingested per node]

Given the relatively small number of requests being made, it isn't clear whether that imbalance is just chance or something more fundamental.

It also isn't clear to me yet why AddSSTable requests are so slow here. It is possible we are just artificially slow because of the concurrent request limiter.
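For intuition on that last point, here is a minimal Go sketch (not the actual CockroachDB limiter; the concurrency limit of 1 and the sleep are assumptions for illustration) showing how time spent waiting for a slot in a concurrency limiter gets billed to a request's end-to-end latency, which is what the slow range RPC warnings above measure:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const limit = 1                     // assumed per-store concurrency limit
	slots := make(chan struct{}, limit) // counting semaphore standing in for the limiter

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			start := time.Now()
			slots <- struct{}{} // block here until a slot frees up
			waited := time.Since(start)

			time.Sleep(200 * time.Millisecond) // stand-in for the actual ingestion work
			<-slots                            // release the slot

			fmt.Printf("request %d: waited %v in the limiter, %v end to end\n",
				id, waited.Round(time.Millisecond), time.Since(start).Round(time.Millisecond))
		}(i)
	}
	wg.Wait()
}

With a limit of 1, the last request spends roughly three times as long waiting as working; with minutes-long AddSSTable evaluations, the same queueing effect could plausibly account for much of the 200s wait in the warning above.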

@stevendanna (Collaborator)

@erikgrinaker I wonder if you (or someone on KV) might provide a second set of eyes here (happy to look into this synchronously with someone if they have the time). While there is a lot of poor behaviour here, it isn't clear to me that this needs to be a release blocker.

@kvoli (Collaborator) commented Oct 18, 2022

This looks as though the AddSSTable requests are causing thrashing due to the period between their ingestion being large.

We added a multiplier for AddSST requests in terms of their QPS: #76252

Since this QPS increase is not sustained (it is just one big hit to QPS each time an AddSST request comes in), it seems to cause lease shedding from the light green/blue store when its QPS spikes, then movement back in between requests.
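To make that concrete, a toy Go sketch of the dynamic (the per-AddSSTable cost, threshold, and tick structure are all made up for illustration and are not the real multiplier or allocator code): measured QPS spikes when a batch of SSTs lands and collapses in between, so a rebalancer comparing instantaneous QPS against a fixed threshold flips between shedding leases and pulling them back:

package main

import "fmt"

func main() {
	const (
		addSSTCost        = 50.0  // assumed QPS cost charged per AddSSTable request
		transferThreshold = 100.0 // assumed QPS above which the store sheds leases
	)

	for tick := 0; tick < 12; tick++ {
		qps := 1.0 // quiet background traffic
		if tick%4 == 0 {
			qps += 3 * addSSTCost // a batch of three SSTs lands on this tick
		}
		if qps > transferThreshold {
			fmt.Printf("tick %2d: qps=%6.1f -> shed leases\n", tick, qps)
		} else {
			fmt.Printf("tick %2d: qps=%6.1f -> looks idle, pull leases back\n", tick, qps)
		}
	}
}

Smoothing the QPS signal over a longer window (or a longer rebalancing interval, as suggested in the next comment) keeps the spikes from crossing the threshold in the first place.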

@erikgrinaker (Contributor) commented Oct 18, 2022

Yeah, that checks out. We could consider increasing kv.allocator.load_based_rebalancing_interval to smooth out the QPS spikes a bit before moving leases around.
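A minimal sketch of applying that suggestion from a Go client via database/sql, assuming a locally reachable insecure node and an illustrative 2m value (only the setting name comes from the sentence above; the connection string and interval are assumptions, not a tuning recommendation):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumption: any Postgres-wire driver works; lib/pq used here for brevity
)

func main() {
	// Assumed connection string for an insecure local node; adjust for the real cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lengthen the interval between load-based rebalancing decisions so that
	// short AddSSTable-driven QPS spikes are less likely to trigger lease transfers.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.allocator.load_based_rebalancing_interval = '2m'",
	); err != nil {
		log.Fatal(err)
	}
}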

Most of the memory usage in that profile seemed to be the SST generation on the import client side, rather than the SST ingestion itself. We could consider tweaking the client-side settings to reduce the size of built SSTs too.

That said, we have seen OOM situations when ingesting large SSTs into overloaded nodes, since we don't have any memory budgeting for the Raft receive queue (#73376, #71805). It seems plausible that that's what happened here, possibly as a consequence of the lease transfers, without it being reflected in the memory profile (it may have been taken too early).

I don't think this necessarily needs to be a release blocker, unless we see repeat events, since the Raft SST OOMs are a known problem that exists in previous versions as well.

@stevendanna (Collaborator)

Thanks for taking a look.

> This looks as though the AddSSTable requests are causing thrashing due to the period between their ingestion being large.

👍. That is definitely consistent with everything I've seen. It is a bit of a bummer that n7 was selected as the target for the transfer, since it seems to have already been in some distress.

> Most of the memory usage in that profile seemed to be the SST generation on the import client side, rather than the SST ingestion itself. We could consider tweaking the client-side settings to reduce the size of built SSTs too.

Yeah, I have a feeling the profile was just taken a bit too early. The profile looks consistent with the SST construction's memory monitoring. The last profile we have is from 07:13:07, which looks about 30-40 seconds too early.

stevendanna removed the release-blocker label on Oct 18, 2022
@irfansharif (Contributor)

Nice sleuthing @kvoli.

> Yeah, I have a feeling the profile was just taken a bit too early. The profile looks consistent with the SST construction's memory monitoring. The last profile we have is from 07:13:07, which looks about 30-40 seconds too early.

How would we improve profile capture here to make this less speculative/more responsive? If you file an issue this'll get done.
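One possible shape for that, as a rough sketch rather than anything cockroach's heap profiler actually does (the names, thresholds, and one-second interval are all assumptions): sample heap usage frequently and write a profile the moment usage jumps past a new high-water mark, so there is always a profile from close to when memory started climbing rather than only from a fixed schedule:

package main

import (
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchHeap dumps a heap profile whenever HeapAlloc exceeds both a floor and
// the previous high-water mark. ReadMemStats briefly stops the world, so a
// real implementation would likely sample less aggressively.
func watchHeap(dir string, floor uint64) {
	var highWater uint64
	var ms runtime.MemStats
	for range time.Tick(time.Second) {
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc < floor || ms.HeapAlloc <= highWater {
			continue
		}
		highWater = ms.HeapAlloc
		f, err := os.Create(fmt.Sprintf("%s/heap_%d.pprof", dir, time.Now().Unix()))
		if err != nil {
			log.Printf("heap watcher: %v", err)
			continue
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Printf("heap watcher: %v", err)
		}
		f.Close()
	}
}

func main() {
	go watchHeap(os.TempDir(), 4<<30) // start dumping once the heap passes ~4 GiB
	select {}                         // stand-in for the lifetime of the real server process
}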

@cockroach-teamcity (Member, Author)

roachtest.import/tpch/nodes=8 failed with artifacts on release-22.2 @ 00ed5143845ec05797d16e6ab61d179cf51775f2:

test artifacts and logs in: /artifacts/import/tpch/nodes=8/run_1
(test_impl.go:286).Fatal: monitor failure: monitor task failed: dial tcp 34.23.159.25:26257: connect: connection timed out

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

@stevendanna (Collaborator)

Closing, as it looks like we got to the cause in the initial investigation and the follow-up failure is now too old to investigate.
