roachtest: import/tpch/nodes=8 failed #90021
OOM on node 7:
This node was also seeing very slow AddSSTable requests:
Here is what I have so far. Node 7 was killed by the OOM-killer at 7:13.

The most proximate memory profile only shows memory related to AddSSTable requests, but that only accounts for about 3GB of the 16GB on that node. Right before the failure, we see the number of replica leases on n7 increase. These appear to be load-based transfers from n1 and n5: we can see that Average Queries goes up on n1 and n5, and the transfers all seem to have similar log entries.
Throughout the import, we see very slow AddSSTable requests, with a large amount of time being spent waiting on the concurrent sstable limiter (the purple line with the highest delay is n7).

As to why we see a spike in queries per second on these two nodes: after a period of inactivity, we start sending AddSSTables again (note that the time of interest here is before 7:13, when the node died; I haven't yet looked into the spike after the node died). Nodes 1 and 5 see both a higher number of AddSSTable requests and more bytes ingested according to Pebble. Given the relatively small number of requests being made, it isn't clear whether that imbalance is just chance or something more fundamental. It also isn't clear to me yet why AddSSTable requests are so slow here. It is possible we are just artificially slow because of the concurrent request limiter.
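For context on the limiter wait, here is a minimal sketch of how a concurrency cap turns queueing into observed AddSSTable latency. This is illustrative only, not the actual kvserver limiter; the limit of 2 and the 500ms of "ingestion" are made-up values.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative sketch only: a per-store cap on concurrent AddSSTable work,
// modeled as a buffered channel used as a semaphore. The real limiter is more
// sophisticated; the limit of 2 and the 500ms sleep are assumed example values.
var addSSTableSlots = make(chan struct{}, 2)

func addSSTable(id int) {
	start := time.Now()
	addSSTableSlots <- struct{}{} // wait for a slot; this wait is part of the observed request latency
	defer func() { <-addSSTableSlots }()

	waited := time.Since(start)
	time.Sleep(500 * time.Millisecond) // stand-in for the actual ingestion work
	fmt.Printf("req %d: waited %v on the limiter, total %v\n", id, waited, time.Since(start))
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func(i int) { defer wg.Done(); addSSTable(i) }(i)
	}
	wg.Wait()
}
```

With 6 concurrent requests and 2 slots, the later requests report large waits even though they do no work while queued, which is the same shape as the delays in the charts.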
@erikgrinaker I wonder if you (or someone on KV) might provide a second set of eyes here (happy to look into this synchronously with someone if they have the time). While there is a lot of poor behaviour here, it isn't clear to me that this needs to be a release blocker.
It looks as though the AddSSTable requests are causing thrashing because the period between their ingestions is large. We added a multiplier for AddSST requests in terms of their QPS: #76252. Since this QPS increase is not sustained (it is just one big hit to QPS each time an AddSST request comes in), it seems to cause lease shedding from light green/blue when its QPS spikes, then movement back in between requests.
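Roughly, the dynamic looks like this sketch. The multiplier, decay rate, and background load here are assumptions for illustration, not the actual values from #76252 or the allocator.

```go
package main

import "fmt"

// Illustrative only: an exponentially decaying per-replica QPS signal where a
// single AddSSTable request is counted with a large multiplier. The numbers
// below are assumed example values, not CockroachDB's real constants.
const (
	addSSTQPSMultiplier = 50.0 // assumed example multiplier
	decayPerTick        = 0.8  // assumed example decay between measurements
)

func main() {
	qps := 0.0
	for sec := 0; sec < 10; sec++ {
		qps *= decayPerTick // signal decays while no AddSSTables arrive
		if sec == 2 {
			qps += addSSTQPSMultiplier // one AddSSTable lands: a big, unsustained spike
		}
		qps += 1.0 // ~1 QPS of background load
		fmt.Printf("t=%ds qps=%.1f\n", sec, qps)
		// A load-based rebalancer sampling this signal right after t=2s sees a
		// hot leaseholder and sheds the lease; a few seconds later the signal
		// has decayed and the lease can move back, i.e. thrashing.
	}
}
```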
Yeah, that checks out. We could consider increasing it.

Most of the memory usage in that profile seemed to be the SST generation on the import client side, rather than the SST ingestion itself. We could consider tweaking the client-side settings to reduce the size of built SSTs too.

That said, we have seen OOM situations when ingesting large SSTs into overloaded nodes, since we don't have any memory budgeting for the Raft receive queue (#73376, #71805). It seems plausible that that's what happened here, possibly as a consequence of the lease transfers, without it being reflected in the memory profile (it may have been taken too early).

I don't think this necessarily needs to be a release blocker, unless we see repeat events, since the Raft SST OOMs are a known problem that exists in previous versions as well.
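For illustration, a memory budget on the receive queue might look something like the hypothetical sketch below. The names, budget size, and drop policy are assumptions, not the actual CockroachDB code (which did not have such a budget at the time of #73376/#71805).

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sketch of memory budgeting on a Raft receive queue. All names
// and values here are illustrative assumptions.
type raftReceiveQueue struct {
	msgs        [][]byte
	queuedBytes int64
	maxBytes    int64
}

var errQueueFull = errors.New("raft receive queue memory budget exceeded")

func (q *raftReceiveQueue) enqueue(msg []byte) error {
	if q.queuedBytes+int64(len(msg)) > q.maxBytes {
		// Without a check like this, large SST-carrying entries arriving faster
		// than they can be applied accumulate in memory and can OOM the node.
		return errQueueFull
	}
	q.msgs = append(q.msgs, msg)
	q.queuedBytes += int64(len(msg))
	return nil
}

func main() {
	q := &raftReceiveQueue{maxBytes: 64 << 20} // assumed 64 MiB budget
	sst := make([]byte, 48<<20)                // a large SST-carrying entry
	fmt.Println(q.enqueue(sst))                // fits within the budget
	fmt.Println(q.enqueue(sst))                // budget exceeded
}
```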
Thanks for taking a look.
👍 . That is definitely consistent with everything I've seen. It is a bit of a bummer that n7 was selected as the target for the transfer, since it seems to have already been in some distress.
Yeah, I have a feeling the profile was just taken a bit too early. The profile looks consistent with the SST construction's memory monitoring. The last profile we have is from 07:13:07, which looks about 30-40 seconds too early.
Nice sleuthing @kvoli.
How would we improve profile capture here to make this less speculative/more responsive? If you file an issue this'll get done.
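As a rough idea of the direction, a threshold-triggered heap profile loop could look like the sketch below. The threshold, polling interval, and file naming are assumed example values, not the actual heap profiler's behaviour.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// Minimal sketch: dump a heap profile whenever heap usage crosses a threshold,
// checked on a short interval, so the last profile before an OOM is at most a
// few seconds stale. All values here are assumptions for illustration.
func watchHeap(thresholdBytes uint64, interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		if m.HeapAlloc < thresholdBytes {
			continue
		}
		f, err := os.Create(fmt.Sprintf("heap_%d.pprof", time.Now().Unix()))
		if err != nil {
			continue
		}
		_ = pprof.WriteHeapProfile(f)
		_ = f.Close()
	}
}

func main() {
	go watchHeap(12<<30, 5*time.Second) // e.g. trigger at 12 GiB on a 16 GiB node
	select {}                           // stand-in for the rest of the server
}
```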
roachtest.import/tpch/nodes=8 failed with artifacts on release-22.2 @ 00ed5143845ec05797d16e6ab61d179cf51775f2:
Parameters:
Same failure on other branches
Closing, as it looks like we got to the cause in the initial investigation and the follow-up failure is now too old to investigate.
roachtest.import/tpch/nodes=8 failed with artifacts on release-22.2 @ cffe9bc440988894abe9a598ea6b2f15e1b7df93:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-20549