
Remove redundant response of empty result in AsyncShardFetch to avoid OOM issue #84010

Closed
wants to merge 2 commits

Conversation

maosuhan

During a full cluster restart, we often run into OOM problems. After analyzing a heap dump, we found the following problem.

[screenshot: heap dump analysis showing the objects dominating memory]

In the process of fetching data, the master asks each data node whether it holds the shard, and all of the nodes' responses are saved in the map<nodeId, nodeResponse> in AsyncShardFetch. In most cases a shard has data on only a few nodes, but the intermediate result map still stores the responses from all nodes.

In this PR, we only save valid intermediate node responses and discard responses from nodes that do not hold the shard at all.
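
Below is a minimal sketch of the idea; the class and method names are illustrative, not the actual AsyncShardFetch API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch: cache a node's response only when that node
// actually holds a copy of the shard.
final class ShardFetchCache<T> {
    private final Map<String, T> responsesByNode = new ConcurrentHashMap<>();
    private final Predicate<T> hasShardData; // e.g. response -> response.isEmpty() == false

    ShardFetchCache(Predicate<T> hasShardData) {
        this.hasShardData = hasShardData;
    }

    void onNodeResponse(String nodeId, T response) {
        if (hasShardData.test(response)) {
            responsesByNode.put(nodeId, response);
        }
        // Empty responses are dropped, so the cache scales with the number
        // of shard copies (usually a handful) rather than the node count.
    }
}
```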

This has proved successful in our company: the OOM issue is gone after the optimization.

@cla-checker-service

cla-checker-service bot commented Feb 16, 2022

❌ Author of the following commits did not sign a Contributor Agreement:
e37d3e1

Please read and sign the above-mentioned agreement if you want to contribute to this project.

@elasticsearchmachine elasticsearchmachine added v8.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Feb 16, 2022
@DaveCTurner
Contributor

What version are you using @maosuhan? I think this excessive memory usage was addressed by #77991.

@maosuhan
Author

@DaveCTurner We have already backported #77991, but the OOM issue still exists.
We run 7.6.2 in production, with about 1 million shards and nearly 400 nodes.

The main consumer of memory is not DiscoveryNode but StoreFilesMetadata, which is the payload of a node response and is relatively large.

@DaveCTurner
Contributor

Can you help me understand why an empty StoreFilesMetadata is still "relatively large"? I get that there are a lot of them, but each one should be small. Is that not the case?

I think this will be addressed by some bigger changes related to #77466, particularly #80694 and friends.

@maosuhan maosuhan force-pushed the enhance_async_shard_fetch branch from b515a21 to e37d3e1 on February 16, 2022 09:31
@DaveCTurner DaveCTurner added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Feb 16, 2022
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Feb 16, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner self-assigned this Feb 16, 2022
@DaveCTurner
Contributor

Also, would you sign the CLA?

@maosuhan
Author

CLA Done @DaveCTurner

@maosuhan
Author

@DaveCTurner As you can see in the screenshot, there were 32,927,024 objects in total in the dump, and they occupied more than 15GB.
So each object is about 0.45KB.

@DaveCTurner
Contributor

So each object is about 0.45KB.

i.e. around 450 bytes. That's a lot more than I'd expect. Can you explain why it's so much?

@DaveCTurner
Contributor

Are you sure you signed the CLA with the address [email protected] attached to your commits? I don't see this address in the list.

@DaveCTurner
Contributor

On reflection I would rather avoid allocating these objects entirely than create them and then filter them out as proposed here. #84034 should do what we want.
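
For contrast, a hypothetical sketch of the "avoid allocating entirely" approach; the names here are illustrative and #84034 contains the actual change. The data node answers with one shared immutable instance instead of building a fresh response object per request:

```java
import java.util.Map;

// Hypothetical illustration, not the actual #84034 change: a node that holds
// no copy of the shard replies with a shared EMPTY singleton, so no new
// response object is allocated at all for the common empty case.
record StoreFilesResponse(Map<String, String> metadata, Map<String, String> commitUserData) {
    static final StoreFilesResponse EMPTY = new StoreFilesResponse(Map.of(), Map.of());

    static StoreFilesResponse forShard(boolean shardPresent,
                                       Map<String, String> metadata,
                                       Map<String, String> commitUserData) {
        return shardPresent ? new StoreFilesResponse(metadata, commitUserData) : EMPTY;
    }
}
```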

@maosuhan
Author

Thanks for your effort! I think it is more reasonable to do it that way.
I just debugged in my IDE; below is the screenshot.
[screenshot: debugger view of an empty StoreFilesMetadata instance]

metadata and commitUserData are both empty HashMaps read from the StreamInput, and each HashMap occupies 64 bytes, so each object is at least 128 bytes.
A non-empty result is definitely much larger than an empty one, but the accumulated wasted memory becomes very large when both the cluster size and the node count are large.
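
To make the overhead concrete, here is a small illustrative sketch; the compactIfEmpty helper is hypothetical and the ~64-byte figure assumes a typical 64-bit JVM:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class EmptyMapDemo {
    // Swap a freshly deserialized empty HashMap (~64 bytes each) for the
    // shared immutable singleton, avoiding the per-response allocation.
    static <K, V> Map<K, V> compactIfEmpty(Map<K, V> read) {
        return read.isEmpty() ? Collections.emptyMap() : read;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = compactIfEmpty(new HashMap<>());
        Map<String, String> commitUserData = compactIfEmpty(new HashMap<>());
        // Both references now alias the same singleton: zero extra bytes
        // per empty response instead of roughly 128.
        System.out.println(metadata == commitUserData); // prints: true
    }
}
```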
Feel free to close this issue.

@DaveCTurner
Contributor

DaveCTurner commented Feb 16, 2022

Sounds good, thanks for raising the issue. Closing this, will proceed with #84034 instead.

@elasticsearchmachine
Collaborator

@maosuhan please enable the option "Allow edits and access to secrets by maintainers" on your PR. For more information, see the documentation.
