Massive async shard fetch requests consume lots of heap memories on master node. #80694

Open
Tracked by #77466
howardhuanghua opened this issue Nov 14, 2021 · 10 comments · May be fixed by #81081
Labels
>bug, :Distributed Coordination/Allocation, Team:Distributed (Obsolete)

Comments

@howardhuanghua
Contributor

howardhuanghua commented Nov 14, 2021

Elasticsearch version: 7.10

JVM version (java -version): JDK 11

Description of the problem including expected versus actual behavior:
In #77991 we solved the async shard fetch response memory consumption issue.
But we found that async shard fetch requests also consume a lot of heap memory. Here is our production environment for this case:
Number of data nodes: 75
Number of dedicated master nodes: 3
Master node resources: 2 CPU cores, 8 GB physical memory, 4 GB heap.
Total number of shards: 15,000

After a full cluster restart, once the new master is elected, its heap is exhausted within a few seconds. We dumped the heap and found that Netty's in-flight outbound requests were using a large share of it:
[heap dump screenshot]

Each WriteOperation corresponds to a single-shard request to a specific node (a 16 KB buffer each):
[screenshot of WriteOperation instances in the heap dump]

In the Netty4MessageChannelHandler class we can see a queuedWrites queue; messages are flushed asynchronously:

// pending outbound messages buffered by Netty4MessageChannelHandler until they are flushed to the channel
private final Queue<WriteOperation> queuedWrites = new ArrayDeque<>();

So besides trimming the shard fetch responses, we also need to deal with the massive number of outgoing shard fetch requests.

@howardhuanghua howardhuanghua added the >bug and needs:triage labels Nov 14, 2021
@howardhuanghua howardhuanghua changed the title from "Massive async shard fetch requests cost lots of heap memories on master node." to "Massive async shard fetch requests consume lots of heap memories on master node." Nov 14, 2021
@DaveCTurner
Contributor

Relates #77466

@DaveCTurner DaveCTurner added the :Distributed Coordination/Allocation label Nov 14, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Nov 14, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner removed the Team:Distributed (Obsolete) and needs:triage labels Nov 14, 2021
@DaveCTurner
Contributor

I can see that this might be a problem: 15k shards × 75 nodes × 16kiB buffer for each message ≈ 17GiB of memory churn when the cluster state recovers.
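(Spelling out that arithmetic: 15,000 shards × 75 nodes is 1,125,000 request messages, and at 16 KiB of buffer each that comes to roughly 17 GiB.)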

@howardhuanghua
Contributor Author

That's it. In the end we scaled the heap up to 16 GB and the cluster recovered.

@howardhuanghua
Contributor Author

The main problem is that we fetch each shard from all the data nodes, when in fact only one specific data node holds the right shard copy. Also, we don't need to fetch all shards at once at the beginning, since concurrent allocation is throttled anyway. If a single data node restarts, we can capture the unassigned shards' target node in the node-left (disassociation) event; then during shard recovery we can fetch each shard from that single target node, or from the nodes that hold the shard's primary and replicas. This is what we have already done in our local test environment. But in the full-cluster-restart case no shard routing information is persisted in the cluster state, so we don't know which node each previously allocated, now-unassigned shard was on. We could discuss more details; it would be our pleasure to optimize this part if possible.
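As a rough illustration of the single-node-restart idea above (hypothetical names, not the actual Elasticsearch API): remember which node held each shard when that node left, and consult that map later to narrow the fetch instead of asking every data node.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; the real allocator and node-left handling look different.
class LastKnownShardLocations {
    // shard id -> node ids that held a copy (primary or replica) before the node left
    private final Map<String, Set<String>> lastKnownNodes = new ConcurrentHashMap<>();

    // called from a (hypothetical) node-left event handler
    void recordNodeLeft(String nodeId, Set<String> shardIdsOnNode) {
        for (String shardId : shardIdsOnNode) {
            lastKnownNodes.computeIfAbsent(shardId, k -> ConcurrentHashMap.newKeySet()).add(nodeId);
        }
    }

    // narrow the async fetch to the remembered nodes; fall back to all data nodes if unknown
    Set<String> fetchTargets(String shardId, Set<String> allDataNodes) {
        Set<String> known = lastKnownNodes.get(shardId);
        return (known == null || known.isEmpty()) ? allDataNodes : known;
    }
}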

@DaveCTurner
Contributor

DaveCTurner commented Nov 15, 2021

I suspect the problem is primary allocation (i.e. internal:gateway/local/started_shards) rather than replica allocation (i.e. internal:cluster/nodes/indices/shard/store). Replica allocation messages are naturally throttled because they happen as the primaries come online.

I have a couple of ideas for addressing this:

  1. We could batch the requests up so that there's only a limited number of requests in flight at once, maybe just one at a time. Rather than sending each request right away, if there's already one in flight we could just add it to the next batch to be sent.

  2. We could keep track of which shards are definitely not on each node and skip sending requests to nodes that don't have anything to offer.

I prefer the first idea I think. The second is effectively a cache which we could populate at cluster startup fairly easily but it would be tricky to keep its contents correct as the cluster is running (the refresh or invalidation logic is pretty complicated). Batching has the disadvantage that a shard on a slow/broken disk will hold up replies about other shards but I don't think that's a big deal in practice since a slow/broken disk on a node causes other issues. I think batching would scale better in even larger clusters too: if we stick to the one-message-per-shard model then a fairly modest shard count (~100k or so) will always need GBs of memory just for the allocation messages.
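A minimal sketch of what idea 1 could look like, using hypothetical names rather than the real Elasticsearch classes: at most one fetch request per node is in flight, and shard ids that arrive in the meantime are accumulated into the next batch.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch only, not the actual implementation.
class PerNodeBatchedFetcher {
    private final Consumer<List<String>> sendBatch; // sends one transport request carrying a batch of shard ids
    private final List<String> pending = new ArrayList<>();
    private boolean inFlight = false;

    PerNodeBatchedFetcher(Consumer<List<String>> sendBatch) {
        this.sendBatch = sendBatch;
    }

    // called for every shard we want to fetch from this node
    synchronized void fetch(String shardId) {
        pending.add(shardId);
        maybeSend();
    }

    // called when the response (or failure) for the in-flight batch arrives
    synchronized void onBatchCompleted() {
        inFlight = false;
        maybeSend();
    }

    private void maybeSend() {
        if (!inFlight && !pending.isEmpty()) {
            inFlight = true;
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            sendBatch.accept(batch);
        }
    }
}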

@DaveCTurner
Contributor

I won't have time to work on this myself in the near future, but if you would like to work on this then please do.

@howardhuanghua
Contributor Author

@DaveCTurner Thanks for the suggestion. Just to confirm the details of the batching mode: is the idea to group all the unassigned primary shards together and send them to each node, so that each node receives only a single fetch request? Otherwise we would still need to prepare one request per shard for each node.

@DaveCTurner
Contributor

Yes, that's about right. It won't be all the unassigned shards; ideally we'd keep the same overall behaviour as today, just using fewer transport messages.

Having looked at this a bit more I can see value in adding batching to the replica allocator too. There are cases (e.g. a big network partition) where we'd send a large number of internal:cluster/nodes/indices/shard/store messages today.
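For illustration only, a batched request/response for either allocator could be shaped roughly like this (hypothetical record names, not the real transport classes): one request per node carries many shard ids, and the response maps each shard id back to its result.

import java.util.List;
import java.util.Map;

// Hypothetical shapes for a batched shard-metadata fetch; names and fields are illustrative only.
record BatchedShardFetchRequest(List<String> shardIds) {}

record ShardFetchResult(String shardId, boolean hasCopy, String allocationId) {}

record BatchedShardFetchResponse(Map<String, ShardFetchResult> results) {}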

@howardhuanghua
Contributor Author

howardhuanghua commented Nov 27, 2021

Hi @DaveCTurner, I quickly implemented a draft, #81081, to batch the async shard fetch requests for primaries first.
Please help check whether it's the right direction. The main logic is in InternalPrimaryShardAllocator; there is still some work to be done.
