Massive async shard fetch requests consume lots of heap memory on master node. #80694
Relates #77466
Pinging @elastic/es-distributed (Team:Distributed)
I can see that this might be a problem: 15k shards × 75 nodes × 16kiB buffer for each message ≈ 17GiB of memory churn when the cluster state recovers.
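As a rough sanity check of that figure (using only the shard count, node count, and per-message buffer size quoted in this issue; this is a standalone illustration, not Elasticsearch code):

```java
public class FetchRequestMemoryEstimate {
    public static void main(String[] args) {
        long shards = 15_000;           // unassigned shards after the full cluster restart
        long dataNodes = 75;            // each shard is fetched from every data node
        long bufferBytes = 16 * 1024;   // ~16 KiB outbound buffer per message

        long messages = shards * dataNodes;
        long totalBytes = messages * bufferBytes;
        // 15,000 shards * 75 nodes = 1,125,000 messages; * 16 KiB ≈ 17.2 GiB
        System.out.printf("messages=%d, total ≈ %.1f GiB%n",
                messages, totalBytes / (1024.0 * 1024 * 1024));
    }
}
```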
That's it. Eventually we scaled the heap up to 16GB and the cluster recovered.
The main problem is that we fetch each shard from all the data nodes, while in fact only one specific data node contains the right shard copy. Also, we don't need to fetch all the shards at the same time at the beginning, since allocation is already limited by the concurrent allocation throttle. If we restart a single data node, we could capture the unassigned shards' target node info in the node-left (disassociation) event. Then during shard recovery we could fetch each shard from that single target node, or from the nodes that hold the shard's primary and replicas. This is what we have already done in our local test environment. But in the full cluster restart case, where no shard routing info was persisted with the cluster state, we don't know which specific node each unassigned shard was previously allocated to. We could discuss more details; it's our pleasure to optimize this part if possible.
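A minimal, self-contained sketch of that idea, assuming we simply remember which nodes held a shard when they left the cluster; all names here (ShardTargetTracker, onNodeLeft, fetchTargetsFor) are hypothetical and not Elasticsearch APIs:

```java
import java.util.*;

// Hypothetical sketch, not Elasticsearch code: remember where shards lived
// when a node leaves, so recovery can ask only those nodes.
class ShardTargetTracker {
    // shardId -> ids of nodes that held a copy when they were disassociated
    private final Map<String, Set<String>> lastKnownLocations = new HashMap<>();

    // Called from the (hypothetical) node-left handler with the shards the node hosted.
    void onNodeLeft(String nodeId, Collection<String> hostedShardIds) {
        for (String shardId : hostedShardIds) {
            lastKnownLocations.computeIfAbsent(shardId, k -> new HashSet<>()).add(nodeId);
        }
    }

    // During recovery, fetch only from the recorded nodes; fall back to the full
    // fan-out when nothing was recorded (e.g. after a full cluster restart).
    Set<String> fetchTargetsFor(String shardId, Set<String> allDataNodes) {
        return lastKnownLocations.getOrDefault(shardId, allDataNodes);
    }
}
```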
I suspect the problem is primary allocation. I have a couple of ideas for addressing this:
I prefer the first idea I think. The second is effectively a cache which we could populate at cluster startup fairly easily but it would be tricky to keep its contents correct as the cluster is running (the refresh or invalidation logic is pretty complicated). Batching has the disadvantage that a shard on a slow/broken disk will hold up replies about other shards but I don't think that's a big deal in practice since a slow/broken disk on a node causes other issues. I think batching would scale better in even larger clusters too: if we stick to the one-message-per-shard model then a fairly modest shard count (~100k or so) will always need GBs of memory just for the allocation messages.
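For illustration, a rough sketch of the batched shape being discussed here, with one request per node carrying a list of shard ids instead of one message per shard per node; the types and method names are made up for this sketch, not the actual transport actions:

```java
import java.util.*;

// Illustrative only: one fetch request per node carrying a batch of shard ids,
// instead of one ~16 KiB message per shard per node.
class BatchedShardFetch {
    record NodeRequest(String nodeId, List<String> shardIds) {}

    static List<NodeRequest> buildRequests(Collection<String> dataNodes, List<String> unassignedShards) {
        List<NodeRequest> requests = new ArrayList<>();
        for (String nodeId : dataNodes) {
            // e.g. 75 requests in total instead of 15,000 * 75 individual messages
            requests.add(new NodeRequest(nodeId, List.copyOf(unassignedShards)));
        }
        return requests;
    }
}
```

With 75 data nodes this would mean on the order of 75 outbound messages per fetch round rather than over a million.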
I won't have time to work on this myself in the near future, but if you would like to work on this then please do. |
@DaveCTurner Thanks for the suggestion. Just to confirm the details of the batching mode: one of the ideas is to group all the unassigned primary shards together and send them to each node, so that each node receives only a single fetch request. Otherwise we would still need to prepare a separate request per shard for each node?
Yes that's about right. It won't be all the unassigned shards; ideally we'd keep the same overall behaviour as today, just using fewer transport messages. Having looked at this a bit more I can see value in adding batching to the replica allocator too. There are cases (e.g. a big network partition) where we'd send a large number of these requests as well.
Hi @DaveCTurner, I quickly implemented a draft, #81081, to batch the async shard fetch requests, starting with primaries.
Elasticsearch version: 7.10
JVM version (java -version): JDK 11
Description of the problem including expected versus actual behavior:
In #77991 we solved the async shard fetch responses memory consumption issue.
But we found that async shard fetch requests also consume lots of heap memory. Here is our production environment for this exception case:
Data nodes: 75
Dedicated master nodes: 3
Master node resources: 2 CPU cores, 8GB physical memory, 4GB heap memory.
Total shards: 15000
When the new master was elected after a full cluster restart, the elected master's heap memory was used up within several seconds. We dumped the heap and found that Netty's in-flight outbound requests used lots of heap:
Each WriteOperation should be a single shard request to a specific node (16k buffer size each). From the Netty4MessageChannelHandler class we can see a queuedWrites queue; messages are flushed asynchronously:
elasticsearch/modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4MessageChannelHandler.java (line 42 in 025dbdc)
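To illustrate why queued-but-unflushed writes add up, here is a minimal standalone sketch (not Netty or Elasticsearch code) of a per-channel outbound queue in which every pending operation pins a roughly 16 KiB buffer on the heap until the channel is flushed:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified illustration of a per-channel outbound queue: every queued write
// keeps its serialized buffer on the heap until the channel is flushed.
public class QueuedWritesSketch {
    private final Queue<ByteBuffer> pending = new ArrayDeque<>();

    void write(ByteBuffer message) {
        pending.add(message);   // buffer stays referenced until flushed
    }

    long pendingBytes() {
        long total = 0;
        for (ByteBuffer b : pending) {
            total += b.remaining();
        }
        return total;
    }

    public static void main(String[] args) {
        QueuedWritesSketch channel = new QueuedWritesSketch();
        // One ~16 KiB request per shard queued on the channel to a single node:
        for (int shard = 0; shard < 15_000; shard++) {
            channel.write(ByteBuffer.allocate(16 * 1024));
        }
        // 15,000 * 16 KiB ≈ 234 MiB pending on this one channel alone.
        System.out.printf("pending ≈ %.0f MiB%n", channel.pendingBytes() / (1024.0 * 1024));
    }
}
```

With one such channel per data node (75 in this cluster), the pending buffers alone reach the ~17 GiB range mentioned above.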
So besides cutting down the shard fetch responses, we also need to handle the massive number of outbound shard fetch requests.