Too many async fetch shard results cause JVM heap explosion during cluster recovery #76218
Comments
Pinging @elastic/es-distributed (Team:Distributed)
First of all, yes, this does seem like a problem, and I appreciate the detailed report. I'm not sure everyone would describe 50k shards in a single cluster as "reasonable", so at least I'm not totally surprised there are some things we don't test so thoroughly at that scale, but we do track gaps like the one you've found, so I'll add it to the list. The ideas you suggest are all insightful, although none is an obvious winner.
@DaveCTurner Thanks for the reply. It seems the second idea, reducing the response result kept in the cache, would be a simple and straightforward tentative solution. We could transfer the
Yes, I think that's a nice self-contained improvement to do here. On reflection I think it doesn't completely fix the problem: there's still a risk that we collect almost all the responses in memory before inserting them into the cache, but this is unlikely. Are you offering a PR? That'd be great if so.
Thanks. After the implementation I will submit a PR.
Hi @DaveCTurner, I have implemented purge logic that runs after the master node receives a shard fetch result. Please help to check it, thanks.
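For illustration, here is a minimal sketch of the kind of trimming such purge logic could do on the master: keep only the fields the allocator still needs once a per-node response arrives, and let the heavy response object be garbage-collected. The class and field names below are hypothetical, not the code from the PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: once the master receives a per-node shard fetch response,
// copy the few fields that are still needed and drop the heavy response object
// (with its DiscoveryNode/TransportAddress) so it can be garbage-collected.
final class TrimmedShardFetchCache {

    static final class Entry {
        final String nodeId;
        final String allocationId;

        Entry(String nodeId, String allocationId) {
            this.nodeId = nodeId;
            this.allocationId = allocationId;
        }
    }

    private final Map<String, Entry> entriesByNode = new ConcurrentHashMap<>();

    // Called on the master when a node's fetch response arrives.
    void onNodeResponse(String nodeId, String allocationId) {
        entriesByNode.put(nodeId, new Entry(nodeId, allocationId));
        // The full response is not retained beyond this call.
    }

    Map<String, Entry> entries() {
        return Map.copyOf(entriesByNode);
    }
}
```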
Cluster version: 7.10.1.
Dedicated master nodes: 3.
Dedicated data nodes: 163.
Total primary shards: 80k, no replicas.
After a full cluster restart, all the primaries are fetched from all of the data nodes concurrently:
elasticsearch/server/src/main/java/org/elasticsearch/gateway/GatewayAllocator.java
Lines 228 to 243 in 4f22f43
The fetch result mainly contains DiscoveryNode and TransportAddress, around 1.7KB of heap per entry. So a single shard's fetch result costs about 1.7KB * 163 data nodes ≈ 280KB, and 80k shards cost about 80000 * 280KB ≈ 21GB.
This heap cost would blow up the current master node's JVM heap.
Even with a more reasonable 50k shards in a cluster, it would need almost 15GB of heap, which is a huge memory cost.
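As a quick sanity check of the arithmetic above, the small program below reproduces the estimate; the 1.7KB per-entry figure is taken from the report, everything else is plain arithmetic.

```java
// Back-of-the-envelope check of the heap estimate described above.
// The 1.7KB-per-entry figure comes from the report; the rest is arithmetic.
public class FetchHeapEstimate {
    public static void main(String[] args) {
        double perNodeEntryKb = 1.7;   // approx. size of one DiscoveryNode/TransportAddress entry
        int dataNodes = 163;
        long shards = 80_000;

        double perShardKb = perNodeEntryKb * dataNodes;            // ~277KB per shard
        double totalGb = shards * perShardKb / (1024.0 * 1024.0);  // ~21GB for 80k shards

        System.out.printf("per-shard fetch result: %.0fKB, total for %d shards: %.1fGB%n",
                perShardKb, shards, totalGb);
    }
}
```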
Several ideas that could address this issue:
1. A single shard should only belong to a single node, so after a cluster restart we should not need to send fetch requests to all of the data nodes. But in a freshly started cluster there is no routing table from which to extract the node a shard was previously allocated to. Could we persist the node id, just like inSyncAllocationIds in IndexMetadata? Then we could send a single fetch request to the node the shard used to be allocated on.
2. BaseNodeResponse contains a DiscoveryNode, and its basic abstract class TransportMessage has an entry that duplicates information already in the DiscoveryNode. In the fetch case only the nodeId is required; the node attributes are not necessary at all. Could we return only the nodeId in the shard fetch response instead of these heavy structures?
3. Shard recovery has node-level concurrency limits, so could we fetch the store info for a portion of the shards at a time instead of fetching all the shards together at the beginning?
4. Send one request per node and batch all of the shards' info into a single result (see the sketch after this list).
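As a rough illustration of idea 4, the response shape below batches all shard results from one data node into a single message, so per-node overhead such as the node identity is paid once per node rather than once per shard per node. The class and field names are hypothetical, not existing Elasticsearch classes.

```java
import java.util.Map;

// Hypothetical sketch of a batched per-node fetch response: one message per data
// node, carrying the store info for every shard that node holds a copy of.
final class BatchedNodeFetchResponse {

    private final String nodeId;
    // shardId -> allocation id of the copy found on this node; shards with no
    // local copy are simply absent from the map.
    private final Map<String, String> allocationIdByShard;

    BatchedNodeFetchResponse(String nodeId, Map<String, String> allocationIdByShard) {
        this.nodeId = nodeId;
        this.allocationIdByShard = Map.copyOf(allocationIdByShard);
    }

    String nodeId() { return nodeId; }

    Map<String, String> allocationIdByShard() { return allocationIdByShard; }
}
```

With 163 data nodes and 80k shards, this keeps the number of in-flight responses proportional to the node count rather than to shards × nodes.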