
Too many async shard fetch results caused JVM heap explosion during cluster recovery #76218

Open
howardhuanghua opened this issue Aug 8, 2021 · 6 comments
Labels
>bug · :Distributed Coordination/Allocation · Team:Distributed (Obsolete)

Comments

@howardhuanghua
Contributor

howardhuanghua commented Aug 8, 2021

Cluster version: 7.10.1.
Dedicated master nodes: 3.
Dedicated data nodes: 163.
Total primary shards: 80k, no replicas.

After a full cluster restart, the started-shard state for all primaries is fetched from all of the data nodes concurrently:

protected AsyncShardFetch.FetchResult<NodeGatewayStartedShards> fetchData(ShardRouting shard, RoutingAllocation allocation) {
    // explicitly type lister; some IDEs (Eclipse) are not able to correctly infer the function type
    Lister<BaseNodesResponse<NodeGatewayStartedShards>, NodeGatewayStartedShards> lister = this::listStartedShards;
    AsyncShardFetch<NodeGatewayStartedShards> fetch =
        asyncFetchStarted.computeIfAbsent(shard.shardId(),
            shardId -> new InternalAsyncFetch<>(logger, "shard_started", shardId,
                IndexMetadata.INDEX_DATA_PATH_SETTING.get(allocation.metadata().index(shard.index()).getSettings()),
                lister));
    AsyncShardFetch.FetchResult<NodeGatewayStartedShards> shardState =
        fetch.fetchData(allocation.nodes(), allocation.getIgnoreNodes(shard.shardId()));
    if (shardState.hasData()) {
        shardState.processAllocation(allocation);
    }
    return shardState;
}

Each per-node fetch result mainly contains a DiscoveryNode and a TransportAddress, using around 1.7KB of heap.
So a single shard's fetch result costs 1.7KB * 163 data nodes ≈ 280KB, and 80k shards cost 80000 * 280KB ≈ 21GB.
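
For clarity, here is the same back-of-envelope arithmetic as a runnable sketch (the 1.7KB per-entry figure is the estimate from this report, not an exact constant):

// Back-of-envelope estimate of the fetch-result cache footprint,
// using the numbers from this report.
public class FetchCacheEstimate {
    public static void main(String[] args) {
        long perEntryBytes = 1_700L;                    // ~1.7KB per cached NodeGatewayStartedShards
        int dataNodes = 163;
        long perShardBytes = perEntryBytes * dataNodes; // ~280KB per shard
        long totalBytes = 80_000L * perShardBytes;      // ~21GB for 80k primaries
        System.out.printf("per shard: %.0f KB, total: %.1f GB%n",
            perShardBytes / 1000.0, totalBytes / (1024.0 * 1024 * 1024));
    }
}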
This heap cost blows up the current master node's JVM heap:
(screenshots of the master node's JVM heap usage omitted)

Even with a reasonable 50k shards in a cluster, this would need almost 15GB of heap; that's a huge memory cost.

Several ideas that could solve this issue:

  1. A single shard should only belong to a single node, so after a cluster restart we shouldn't need to send fetch requests to all of the data nodes. But in a freshly started cluster there is no routing table from which to extract the node a shard was previously allocated to. Could we persist the node id, just like inSyncAllocationIds in IndexMetadata? Then we could send a single fetch request to the node the shard used to be allocated on.

  2. BaseNodeResponse contains a DiscoveryNode, and its abstract base class TransportMessage holds a TransportAddress that duplicates the one inside the DiscoveryNode. In the fetch case only the nodeId is required; the node attributes are not necessary at all. Could we return only the nodeId in the shard fetch response instead of these heavy structures? (See the sketch after this list.)

  3. Shard recovery already has node-level concurrency limits. Could we fetch the store info for a portion of the shards at a time instead of fetching all shards together at the beginning?

  4. Send one request per node and batch all of the shards' info into a single result.
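
To illustrate idea 2, here is a minimal sketch of a slimmed-down cache entry (hypothetical class and field names, not the actual Elasticsearch API). Instead of caching the full NodeGatewayStartedShards with its DiscoveryNode, the master would keep only the fields the allocation logic actually consumes:

// Hypothetical slim replacement for a cached NodeGatewayStartedShards (idea 2).
// The nodeId can be resolved back to a full DiscoveryNode via the live cluster
// state whenever one is actually needed.
final class CachedShardStartedState {
    final String nodeId;            // id of the node that reported this shard copy
    final String allocationId;      // allocation id found on disk, may be null
    final boolean primary;          // whether the on-disk copy was a primary
    final Exception storeException; // non-null if opening the shard store failed

    CachedShardStartedState(String nodeId, String allocationId,
                            boolean primary, Exception storeException) {
        this.nodeId = nodeId;
        this.allocationId = allocationId;
        this.primary = primary;
        this.storeException = storeException;
    }
}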

@howardhuanghua added the >bug and needs:triage labels on Aug 8, 2021
@DaveCTurner added the :Distributed Coordination/Allocation label and removed needs:triage on Aug 8, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Aug 8, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

First of all, yes, this does seem like a problem, and I appreciate the detailed report. I'm not sure everyone would describe 50k shards in a single cluster as "reasonable", so at least I'm not totally surprised there are some things we don't test so thoroughly at that scale, but we do track gaps like the one you've found, so I'll add it to the list.

The ideas you suggest are all insightful, although none is an obvious winner.

  1. I think tracking the node ID alongside the allocation ID would run into some challenging corner cases, for instance to do with dangling indices, although at a high level I see why this might make sense.

  2. We pass DiscoveryNode objects around over the wire in lots of places as if they're lightweight values, but really these days they're quite heavy things. Maybe we should try to find a way to deduplicate the DiscoveryNode received over the wire, because we almost certainly have an instance of the right thing in memory already. In this case the problem is more that we keep each multi-kB NodeGatewayStartedShards in the cache; I think we could keep hold of something much lighter. (See the sketch after this list.)

  3. It's not really the number of concurrent fetches that's a problem; it's that we have to keep the details of all previous fetches in memory until the shard is allocated. If we can't allocate it straight away then we must try other shards, or else we'd risk blocking some allocations behind others that will never work. Perhaps we could drop data from the cache to bound its size, although we'd have to add some extra machinery to prevent looping forever.

  4. I think batching would be similarly challenging. Again, it's not the concurrent fetches that present the problem, it's the size of the cache, and batching wouldn't help with that.
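
As a rough illustration of the deduplication part of point 2 (a hypothetical helper, not actual Elasticsearch code): when a response is deserialized, the freshly allocated DiscoveryNode copy could be swapped for the canonical instance the master already holds in its cluster state, so the copy becomes garbage immediately:

// Hypothetical interning step for DiscoveryNode instances received over the wire.
// localNodes is the DiscoveryNodes view of the master's current cluster state.
static DiscoveryNode deduplicate(DiscoveryNode received, DiscoveryNodes localNodes) {
    DiscoveryNode canonical = localNodes.get(received.getId());
    // Reuse the in-memory instance if it is the same node incarnation; the
    // deserialized copy can then be collected right away.
    if (canonical != null && canonical.getEphemeralId().equals(received.getEphemeralId())) {
        return canonical;
    }
    return received;
}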

@howardhuanghua
Contributor Author

@DaveCTurner Thanks for the reply. It seems the second idea, reducing the response result kept in the cache, would be a simple and straightforward tentative solution. We could reduce the DiscoveryNode to just the node ID for the fetch case, once the master receives the heavy result. ^^

> We pass DiscoveryNode objects around over the wire in lots of places as if they're lightweight values, but really these days they're quite heavy things. Maybe we should try and find a way to deduplicate the DiscoveryNode received over the wire because we almost certainly have an instance of the right thing in memory already. In this case the problem is more that we keep each multi-kB NodeGatewayStartedShards in the cache, I think we could keep hold of something much lighter.

@DaveCTurner
Contributor

Yes, I think that's a nice self-contained improvement to make here. On reflection, it doesn't completely fix the problem: there's still a risk that we collect almost all the responses in memory before inserting them into the cache, but this is unlikely. Are you offering a PR? That'd be great if so.

@howardhuanghua
Contributor Author

Thanks. After the implementation I will submit a PR.

@howardhuanghua
Contributor Author

Hi @DaveCTurner, I have implemented purge logic that runs after the master node receives a shard fetch result. Please help check it, thanks.
#77266
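
For reference, a sketch of what such purging could look like (a hypothetical helper built on the slim entry sketched earlier; the actual change is in the linked PR): once an allocation round has consumed a node's response, the heavy payload is replaced with a slim copy so the cache no longer pins the full DiscoveryNode:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical purge step: convert each heavy per-node response into a slim
// cache entry keyed by node id, letting the DiscoveryNode instances be collected.
static Map<String, CachedShardStartedState> purgeHeavyFields(List<NodeGatewayStartedShards> responses) {
    Map<String, CachedShardStartedState> slim = new HashMap<>();
    for (NodeGatewayStartedShards response : responses) {
        String nodeId = response.getNode().getId();
        slim.put(nodeId, new CachedShardStartedState(nodeId,
            response.allocationId(), response.primary(), response.storeException()));
    }
    return slim;
}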
