
Batch async fetch shards data to reduce memory consumption. #81081

Open
howardhuanghua wants to merge 25 commits into main
Conversation

@howardhuanghua howardhuanghua (Contributor) commented Nov 27, 2021

This PR is intended to fix #80694.

  1. Queue the shard-level async fetch requests and their listeners.
  2. Flush all the queued requests once all of the node-level async fetch requests have been collected.
  3. After receiving the node-level fetch responses, split them back into per-node, per-shard results and call the cached listeners (a rough sketch of this flow is shown below).
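To illustrate the intent of these three steps, here is a minimal, self-contained sketch; the class and method names (BatchedShardFetcher, enqueue, flush) are illustrative only and are not the names used in this PR.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class BatchedShardFetcher {
    // Illustrative only: one queued entry per shard, grouped by node id.
    record QueuedShard(String shardId, String customDataPath, Consumer<String> listener) {}

    private final Map<String, List<QueuedShard>> queuedByNode = new HashMap<>();

    // Step 1: queue the shard-level request and cache its listener instead of
    // sending a transport request per shard immediately.
    void enqueue(String nodeId, String shardId, String customDataPath, Consumer<String> listener) {
        queuedByNode.computeIfAbsent(nodeId, n -> new ArrayList<>())
            .add(new QueuedShard(shardId, customDataPath, listener));
    }

    // Step 2: once the allocation round has queued everything, send one
    // node-level request per node instead of one request per shard per node.
    void flush() {
        queuedByNode.forEach(this::sendNodeLevelRequest);
        queuedByNode.clear();
    }

    // Step 3: when the node-level response arrives, split it back into
    // per-shard results and complete the cached per-shard listeners.
    private void sendNodeLevelRequest(String nodeId, List<QueuedShard> shards) {
        // Transport call omitted in this sketch; on response, iterate 'shards'
        // and invoke each QueuedShard.listener() with that shard's slice of it.
    }
}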

Async shard fetch requests before/after optimization:
[Screenshots: async shard fetch request counts before and after the optimization]

@elasticsearchmachine elasticsearchmachine added v8.1.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Nov 27, 2021
@howardhuanghua howardhuanghua marked this pull request as ready for review November 28, 2021 01:52
@howardhuanghua howardhuanghua changed the title Group primary shard async fetch requests by node to reduce memory consumption. [Draft]Group primary shard async fetch requests by node to reduce memory consumption. Nov 28, 2021
@DaveCTurner DaveCTurner (Contributor) left a comment:

This seems like the right sort of idea. I left some comments inline. I think we should do this for replica allocations too.

@@ -56,6 +56,7 @@
*/
public interface Lister<NodesResponse extends BaseNodesResponse<NodeResponse>, NodeResponse extends BaseNodeResponse> {
void list(ShardId shardId, @Nullable String customDataPath, DiscoveryNode[] nodes, ActionListener<NodesResponse> listener);
void flush();
DaveCTurner (Contributor) commented:

Rather than introducing this method on the lister (and the corresponding flag passed in to fetchData), could we have the allocator directly indicate the end of an allocation round, which would trigger the flush?
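One possible shape for that suggestion, sketched with a hypothetical hook name (allocationRoundComplete is not an existing Elasticsearch API), reusing the BatchedShardFetcher sketch above:

// Hypothetical: instead of Lister.flush() plus a flag on fetchData(), the
// allocator exposes an explicit end-of-round signal that triggers the flush.
class GatewayAllocator /* implements ExistingShardsAllocator in the real code */ {
    private final BatchedShardFetcher batchedFetcher = new BatchedShardFetcher(); // from the sketch above

    // Called by the allocation service after it has iterated every unassigned
    // shard, so no further shard-level fetch requests can arrive this round.
    public void allocationRoundComplete() {
        batchedFetcher.flush();
    }
}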

for (ShardId shardId : requestMap.keySet()) {
ShardRequestInfo shardRequest = requestMap.get(shardId);
shards.put(shardRequest.shardId(), shardRequest.getCustomDataPath());
if (node.getVersion().before(Version.V_7_16_0)) {
DaveCTurner (Contributor) commented:
The version in master is now 8.1.0; it's unlikely we'll backport this to an earlier version.
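If a version gate is still needed, it would presumably move to the first version that understands the batched request; a sketch, assuming that turns out to be 8.1.0 and using hypothetical helper names:

// Sketch only: nodes older than the batched action fall back to the
// existing per-shard requests; newer nodes get one batched request.
if (node.getVersion().before(Version.V_8_1_0)) {
    sendPerShardRequests(node, shards);   // hypothetical helper
} else {
    sendBatchedRequest(node, shards);     // hypothetical helper
}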

};

client.executeLocally(
TransportNodesListGatewayStartedShards.TYPE,
DaveCTurner (Contributor) commented:
I'm undecided about re-using the same action type for both kinds of request here. I think it'd be cleaner to introduce a new one (and to name it something better than internal:gateway/local/started_shards) given how big a difference in behaviour we are making.
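A sketch of what a dedicated action might look like; the action name and response class below are placeholders rather than names proposed in the PR, and it assumes the two-argument ActionType constructor available at the time of this PR:

// Hypothetical: a separate transport action for the batched request, so the
// existing internal:gateway/local/started_shards behaviour is left untouched.
public static final ActionType<NodesGatewayBatchedStartedShards> BATCHED_TYPE = new ActionType<>(
    "internal:gateway/local/batched_started_shards",   // placeholder action name
    NodesGatewayBatchedStartedShards::new               // placeholder response reader
);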

howardhuanghua (Contributor, Author) replied:
Hi @DaveCTurner, if we introduce a new action, then we need to refactor some logic in GatewayAllocator, such as the following structures; that seems like it would be a big change for the high-level allocators. What do you think?

private final ConcurrentMap<ShardId, AsyncShardFetch<NodeGatewayStartedShards>> asyncFetchStarted = ConcurrentCollections
.newConcurrentMap();
private final ConcurrentMap<ShardId, AsyncShardFetch<NodeStoreFilesMetadata>> asyncFetchStore = ConcurrentCollections
.newConcurrentMap();

DaveCTurner (Contributor) replied:
I'm not sure this is true, I think we could keep pretty much the same interface from the point of view of GatewayAllocator. It should be possible to implement a batching Lister which reworks the batched responses into a BaseNodesResponse<NodeGatewayStartedShards>.
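A rough sketch of that adapter idea, assuming a hypothetical BatchingFetcher component that does the queuing and flushing; the point is that GatewayAllocator still sees the existing Lister contract and response types:

// Hypothetical batching Lister: per-shard callers keep receiving a
// BaseNodesResponse<NodeGatewayStartedShards>, but the transport round-trip is
// shared by every shard queued in the same allocation round. This assumes the
// original single-method Lister interface, i.e. the end-of-round signal lives elsewhere.
class BatchingStartedShardsLister
    implements AsyncShardFetch.Lister<BaseNodesResponse<NodeGatewayStartedShards>, NodeGatewayStartedShards> {

    private final BatchingFetcher fetcher; // hypothetical component that queues and flushes

    BatchingStartedShardsLister(BatchingFetcher fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public void list(
        ShardId shardId,
        String customDataPath,
        DiscoveryNode[] nodes,
        ActionListener<BaseNodesResponse<NodeGatewayStartedShards>> listener
    ) {
        // Queue the shard and cache the listener; when the batched node-level
        // responses arrive, extract this shard's per-node results and wrap them
        // in a BaseNodesResponse before completing the listener.
        fetcher.enqueue(shardId, customDataPath, nodes, listener);
    }
}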

-    protected NodeGatewayStartedShards nodeOperation(NodeRequest request, Task task) {
+    protected NodeGroupedGatewayStartedShards nodeOperation(NodeRequest request, Task task) {
+        NodeGroupedGatewayStartedShards groupedStartedShards = new NodeGroupedGatewayStartedShards(clusterService.localNode());
+        for (Map.Entry<ShardId, String> entry : request.getShards().entrySet()) {
DaveCTurner (Contributor) commented:
When sending these requests per-shard we execute them in parallel across the FETCH_SHARD_STARTED threadpool. I think we should continue to parallelise them at the shard level like that.
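For example (a sketch only, with listShard and the add(...) accumulator as assumed names), the node-level handler could still fan each shard out to the FETCH_SHARD_STARTED executor and only assemble the grouped response once every shard has been listed:

// Sketch: keep per-shard parallelism on the data node by submitting each
// shard's listing to the FETCH_SHARD_STARTED threadpool, then collecting the
// results into the single grouped node-level response.
List<Future<NodeGatewayStartedShards>> futures = new ArrayList<>();
for (Map.Entry<ShardId, String> entry : request.getShards().entrySet()) {
    futures.add(
        threadPool.executor(ThreadPool.Names.FETCH_SHARD_STARTED)
            .submit(() -> listShard(entry.getKey(), entry.getValue()))   // listShard is a hypothetical helper
    );
}
for (Future<NodeGatewayStartedShards> future : futures) {
    try {
        groupedStartedShards.add(future.get());   // assumes the grouped response exposes an add(...) method
    } catch (Exception e) {
        throw new ElasticsearchException("failed to list started shards", e);
    }
}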

@howardhuanghua howardhuanghua (Contributor, Author) commented:

Thanks for the suggestion. I will work on completing the optimization.

@pgomulka pgomulka added the :Search/Search Search-related issues that do not fall into other categories label Nov 30, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Nov 30, 2021
@elasticmachine (Collaborator): Pinging @elastic/es-search (Team:Search)

@ywelsch ywelsch added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Search/Search Search-related issues that do not fall into other categories labels Nov 30, 2021
@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. and removed Team:Search Meta label for search team labels Nov 30, 2021
@elasticmachine (Collaborator): Pinging @elastic/es-distributed (Team:Distributed)

@howardhuanghua howardhuanghua changed the title [Draft]Group primary shard async fetch requests by node to reduce memory consumption. Group primary shard async fetch requests by node to reduce memory consumption. Dec 7, 2021
@howardhuanghua howardhuanghua changed the title Group primary shard async fetch requests by node to reduce memory consumption. Batch async fetch shards data to reduce memory consumption. Dec 7, 2021
@howardhuanghua howardhuanghua (Contributor, Author) commented:

> it'd help if we could break this down into several smaller (and easier-to-review) steps somehow.

Thanks. I will try to break this down into smaller steps. Any suggestions would be appreciated.

@DaveCTurner DaveCTurner (Contributor) commented:

> Thanks. I will try to break this down into smaller steps. Any suggestions would be appreciated.

IMO the trickiest part is that today there's no well-defined "end" to the shard-by-shard fetching process, and there needs to be an end so that we know when to flush the last batch of requests. This PR adds this event (and a lot of other things besides). I think making that change in isolation would be worth trying as a first step. The change won't mean very much on its own, because we won't actually be doing any flushing at the end, but once we have this flushing event I believe it'll be easier to review the changes that add batching on top of it.

Independently, I think you could detach TransportIndicesShardStoresAction from the InternalAsyncFetch mechanism. This action doesn't do any caching (and is hardly used anyway) so there's no need to involve it in these changes.

@howardhuanghua howardhuanghua (Contributor, Author) commented:

Hi @DaveCTurner, in the following logic from the PR above, we first iterate over all the unassigned primaries; each allocator.allocateUnassigned call adds a single pending fetching shard, and after the while loop we call flush again to handle the rest of the pending shards. Is this the "end" that you mentioned above?

private void allocateExistingUnassignedShards(RoutingAllocation allocation) {
        allocation.routingNodes().unassigned().sort(PriorityComparator.getAllocationComparator(allocation)); // sort for priority ordering

        for (final ExistingShardsAllocator existingShardsAllocator : existingShardsAllocators.values()) {
            existingShardsAllocator.beforeAllocation(allocation);
        }

        GatewayAllocator gatewayAllocator = null;
        if (logger.isDebugEnabled()) {
            logger.debug("set batch fetch mode [{}] for routing allocation.", batchFetchShardEnable);
        }
        allocation.setBatchShardFetchMode(batchFetchShardEnable);

        final RoutingNodes.UnassignedShards.UnassignedIterator primaryIterator = allocation.routingNodes().unassigned().iterator();
        while (primaryIterator.hasNext()) {
            final ShardRouting shardRouting = primaryIterator.next();
            if (shardRouting.primary()) {
                ExistingShardsAllocator allocator = getAllocatorForShard(shardRouting, allocation);
                allocator.allocateUnassigned(shardRouting, allocation, primaryIterator);
                if (gatewayAllocator == null && allocator instanceof GatewayAllocator) {
                    gatewayAllocator = (GatewayAllocator) allocator;
                }

                if (gatewayAllocator != null
                    && gatewayAllocator.getPrimaryPendingFetchShardCount() > 0
                    && gatewayAllocator.getPrimaryPendingFetchShardCount() % batchFetchShardStepSize == 0) {
                    gatewayAllocator.flushPendingPrimaryFetchRequests(batchFetchShardStepSize);
                }
            }
        }

        // flush the remaining primaries
        if (gatewayAllocator != null) {
            gatewayAllocator.flushPendingPrimaryFetchRequests(batchFetchShardStepSize);
        }

Original code:
https://github.com/TencentCloudES/elasticsearch/blob/e415426c511088ca5b0d4e86b205a6ab2b025648/server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java#L592-L595

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 14, 2023
- No need to use an `AsyncShardFetch` here, there is no caching
- Response may be very large, introduce chunking
- Fan-out may be very large, introduce throttling
- Processing time may be nontrivial, introduce cancellability
- Eliminate many unnecessary intermediate data structures
- Do shard-level response processing more eagerly
- Determine allocation from `RoutingTable` not `RoutingNodes`
- Add tests

Relates elastic#81081
elasticsearchmachine pushed a commit that referenced this pull request Mar 27, 2023
- No need to use an `AsyncShardFetch` here, there is no caching
- Response may be very large, introduce chunking
- Fan-out may be very large, introduce throttling
- Processing time may be nontrivial, introduce cancellability
- Eliminate many unnecessary intermediate data structures
- Do shard-level response processing more eagerly
- Determine allocation from `RoutingTable` not `RoutingNodes`
- Add tests

Relates #81081
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@quux00 quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023
@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
@elasticsearchmachine elasticsearchmachine added v9.1.0 Team:Distributed Coordination Meta label for Distributed Coordination team and removed v9.0.0 labels Jan 30, 2025
@elasticsearchmachine (Collaborator): Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))

@elasticsearchmachine (Collaborator): Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Labels

- >bug
- :Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
- external-contributor: Pull request authored by a developer outside the Elasticsearch team
- Team:Distributed Coordination: Meta label for Distributed Coordination team
- Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
- v9.1.0
Development

Successfully merging this pull request may close these issues.

Massive async shard fetch requests consume lots of heap memories on master node.