improve reroute with many shards #48579
Conversation
Hi @xiahongju, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails to your GitHub profile (they can be hidden), so we can match your e-mails to your GitHub profile?
Thanks for the contribution @xiahongju. Your analysis seems sound, but this work is already under way in #47817. There is no need to track the shards that are neither `INITIALIZING` nor `RELOCATING`.
This sounds like a cluster configured outside the recommended bounds. For instance, note these recommendations on shard sizing, which suggest using fewer, larger shards. I think they should also either freeze the older indices (if they want them to remain searchable) or remove them from the cluster entirely (if they do not need to be searchable); they can be restored from snapshots if needed.
@xiahongju could you sign the CLA with the email address you used for this PR?
Pinging @elastic/es-distributed (:Distributed/Allocation)
Ah, it looks like you already did; I couldn't tell because the PR was closed. No action required here, thanks.
Thank you @DaveCTurner, I am very happy and willing to be a co-author and contribute to the community. Yes, you're right: tracking shards in states other than `INITIALIZING` and `RELOCATING` does bring some memory overhead.
Today a couple of allocation deciders iterate through all the shards on a node to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster state updates in clusters with very high-density nodes holding many thousands of shards, even if those shards belong to closed or frozen indices. This commit pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up this search.

Closes #46941
Relates #48579

Co-authored-by: "hongju.xhj" <[email protected]>
Background: Last week, a customer of Alibaba Cloud Elasticsearch complained to us that it took 1 minute to create an index after their cluster was migrated from version 6.3.2 to 7.4.0. The customer's current cluster has 10 data nodes and more than 50,000 shards, uses index lifecycle management, and closes each index when it moves to the cold nodes.
Testing: We built a test environment to reproduce the problem:
Elasticsearch version: 7.4.0
Dedicated master nodes: 3 × (16 cores, 64 GB)
Data nodes: 2 × (16 cores, 64 GB)
We first created 5,000 indices, each with 5 primary shards and 0 replicas, for a total of 25,000 shards, and found that creating each new index took 58s. By analyzing the master's hot threads during index creation, we found that the problem was introduced by "Add support for replicating closed indices" (#39499): starting with 7.2.0, the shards of closed indices are still reinitialized and reallocated on data nodes.
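For reference, here is a minimal sketch of how the test indices might be created with the 7.x high-level REST client. The host, port, and index names are assumptions made for this sketch, not details from the original test harness.

```java
// Illustrative reproduction: create 5000 indices with 5 primaries and 0
// replicas each (25,000 shards total). Host and index names are assumptions.
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class CreateManyIndices {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            for (int i = 0; i < 5000; i++) {
                CreateIndexRequest request = new CreateIndexRequest("test-" + i)
                    .settings(Settings.builder()
                        .put("index.number_of_shards", 5)
                        .put("index.number_of_replicas", 0));
                client.indices().create(request, RequestOptions.DEFAULT);
            }
        }
    }
}
```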
Analysis: During reroute, BalancedShardsAllocator traverses all started shards in the cluster and, for each one, calculates the total size of the shards currently being relocated to that shard's node. This requires finding all shards that are `INITIALIZING` or being relocated, and the current implementation does so by traversing every shard on the node, which is very time-consuming. When there are many shards, almost any request that triggers a reroute hits this problem, in all versions of Elasticsearch.
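To illustrate the bottleneck, here is a condensed model of the lookup pattern described above. The names echo `RoutingNode.shardsWithState`, but this is a simplified sketch, not the actual Elasticsearch source.

```java
// Condensed model of the pre-fix lookup: every caller scans the node's full
// shard list just to find the few INITIALIZING or RELOCATING shards.
import java.util.ArrayList;
import java.util.List;

enum ShardRoutingState { STARTED, INITIALIZING, RELOCATING, UNASSIGNED }

class ShardRouting {
    private final ShardRoutingState state;

    ShardRouting(ShardRoutingState state) {
        this.state = state;
    }

    ShardRoutingState state() {
        return state;
    }
}

class RoutingNode {
    // On a high-density node this list can hold tens of thousands of entries.
    private final List<ShardRouting> shards = new ArrayList<>();

    // O(shards on the node) per call; the balancer invokes it for every shard
    // it considers, so one reroute does roughly quadratic work overall.
    List<ShardRouting> shardsWithState(ShardRoutingState state) {
        List<ShardRouting> matching = new ArrayList<>();
        for (ShardRouting shard : shards) {
            if (shard.state() == state) {
                matching.add(shard);
            }
        }
        return matching;
    }
}
```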
Solution: Since only a small number of shards are in the `INITIALIZING` or `RELOCATING` state at any given time, we can cache the shards in those states instead of recomputing them on every reroute. With this optimization, on the test cluster above, the time to create or close an index drops from 58s to 1.2s.
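Here is a sketch of the cached variant, reusing the toy types from the previous snippet. This is an assumed shape for illustration; the actual change landed via #47817.

```java
// Sketch of the cached approach: per-state sets are maintained incrementally
// whenever a shard is added to or removed from the node, so shardsWithState
// becomes a lookup of a typically tiny set rather than a full scan.
import java.util.Collections;
import java.util.EnumMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class CachingRoutingNode {
    private final Map<ShardRoutingState, Set<ShardRouting>> shardsByState =
        new EnumMap<>(ShardRoutingState.class);

    // Keep the cache in step with shard placement changes.
    void add(ShardRouting shard) {
        shardsByState.computeIfAbsent(shard.state(), k -> new LinkedHashSet<>()).add(shard);
    }

    void remove(ShardRouting shard) {
        Set<ShardRouting> set = shardsByState.get(shard.state());
        if (set != null) {
            set.remove(shard);
        }
    }

    // No scan over all shards on the node: return the pre-computed set.
    Set<ShardRouting> shardsWithState(ShardRoutingState state) {
        return shardsByState.getOrDefault(state, Collections.emptySet());
    }
}
```

Keeping the sets in step with add/remove is cheap, and the lookup cost no longer depends on the total number of shards on the node.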
We could work around this problem by setting `cluster.routing.allocation.disk.include_relocations` to `false`, but that approach has some drawbacks.
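For completeness, the workaround could be applied as a dynamic cluster settings update. The setting name is real; the client code below is an illustrative sketch using the 7.x high-level REST client.

```java
// Illustrative sketch: disable include_relocations via a persistent cluster
// settings update. Only the setting name comes from the discussion above;
// the surrounding client code is an assumption for this example.
import java.io.IOException;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

class IncludeRelocationsWorkaround {
    static void disableIncludeRelocations(RestHighLevelClient client) throws IOException {
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest()
            .persistentSettings(Settings.builder()
                .put("cluster.routing.allocation.disk.include_relocations", false)
                .build());
        client.cluster().putSettings(request, RequestOptions.DEFAULT);
    }
}
```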
Java stack traces of the masterService thread: