
recover primary shard first and start replica recovery after most of the primary started #2169

Closed
maosuhan opened this issue Feb 18, 2022 · 4 comments
Labels: distributed framework, enhancement

@maosuhan

In our company, we run a cluster with 1M shards across 400 nodes.
The data is business-critical, so availability matters a great deal. During full cluster recovery, we want primary shards to recover as soon as possible, and replica recovery to be deferred until most of the primaries have started.

Why we don't want to set cluster.routing.allocation.enable to primaries:

  1. It relies on deciders, and the deciders themselves add computation overhead.
  2. It relies on an SRE manually setting cluster.routing.allocation.enable back to all after all the primary shards are recovered.
  3. Fetching data for replica shards takes almost half an hour in our environment, because shards × nodes is too large. During that time, no shard can be recovered at all.

Our solution is to skip the replica recovery process entirely, including fetch data and recovery, until most of the primaries have started.
We add two new settings: cluster.routing.allocation.replica_recovery_wait_primary.enable and cluster.routing.allocation.replica_recovery_wait.threshold, where the threshold ranges from 0 to 1.0.
Replica recovery starts only once the percentage of active primary shards exceeds the threshold.
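A minimal sketch of how the proposed threshold gate could behave. This is not the actual OpenSearch decider API; the class and method names are hypothetical, and only the check described above (active primaries / total primaries vs. the configured threshold) is modeled:

```java
// Hypothetical sketch: gate replica recovery on the fraction of active
// primaries, per the proposed
// cluster.routing.allocation.replica_recovery_wait.threshold setting.
public class ReplicaRecoveryGate {
    private final double threshold; // valid range: 0.0 .. 1.0

    public ReplicaRecoveryGate(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        this.threshold = threshold;
    }

    /**
     * Replica allocation (including its async shard fetch) is allowed only
     * once activePrimaries / totalPrimaries reaches the threshold.
     */
    public boolean canRecoverReplicas(int activePrimaries, int totalPrimaries) {
        if (totalPrimaries == 0) {
            return true; // nothing to wait for
        }
        return (double) activePrimaries / totalPrimaries >= threshold;
    }
}
```

With a threshold of 0.9, replica recovery would stay blocked while, say, only 800 of 1000 primaries are active, and would unblock at 950.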

If the proposal sounds good, I can submit a PR.

maosuhan added the enhancement and untriaged labels on Feb 18, 2022
@Bukhtawar
Collaborator

With the cluster configuration you mention (1M shards), the major bottleneck IMO is the async shard fetch, which issues #nodes × #shards requests and which, as you note in point 3, is massively slow. It also bloats the JVM heap on the leader. Do you think optimising async shard fetch to batch requests would, as a first step, help your case of slow shard recovery?
The proposal you mention works around the problem; I would rather see whether we can optimise the cluster recovery process itself to scale to larger clusters.
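To illustrate why batching matters at this scale, here is a hypothetical sketch (not OpenSearch code; the names are invented). The naive scheme sends one fetch request per (node, shard) pair, while a batched scheme sends one request per node carrying all shard IDs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of batching async shard fetch: instead of one request
// per (node, shard) pair, group all shard IDs into a single request per node.
public class BatchedShardFetch {

    /** Naive scheme: one fetch request per node per shard. */
    public static long naiveRequestCount(int nodes, int shards) {
        return (long) nodes * shards;
    }

    /**
     * Batched scheme: each node receives one request carrying all shard IDs,
     * so the total request count drops from nodes * shards to nodes.
     */
    public static Map<String, List<String>> batchByNode(List<String> nodes,
                                                        List<String> shardIds) {
        Map<String, List<String>> perNode = new HashMap<>();
        for (String node : nodes) {
            perNode.put(node, new ArrayList<>(shardIds)); // one batch per node
        }
        return perNode;
    }
}
```

At the scale described in this issue (400 nodes, 1M shards), the naive scheme would amount to 400 million requests, versus 400 batched requests.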

@Bukhtawar
Collaborator

Relates to #2170

@maosuhan
Author

@Bukhtawar Thanks for your reply.
We already tried batched async shard fetch in our environment (elastic/elasticsearch#80694), and recovery speed did improve a lot. But primary recovery is still not as fast as we expected. As I listed above, the deciders in ReplicaShardAllocator also take a lot of time in the unassignedShards loop, which slows down MasterService throughput.

I think batched fetch and primary-first recovery don't conflict; they can work together.

@Bukhtawar
Collaborator

Thanks. Can you please share the stack traces (thread dump or profiler results) for the deciders that, based on your analysis, are causing the slowdown or additional memory overhead? I want to understand whether there is scope to optimize those areas as well.
