In our company, we run a cluster with roughly 1M shards across 400 nodes.
The business is critical and data availability matters a great deal. During a full cluster recovery we want to recover primary shards as quickly as possible; replica recovery can be delayed until most of the primaries have started.
Why we don't want to set cluster.routing.allocation.enable to primaries:
1. It still relies on the allocation deciders, and the deciders themselves add computation overhead.
2. It relies on an SRE manually setting cluster.routing.allocation.enable back to all after all the primary shards have recovered.
3. Fetching replica shard data takes almost half an hour in our environment because shards × nodes is so large. During that time no shard can be recovered at all.
Our solution is to skip the replica recovery process entirely, including the shard-data fetch, until most of the primaries have started.
We propose two new settings: cluster.routing.allocation.replica_recovery_wait_primary.enable and cluster.routing.allocation.replica_recovery_wait.threshold, where the threshold ranges from 0 to 1.0.
Replica recovery can only start once the percentage of active primary shards exceeds the threshold.
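Conceptually, the gate in front of replica allocation could look roughly like the sketch below. This is only an illustration of the intended behaviour under the proposed settings, not an actual implementation; the class, method, and parameter names are placeholders.

```java
import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.routing.ShardRouting;

// Rough sketch only: gate replica recovery on the fraction of active primaries.
public final class ReplicaRecoveryGate {

    // Returns true when replica recovery (including its shard-data fetch) may begin.
    static boolean replicaRecoveryAllowed(ClusterState state,
                                          boolean waitForPrimaryEnabled,   // proposed *.enable setting
                                          double activePrimaryThreshold) { // proposed *.threshold, 0 to 1.0
        if (waitForPrimaryEnabled == false) {
            return true; // feature disabled: behave exactly as today
        }
        int totalPrimaries = 0;
        int activePrimaries = 0;
        for (ShardRouting shard : state.routingTable().allShards()) {
            if (shard.primary()) {
                totalPrimaries++;
                if (shard.active()) {
                    activePrimaries++;
                }
            }
        }
        return totalPrimaries == 0
            || (double) activePrimaries / totalPrimaries >= activePrimaryThreshold;
    }
}
```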
If the proposal sounds good, I can submit a PR.
With the cluster configuration you mention (1M shards), the major bottleneck IMO is the async shard fetch, which issues #nodes × #shards requests and which you note in point 3 is massively slow. It also bloats the JVM heap on the leader. Do you think optimising async shard fetch to batch requests, as a first step, would help your case of slow shard recovery?
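For context, the batching idea is essentially to send one metadata request per node carrying all of that node's pending shard IDs, instead of one request per (node, shard) pair. Below is a minimal standalone sketch of just that grouping step; NodeFetchRequest is a made-up placeholder, not an existing OpenSearch class.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Standalone illustration of batching shard-metadata fetches by node.
public final class ShardFetchBatcher {

    record NodeFetchRequest(String nodeId, List<String> shardIds) {}

    // One request per node (~400 in this cluster) instead of one per node/shard
    // pair (on the order of 400 nodes x 1M shards).
    static List<NodeFetchRequest> batchByNode(Map<String, List<String>> pendingShardsByNode) {
        return pendingShardsByNode.entrySet().stream()
                .map(e -> new NodeFetchRequest(e.getKey(), List.copyOf(e.getValue())))
                .collect(Collectors.toList());
    }
}
```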
The proposal you mention works around the problem; I would rather see whether we can optimise the cluster recovery process itself to scale with larger clusters.
@Bukhtawar Thanks for your reply.
We already tried batched async shard fetch in our environment (elastic/elasticsearch#80694) and recovery speed did improve a lot, but primary recovery is still not as fast as we expect. As I listed, the ReplicaShardAllocator deciders also take a lot of time in the unassignedShards loop, which slows down MasterService throughput.
I think batched shard fetch and primary-first recovery do not conflict; they can work together.
Thanks. Can you please share the stack traces (thread dump or profiler results) for the deciders that, based on your analysis, cause the slowdown or additional memory overhead? I want to understand whether there is scope to optimize those areas as well.