
recover primary shard first and start replica recovery after most of the primary started #2169

Closed
maosuhan opened this issue Feb 18, 2022 · 4 comments
Labels: distributed framework, enhancement

@maosuhan

In our company, we run a cluster with 1M shards across 400 nodes.
The data is business-critical, so availability matters a great deal. During full cluster recovery, we want primary shards to recover as soon as possible, and replica recovery to be deferred until most of the primaries have started.

Why we don't want to set cluster.routing.allocation.enable to primaries:

  1. It relies on deciders, and the deciders themselves add computation overhead.
  2. It relies on an SRE manually setting cluster.routing.allocation.enable back to all after all the primary shards are recovered.
  3. Fetching data for replica shards takes almost half an hour in our environment, because shards × nodes is too large. During that time, no shard can be recovered at all.

Our solution is to skip the replica recovery process entirely, including fetch data and recovery, until most of the primaries have started.
We add two new settings: cluster.routing.allocation.replica_recovery_wait_primary.enable and cluster.routing.allocation.replica_recovery_wait.threshold, where the threshold ranges from 0 to 1.0.
Replica recovery starts only once the percentage of active primary shards exceeds the threshold.
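A minimal sketch of how the proposed threshold gate could behave. This is not the actual OpenSearch decider API; the class and method names are hypothetical, and only the check described above (active primaries / total primaries vs. the configured threshold) is modeled:

```java
// Hypothetical sketch: gate replica recovery on the fraction of active
// primaries, per the proposed
// cluster.routing.allocation.replica_recovery_wait.threshold setting.
public class ReplicaRecoveryGate {
    private final double threshold; // valid range: 0.0 .. 1.0

    public ReplicaRecoveryGate(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        this.threshold = threshold;
    }

    /**
     * Replica allocation (including its async shard fetch) is allowed only
     * once activePrimaries / totalPrimaries reaches the threshold.
     */
    public boolean canRecoverReplicas(int activePrimaries, int totalPrimaries) {
        if (totalPrimaries == 0) {
            return true; // nothing to wait for
        }
        return (double) activePrimaries / totalPrimaries >= threshold;
    }
}
```

With a threshold of 0.9, replica recovery would stay blocked while, say, only 800 of 1000 primaries are active, and would unblock at 950.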

If the proposal sounds good, I can submit a PR.

maosuhan added the enhancement and untriaged labels on Feb 18, 2022
@Bukhtawar
Collaborator

With the cluster configuration you mention (1M shards), the major bottleneck IMO is the async shard fetch, which issues #nodes × #shards requests and which, as you note in point 3, is massively slow. It also bloats the JVM heap on the leader. Do you think optimising async shard fetch to batch requests would, as a first step, help your case of slow shard recovery?
The proposal you mention works around the problem; I would rather see whether we can optimise the cluster recovery process itself to scale to larger clusters.
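To illustrate why batching matters at this scale, here is a hypothetical sketch (not OpenSearch code; the names are invented). The naive scheme sends one fetch request per (node, shard) pair, while a batched scheme sends one request per node carrying all shard IDs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of batching async shard fetch: instead of one request
// per (node, shard) pair, group all shard IDs into a single request per node.
public class BatchedShardFetch {

    /** Naive scheme: one fetch request per node per shard. */
    public static long naiveRequestCount(int nodes, int shards) {
        return (long) nodes * shards;
    }

    /**
     * Batched scheme: each node receives one request carrying all shard IDs,
     * so the total request count drops from nodes * shards to nodes.
     */
    public static Map<String, List<String>> batchByNode(List<String> nodes,
                                                        List<String> shardIds) {
        Map<String, List<String>> perNode = new HashMap<>();
        for (String node : nodes) {
            perNode.put(node, new ArrayList<>(shardIds)); // one batch per node
        }
        return perNode;
    }
}
```

At the scale described in this issue (400 nodes, 1M shards), the naive scheme would amount to 400 million requests, versus 400 batched requests.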

@Bukhtawar
Collaborator

Relates to #2170

@maosuhan
Author

@Bukhtawar Thanks for your reply.
We already tried batched async shard fetch in our environment (elastic/elasticsearch#80694), and recovery speed did improve a lot. But primary recovery is still not as fast as we expected. As I listed above, the deciders in ReplicaShardAllocator also take a lot of time in the unassignedShards loop, which slows down MasterService throughput.

I think batched fetch and primary-first recovery don't conflict; they can work together.

@Bukhtawar
Collaborator

Thanks. Can you please share the stack traces (thread dump or profiler results) for the deciders that, based on your analysis, are causing the slowdown or additional memory overhead? I want to understand whether there is scope to optimize those areas as well.
