Shards relocating during rolling restarts #14387
Comments
Could you add the exact commands etc. that you used to test? I'm on a poor network and can't view the video. Thanks |
The relocating shards seem to be recoveries, not rebalances. I infer this because when I set the following, I see them all happening at once.
This is what i'm seeing after restarting a node - shards moving on and off.
TRACE logs show a lot of this:
|
As for the exact command for testing, all that I am doing is starting up 3 nodes and restarting one with
Then watching things move around with
|
I assigned it to @ywelsch. We will look into this and get back to you shortly. In the meantime, can you show all the commands you are executing, especially the one that: |
This is the exact command I used:
And I would see something like this on all nodes:
Shard relocations / recoveries begin after relocation is reenabled like this:
|
A short update - @clintongormley and I researched this. It has to do with a race condition between the gateway allocator and the cluster balancer. When the node comes back or allocation is re-enabled, the gateway allocator asks the node for information about its shard store. This is done asynchronously. While that request is in flight, the balanced allocator thinks the node is empty and assigns shards to it. Only later, when the gateway allocator assigns the missing shards back to the node, does the cluster rebalance again, moving shards off the node. Our idea for a fix is to disable balancing while there are in-flight data fetching requests... |
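As a rough illustration of the race described above, here is a toy model in Python. It is not Elasticsearch's allocator code: `ToyCluster`, `simulate_restart`, and the naive "move one shard from the fullest node to the emptiest" balancer are all hypothetical names and simplifications, and the 3-node, 4-shards-per-node layout is an assumption chosen only to make the effect visible.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyCluster:
    """Toy model (not Elasticsearch code): shard counts per node plus pending store fetches."""
    shards_on_node: Dict[str, int]
    in_flight_fetches: Set[str] = field(default_factory=set)
    relocations: int = 0

    def rebalance(self, wait_for_fetches: bool) -> None:
        # Proposed fix: skip balancing while any shard store fetch is still in flight.
        if wait_for_fetches and self.in_flight_fetches:
            return
        # Naive balancer: move one shard at a time from the fullest to the emptiest node.
        while True:
            fullest = max(self.shards_on_node, key=self.shards_on_node.get)
            emptiest = min(self.shards_on_node, key=self.shards_on_node.get)
            if self.shards_on_node[fullest] - self.shards_on_node[emptiest] <= 1:
                break
            self.shards_on_node[fullest] -= 1
            self.shards_on_node[emptiest] += 1
            self.relocations += 1

    def fetch_completed(self, node: str, shards_found_on_disk: int) -> None:
        # The async store fetch returns: the node's existing copies are assigned back to it.
        self.in_flight_fetches.discard(node)
        self.shards_on_node[node] += shards_found_on_disk


def simulate_restart(wait_for_fetches: bool) -> int:
    """Restart node C (which holds 4 shard copies on disk) and count relocations."""
    cluster = ToyCluster(
        shards_on_node={"A": 4, "B": 4, "C": 0},  # C looks empty until its store is fetched
        in_flight_fetches={"C"},
    )
    cluster.rebalance(wait_for_fetches)   # balancer runs while the fetch is in flight
    cluster.fetch_completed("C", 4)       # store data arrives, local copies are reused
    cluster.rebalance(wait_for_fetches)   # balancer runs again with full information
    return cluster.relocations


if __name__ == "__main__":
    print("relocations without the guard:", simulate_restart(wait_for_fetches=False))
    print("relocations with the guard:   ", simulate_restart(wait_for_fetches=True))
```

In this toy run the unguarded balancer first moves shards onto the seemingly empty node and then has to move shards off it again once the store data arrives, while the guarded variant performs no relocations at all.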
@bleskes makes sense to me - I will take a look at implementing this. |
…ilable This commit prevents running rebalance operations if the store allocator is still fetching async shard / store data, to prevent premature rebalance decisions which need to be reverted once shard store data is available. This typically happens on rolling restarts, which can make those restarts extremely painful. Closes elastic#14387
Tested these workarounds with good results: 1.x
2.0
|
Was this present in ES versions before 1.6? |
No, I don't think so. Back then we fetched data synchronously, so this couldn't happen. |
@s1monw Is this issue fixed in Elasticsearch 2.x? |
@bittusarkar yes see #14652 |
This behaviour is reproducible in v1.6.0 through 2.0.0.
The expectation is that no shard relocations occur during rolling restarts. However, shards are moving while the cluster is in a yellow health state.
Steps to reproduce:
At step 6, shards are observed to be relocating, in addition to any recovery by sync_id that has occurred. After recoveries and relocations, the cluster will change to green state. This was tested in slow motion by limiting bandwidth to one of the nodes in the cluster.
Relocations are not observed in a 2 node cluster, or when restarting the entire cluster.