Shards relocating during rolling restarts #14387

PhaedrusTheGreek · 2015-10-30T14:52:00Z

This behaviour is reproducible in v1.6.0 through 2.0.0.

During a rolling restart, no shard relocations are expected. However, shards are moving while the cluster is in a yellow health state.

Steps to reproduce:

1. Create a cluster with at least 3 nodes and 1 index with 2 shards + 1 replica (4 shards total), then index some data.
2. Stop all indexing.
3. Set allocation: none
4. _all/_flush/synced
5. Restart a single node.
6. Re-enable allocation.

At step 6, shards are observed to be relocating, in addition to any recovery by sync_id that has occurred. After recoveries and relocations, the cluster will change to green state. This was tested in slow motion by limiting bandwidth to one of the nodes in the cluster.

Relocations are not observed in a 2 node cluster, or when restarting the entire cluster.
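
For reference, the REST calls implied by the reproduction steps (disable allocation, synced flush, re-enable allocation) can be sketched as plain data. This is only an illustration: the function name is hypothetical, the `POST` method for the synced flush is an assumption (the steps only give the path), and the `persistent` scope matches the command posted later in this thread. Nothing is sent to a cluster.

```python
import json


def rolling_restart_requests():
    """Return the REST calls around a single node restart as (method, path, body) tuples."""

    def allocation(value):
        # Same settings body as the PUT /_cluster/settings command quoted in this thread.
        return ("PUT", "/_cluster/settings",
                {"persistent": {"cluster.routing.allocation.enable": value}})

    return [
        allocation("none"),                     # disable shard allocation
        ("POST", "/_all/_flush/synced", None),  # synced flush (HTTP method is an assumption)
        # The node restart itself happens outside the REST API.
        allocation("all"),                      # re-enable allocation
    ]


if __name__ == "__main__":
    for method, path, body in rolling_restart_requests():
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
```

The relocations described above are observed after the last call in this sequence.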

clintongormley · 2015-11-05T15:01:48Z

Hi @PhaedrusTheGreek

Could you add the exact commands you used to test? I'm on a poor network and can't view the video.

thanks

PhaedrusTheGreek · 2015-11-05T15:36:44Z

The relocating shards seem to be recoveries, not rebalances. I infer this because when I set the following, I see them all happening at once.

"cluster.routing.allocation.node_concurrent_recoveries" : 10

This is what I'm seeing after restarting a node: shards moving on and off.

index shard prirep state      docs   store ip           node                                                
big   0     r      RELOCATING 1026 605.7kb 192.168.0.2  Max -> 192.168.0.25 gUii9aw4QTW_CRP4Akg_Nw Scrier   
big   0     p      STARTED    1026 605.7kb 192.168.0.25 Arclight                                            
big   1     r      RELOCATING 1018 854.4kb 192.168.0.2  Max -> 192.168.0.25 8y8i2N0oQpag6zsVPm323g Arclight 
big   1     p      STARTED    1018 854.4kb 192.168.0.25 Scrier                                              
big2  2     r      STARTED     413 278.4kb 192.168.0.25 Arclight                                            
big2  2     p      RELOCATING  413 278.4kb 192.168.0.25 Scrier -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max    
big2  0     r      RELOCATING  405 270.6kb 192.168.0.2  Max -> 192.168.0.25 8y8i2N0oQpag6zsVPm323g Arclight 
big2  0     p      STARTED     405 270.6kb 192.168.0.25 Scrier                                              
big2  3     p      RELOCATING  405   269kb 192.168.0.25 Arclight -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max  
big2  3     r      STARTED     405   269kb 192.168.0.25 Scrier                                              
big2  1     r      RELOCATING  410 426.8kb 192.168.0.2  Max -> 192.168.0.25 gUii9aw4QTW_CRP4Akg_Nw Scrier   
big2  1     p      STARTED     410 426.8kb 192.168.0.25 Arclight                                            
big2  4     p      RELOCATING  411 443.7kb 192.168.0.25 Arclight -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max  
big2  4     r      STARTED     411 443.6kb 192.168.0.25 Scrier                                              
big3  2     r      STARTED     407 344.6kb 192.168.0.2  Max                                                 
big3  2     p      STARTED     407 344.6kb 192.168.0.25 Scrier                                              
big3  0     r      STARTED     406 319.2kb 192.168.0.25 Arclight                                            
big3  0     p      RELOCATING  406 319.3kb 192.168.0.25 Scrier -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max    
big3  3     r      STARTED     413 402.1kb 192.168.0.2  Max                                                 
big3  3     p      STARTED     413 402.1kb 192.168.0.25 Scrier                                              
big3  1     r      STARTED     411 276.8kb 192.168.0.2  Max                                                 
big3  1     p      STARTED     411 276.8kb 192.168.0.25 Arclight                                            
big3  4     r      STARTED     407 342.5kb 192.168.0.2  Max                                                 
big3  4     p      STARTED     407 342.5kb 192.168.0.25 Arclight

TRACE logs show a lot of this:

[2015-11-05 10:22:00,938][TRACE][indices.recovery         ] [Max] [big3][0] recovery completed from [Scrier][gUii9aw4QTW_CRP4Akg_Nw][Jasons-MacBook-Pro-3.local][inet[/192.168.0.25:9300]], took[2.6m]
   phase1: recovered_files [7] with total_size of [319.2kb], took [2.5m], throttling_wait [0s]
         : reusing_files   [0] with total_size of [0b]
   phase2: start took [19ms]
         : recovered [0] transaction log operations, took [0s]
   phase3: recovered [0] transaction log operations, took [1ms]

PhaedrusTheGreek · 2015-11-05T15:41:19Z

As for the exact commands for testing, all that I am doing is starting up 3 nodes and restarting one with

CTRL-C; bin/elasticsearch

Then watching things move around with

GET /_cat/shards?v

s1monw · 2015-11-06T10:49:06Z

I assigned it to @ywelsch. We will look into this and come back to you shortly. In the meantime, can you show all the commands you are executing, especially the one for "Set allocation: none"?

PhaedrusTheGreek · 2015-11-06T15:16:21Z

This is the exact command I used:

PUT /_cluster/settings
{
        "persistent" : {
            "cluster.routing.allocation.enable" : "none"
        }
}

And I would see something like this on all nodes:

[2015-10-30 10:54:51,429][INFO ][cluster.routing.allocation.decider] [Humus Sapien] updating [cluster.routing.allocation.enable] from [ALL] to [NONE]

Shard relocations / recoveries begin after allocation is re-enabled like this:

PUT /_cluster/settings
{
        "persistent" : {
            "cluster.routing.allocation.enable" : "all"
        }
}

bleskes · 2015-11-06T15:28:46Z

A short update: @clintongormley and I researched this. It has to do with a race condition between the gateway allocator and the cluster balancer. When the node comes back/allocation is enabled, the gateway allocator asks the node for information about its shard store. This is done asynchronously. While that request is in flight, the balanced allocator thinks the node is empty and assigns shards to it. Only later, when the gateway allocator assigns the missing shards back to the node, does the cluster rebalance again. Our idea for a fix was to disable balancing while there are in-flight data fetching requests...
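
To make the race easier to picture, here is a toy Python model of it. It is not Elasticsearch code: `ToyCluster`, `simulate`, the node names and the "balance by shard count" rule are all hypothetical simplifications, and shard identity is ignored. It only shows why balancing while a shard-store fetch is in flight causes moves that have to be undone, and how skipping the balancer during the fetch avoids them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    # Shard copies currently assigned per node, as the balancer sees them.
    assigned: dict
    # Copies a restarted node still has on disk but that are not assigned yet.
    on_disk: dict = field(default_factory=dict)
    # Nodes with an async shard-store fetch still in flight.
    fetching: set = field(default_factory=set)
    relocations: int = 0

    def rebalance(self, wait_for_fetches):
        """Move shards from the fullest to the emptiest node until counts differ by at most 1."""
        if wait_for_fetches and self.fetching:
            return  # the proposed fix: no balancing while store data is unknown
        while True:
            fullest = max(self.assigned, key=lambda n: len(self.assigned[n]))
            emptiest = min(self.assigned, key=lambda n: len(self.assigned[n]))
            if len(self.assigned[fullest]) - len(self.assigned[emptiest]) <= 1:
                return
            self.assigned[emptiest].append(self.assigned[fullest].pop())
            self.relocations += 1

    def finish_fetch(self, node):
        """Store data arrives: the on-disk copies are assigned back to the node."""
        self.fetching.discard(node)
        self.assigned[node].extend(self.on_disk.pop(node, []))


def simulate(wait_for_fetches):
    """Restart node C in a 3-node toy cluster and count the relocations that follow."""
    cluster = ToyCluster(
        assigned={"A": ["s0", "s1", "s2", "s3"], "B": ["s4", "s5", "s6", "s7"], "C": []},
        on_disk={"C": ["s8", "s9", "s10", "s11"]},
        fetching={"C"},
    )
    cluster.rebalance(wait_for_fetches)  # runs while the fetch for C is in flight
    cluster.finish_fetch("C")            # C's own copies are assigned back to it
    cluster.rebalance(wait_for_fetches)  # the cluster rebalances again
    return cluster.relocations


if __name__ == "__main__":
    print("balancing during fetch:", simulate(wait_for_fetches=False))
    print("balancing deferred:    ", simulate(wait_for_fetches=True))
```

In this toy setup `simulate(False)` counts 4 relocations (two onto the seemingly empty node, two back off it), while `simulate(True)` counts 0.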

s1monw · 2015-11-06T18:57:21Z

@bleskes makes sense to me - I will take a look at implementing this.

…ilable This commit prevents running rebalance operations if the store allocator is still fetching async shard / store data to prevent pre-mature rebalance decisions which need to be reverted once shard store data is available. This is typically happening on rolling restarts which can make those restarts extremely painful. Closes elastic#14387

PhaedrusTheGreek · 2015-11-06T19:21:55Z

Tested these workarounds with good results:

1.x

 "cluster.routing.allocation.balance.threshold" : "100.0f" (During Node Restart)
 "cluster.routing.allocation.balance.threshold" : "1.0f" (Return to Default)

2.0

"cluster.routing.rebalance.enable" : "none" (During Node Restart)
"cluster.routing.rebalance.enable" : "all" (Return to Default)
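
As a sketch of how these workarounds might be applied, the helper below builds a body for `PUT /_cluster/settings`. The setting names and values are the ones listed above. The helper name and the `transient` scope are assumptions, since the comment does not say which scope was used.

```python
import json

# version -> (setting name, value during node restart, default value)
WORKAROUNDS = {
    "1.x": ("cluster.routing.allocation.balance.threshold", "100.0f", "1.0f"),
    "2.0": ("cluster.routing.rebalance.enable", "none", "all"),
}


def workaround_body(version, restarting, scope="transient"):
    """Build a cluster-settings body; the default scope is an assumption."""
    key, during_restart, default = WORKAROUNDS[version]
    return {scope: {key: during_restart if restarting else default}}


if __name__ == "__main__":
    # Body to send before restarting a node on 2.0, then the one that restores the default.
    print(json.dumps(workaround_body("2.0", restarting=True), indent=2))
    print(json.dumps(workaround_body("2.0", restarting=False), indent=2))
```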

astefan · 2015-11-10T07:49:45Z

Was this present in ES versions before 1.6?

s1monw · 2015-11-10T08:09:11Z

> Was this present in ES versions before 1.6?

No, I don't think so. Back then we fetched data synchronously, so this couldn't happen.

bittusarkar · 2016-10-27T11:04:16Z

@s1monw Is this issue fixed in Elasticsearch 2.x?

s1monw · 2016-10-27T11:14:50Z

@bittusarkar yes, see #14652

clintongormley added discuss :Allocation labels Nov 5, 2015

s1monw assigned ywelsch Nov 6, 2015

s1monw removed the discuss label Nov 6, 2015

s1monw mentioned this issue Nov 6, 2015

Only allow rebalance operations to run if all shard store data is available #14591

Merged

s1monw added a commit to s1monw/elasticsearch that referenced this issue Nov 9, 2015

add IT for elastic#14387

7b5e323

s1monw closed this as completed in #14591 Nov 10, 2015

s1monw mentioned this issue Nov 10, 2015

Only allow rebalance operations to run if all shard store data is available #14652

Merged

ichernev mentioned this issue Aug 11, 2016

Fix threadpool settings example #19932

Closed

PhaedrusTheGreek added v2.1.0 v1.7.4 labels Oct 27, 2016

lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label Feb 13, 2018