[CI] RelocationIT.testIndexAndRelocateConcurrently timeouts #50508
Pinging @elastic/es-distributed (:Distributed/Recovery)
I was able to reproduce this with the seed from the failure here and running in a loop with concurrent => failure log with
I can fairly reliably reproduce this on
Thanks @original-brownbear :). I will take a look.
I think I know what's causing the test failure here. An interesting observation, though, is that with the change to not refresh under the engine lock anymore, refreshes (in particular
Maybe something to adapt in Lucene not to treat concurrent closing as a fatal exception... Now on to the real issue here: it looks like a shard is reallocated to a node that was previously hosting a recovering copy of the shard,
i.e. the node thinks nothing needs to be done. This means we ultimately end up in a situation where the master thinks the new shard copy is still recovering, but it has been removed on the node. I will open a fix for this.
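To make the described interleaving concrete, here is a minimal, self-contained Java sketch of the race as a toy model (this is not Elasticsearch code; the class, method, and shard names are hypothetical): one thread handles the failure of the old recovering copy and removes it from the node, while another thread applies a cluster state that allocates the same shard to that node and, seeing a copy already present, does nothing.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Toy model of the race: the node's shard copies are just a set of shard ids.
// Thread A handles the failure of the old recovering copy and removes it;
// thread B applies a new cluster state that allocates the same shard to this node
// and, if a copy is already present, leaves it alone ("nothing needs to be done").
public class ShardAllocationRaceSketch {

    static final Set<String> shardCopiesOnNode = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) throws Exception {
        shardCopiesOnNode.add("shard-0"); // the old, still-recovering copy

        CountDownLatch start = new CountDownLatch(1);

        Thread failOldCopy = new Thread(() -> {
            awaitQuietly(start);
            shardCopiesOnNode.remove("shard-0"); // failure handling removes the copy
        });

        Thread applyNewAllocation = new Thread(() -> {
            awaitQuietly(start);
            // New allocation to the same node: ensure a copy exists, creating one only
            // if none is present. If the old copy is still there at this point, this is
            // a no-op, and the removal above can then delete the only copy on the node.
            shardCopiesOnNode.add("shard-0");
        });

        failOldCopy.start();
        applyNewAllocation.start();
        start.countDown();
        failOldCopy.join();
        applyNewAllocation.join();

        // Simulated master view: it still expects an initializing copy of shard-0 here.
        System.out.println("node has a copy of shard-0: " + shardCopiesOnNode.contains("shard-0"));
        // Across runs this prints true or false depending on which thread wins the race;
        // the "false" outcome mirrors the state described above: the master thinks the
        // shard is still recovering, but it no longer exists on the node.
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```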
A failure of a recovering shard can race with a new allocation of the shard, and cause the new allocation to be failed as well. This can result in a shard being marked as initializing in the cluster state, but not exist on the node anymore. Closes #50508
Times out with
timed out waiting for green state
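For reference, "green state" here is the standard cluster health condition the test waits for. A minimal standalone Java illustration of the equivalent cluster health API call follows; the localhost:9200 endpoint is an assumption made so the example is runnable, not something taken from this CI failure.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustration only: what "waiting for green state" amounts to at the API level.
// The endpoint and parameters are the standard cluster health API; the host and
// port are assumptions for the sake of a runnable example.
public class WaitForGreen {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30s"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // A response with "timed_out": false means green was reached within the timeout;
        // "timed_out": true is the API-level counterpart of this test failure.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```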
Reproduce line
Build scans
[7.5.2] https://gradle-enterprise.elastic.co/s/spwlu6566rbue
[6.8.7] https://gradle-enterprise.elastic.co/s/evfxzrqeohj5q
[7.x] https://gradle-enterprise.elastic.co/s/ezlcf2b36tiew
This particular (latest) occurrence was found on 7.x - https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=ubuntu-16.04&&immutable/471/console - but this test times out every few days across the 7.x branches and master.
The timeout was increased in #46554, which was merged on the 11th of September 2019, but the new timeout still isn't enough (maybe there is a different problem, or maybe the timeout configuration needs some attention to speed up the relocation?)
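If the second possibility (slow relocation rather than a bug) were the explanation, the usual knobs to look at would be the dynamic recovery and relocation settings. Below is a hedged sketch of adjusting them through the cluster settings API; the host, port, and concrete values are placeholders for illustration and are not taken from this issue.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the kind of recovery/relocation settings one could tune when relocation
// is too slow for a test timeout. Host, port, and the concrete values are assumptions
// for the example, not a recommendation from this issue.
public class TuneRecoverySettings {
    public static void main(String[] args) throws Exception {
        String body = "{"
                + "\"transient\": {"
                + "  \"indices.recovery.max_bytes_per_sec\": \"100mb\","
                + "  \"cluster.routing.allocation.node_concurrent_recoveries\": 4"
                + "}"
                + "}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```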