NetworkDisruptionIT testJobRelocation failing #35052
Pinging @elastic/ml-core
It looks like the bug in the test is that it needs to set
Pinging @elastic/es-distributed
It's a little subtle. Here we started the first node with

What puzzles me is that we allow 3 seconds for all the nodes to find each other, which should be plenty of time to discover the existing master:
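(For context: the 3-second window referred to above matches the zen discovery ping timeout of this era. A minimal sketch of where that knob lives, assuming the 6.x setting name `discovery.zen.ping_timeout`; this is illustration, not the test's actual configuration:)

```java
import org.elasticsearch.common.settings.Settings;

public class PingTimeoutSketch {
    public static void main(String[] args) {
        // Zen discovery runs each pinging round for discovery.zen.ping_timeout
        // (3s by default in 6.x) -- the 3-second window mentioned above.
        Settings nodeSettings = Settings.builder()
                .put("discovery.zen.ping_timeout", "3s") // default, shown explicitly
                .build();
        System.out.println(nodeSettings.get("discovery.zen.ping_timeout"));
    }
}
```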
I will try and reproduce this with more logging. It's not ML-specific, although the form of this test does make it more likely to occur here: if we said we wanted 5 nodes up-front using the
I managed to reproduce a similar failure (after ≥500 iterations) without extra logging. With extra logging enabled it's been running overnight and consistently passing. I have pushed f7760dd to enable more logging in CI, which will tell us what really happens in the next failure (or fix the problem 😉).

It could just be that things were going a bit slowly on the CI worker. This is a failure mode that can occur when growing a cluster too quickly if the timing is sufficiently bad. We could fix this by growing the cluster more slowly (e.g. one node at a time), or else set
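A sketch of the two mitigations mentioned above, written against the `ESIntegTestCase`/`internalCluster()` test API; the `discovery.zen.minimum_master_nodes` setting name is my guess at the truncated suggestion, not something the comment confirms:

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.test.ESIntegTestCase;

public class GrowClusterSketchIT extends ESIntegTestCase {

    // Mitigation 1: grow the cluster one node at a time, waiting for each
    // node to join before starting the next.
    public void testGrowOneNodeAtATime() throws Exception {
        internalCluster().startNode();
        for (int i = 0; i < 4; i++) {
            internalCluster().startNode();
            ensureStableCluster(i + 2);
        }
    }

    // Mitigation 2 (assumed setting name -- the comment above is truncated):
    // require a quorum of 3 of the 5 master-eligible nodes before a master
    // can be elected, so a split cannot form while the cluster grows.
    public void testGrowWithQuorumGuard() throws Exception {
        Settings quorum = Settings.builder()
                .put("discovery.zen.minimum_master_nodes", 3)
                .build();
        internalCluster().startNodes(5, quorum);
        ensureStableCluster(5);
    }
}
```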
Sometimes the cluster formed here will split-brain when it grows to 5 nodes. This could be a timing issue, or could be something going wrong in discovery, so this change asks for more logs. Relates #35052
Thanks for investigating this @DaveCTurner.
There's no particular reason why this test creates a 1-node cluster and then adds 4 more nodes to it. It could still test what it's supposed to be testing by immediately creating a 5-node cluster; the way it creates the cluster now is just due to ignorance of the subtleties of the different ways of creating the cluster.

I guess this is extremely unlikely to happen to a real customer, because they wouldn't be running 5 nodes on the same VM and the timeout for the nodes to find each other is 30 seconds. So if you get fed up investigating I'm very happy to just switch to
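For reference, a sketch of declaring all 5 nodes up front; the `@ClusterScope` annotation shown here is my reading of the (elided) annotation discussed in this thread, assuming the standard `ESIntegTestCase` scaffolding:

```java
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.ESIntegTestCase.ClusterScope;
import org.elasticsearch.test.ESIntegTestCase.Scope;

// All 5 nodes are declared up front, so the framework brings the cluster to
// its target size itself rather than the test growing a 1-node cluster.
@ClusterScope(scope = Scope.TEST, numDataNodes = 5)
public class JobRelocationSketchIT extends ESIntegTestCase {

    public void testJobRelocation() throws Exception {
        ensureStableCluster(5); // cluster already formed at its declared size
        // ... exercise ML job relocation under network disruption here ...
    }
}
```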
I think it'd be good to give this a few weeks to fail again with more logging, just so we can be sure that we're not doing something incorrect with how discovery interacts with

By the way, I think what the test is doing here is legitimate: this is a bug in the framework, I think, and the suggestion to use the annotation is merely a workaround.
Thanks @DaveCTurner. There's no need to create a separate test case. I'm happy to use this ML one for this purpose.
I forgot that I left my CI running this job yesterday, but was just notified that iteration 10337 failed again with trace logging enabled, and the failure now makes sense despite the 3-second pinging delay. It is a consequence of how today's discovery implementation does not gossip symmetrically: it only shares pings that it has received, not their responses. In this case, only one node,
Once a pinging round has started we do not check the unicast hosts provider again.
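A toy model (plain Java, not Elasticsearch code) of the two behaviours just described: pings are forwarded but their responses are not, and the hosts list is read once per pinging round. It shows how one node can end up knowing about everyone while the others each know only about it:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not Elasticsearch code) of the behaviours described above:
// 1. a node forwards the pings it receives, but never the responses;
// 2. the unicast hosts list is snapshotted once, when the round starts.
public class GossipToy {

    public static void main(String[] args) {
        // Hosts list as seen at the start of each node's pinging round:
        // nodes 1-4 started before the file listed anyone but node 0.
        Map<Integer, Set<Integer>> hostsSnapshot = new HashMap<>();
        hostsSnapshot.put(0, Set.of());
        for (int n = 1; n <= 4; n++) {
            hostsSnapshot.put(n, Set.of(0));
        }

        // received[n] = nodes that n has received a ping from.
        Map<Integer, Set<Integer>> received = new HashMap<>();
        for (int n = 0; n <= 4; n++) {
            received.put(n, new HashSet<>());
        }

        // One pinging round: each node pings the nodes in its snapshot.
        // The target learns about the sender, but the sender learns nothing
        // it can forward onward: the ping response is not gossiped.
        for (int n = 1; n <= 4; n++) {
            for (int target : hostsSnapshot.get(n)) {
                received.get(target).add(n);
            }
        }

        // Node 0 has heard from everyone; nodes 1-4 have heard from nobody,
        // so only node 0 can see the whole cluster.
        received.forEach((node, peers) ->
                System.out.println("node " + node + " heard from " + peers));
    }
}
```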
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts files each time a node starts its transport service. This does mean that a number of nodes can start and perform their first pinging round without any unicast hosts which, if the timing is unlucky and a lot of nodes are all started at the same time, can lead to a split brain as in #35052. Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider would always have yielded the existing hosts, so the timing would have to have been implausibly unlucky. Since #33554, however, it's more likely because the race occurs between the start of the first round of pinging and the writing of the unicast hosts file. It is realistic that new nodes will be configured with the existing nodes from startup, so this change reinstates that behaviour. Closes #35052.
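A sketch of the ordering the fix restores, under the assumption of a file-based hosts provider reading a `unicast_hosts.txt` file; the helper below is hypothetical, not the test framework's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the ordering fix (hypothetical helper, not the framework code).
// Before: the hosts file was rewritten after each node started its transport
// service, so a fast first pinging round could observe an empty hosts list.
// After: the existing nodes are on disk before the new node starts at all.
public class UnicastHostsSketch {

    static void writeHostsFile(Path configDir, List<String> existingNodes) throws IOException {
        // Must happen BEFORE the new node starts, so that its very first
        // pinging round already sees every existing node.
        Files.write(configDir.resolve("unicast_hosts.txt"), existingNodes);
    }

    public static void main(String[] args) throws IOException {
        Path configDir = Files.createTempDirectory("node-config");
        writeHostsFile(configDir, List.of("127.0.0.1:9301", "127.0.0.1:9302"));
        // startNode(configDir) would only be called after this point.
        System.out.println(Files.readAllLines(configDir.resolve("unicast_hosts.txt")));
    }
}
```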
@DaveCTurner This looks like the same issue happening back in 5.6 (https://internal-ci.elastic.co/job/elastic+x-pack-elasticsearch+5.6+multijob-windows-compatibility/508/console)? I assume this won't be backported that far, but wanted to confirm.
This test fails from time to time:
Log : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+multijob-unix-compatibility/os=debian/8/console
The failure is not reproducible for me.
Looks like there is a problem forming the cluster: a split brain, with two master nodes detected (node_t3 and node_t0) and not enough master nodes for a quorum:
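For reference, the quorum arithmetic behind that failure: with 5 master-eligible nodes a master needs floor(5/2) + 1 = 3 supporters, so node_t3 and node_t0 cannot both have held a quorum. A minimal sketch:

```java
// Quorum for N master-eligible nodes: floor(N / 2) + 1.
// With N = 5 that is 3, so two simultaneous masters (node_t3 and node_t0)
// cannot both be backed by a quorum -- hence the split brain in the logs.
public class QuorumSketch {

    static int quorum(int masterEligibleNodes) {
        return masterEligibleNodes / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorum(5)); // prints 3
    }
}
```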