[CI] MixedClusterTest are failing while waiting for a 4-node cluster to form #27233
Comments
This one on master looks very similar:
|
And another one on master: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu/1665/console This one has HTTP 408 errors instead of the 503 from above, and I cannot find a MasterNotDiscoveredException, but the rest looks very similar.
|
And this on 6.x branch: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/297/console |
This has been failing for quite a while now. |
@rjernst Would you be able to look at this and see if we can change something in the build, or reassign to anybody who might be better to move this forward? |
This seems to be related as well: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/604/console |
':qa:mixed-cluster:v5.6.4-SNAPSHOT#mixedClusterTestCluster#wait' |
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/753/ looks related, but different. In that run, nodes 0, 2 and 3 formed a cluster and node 1 was left out in the cold. Discussed this with @ywelsch and I think we'll try increasing the logging level, as at the moment there's not much to go on. |
I've pushed e04e5ab |
Meanwhile, I managed to reproduce locally using
The failure, like in #27233 (comment), was that nodes 0, 2 and 3 formed a cluster and node 1 was left out. Here is the tail of node 1's log:
So node 1 received pings from nodes 0 and 2 and hence started an election. Meanwhile node 3 also started an election with nodes 0 and 2 and won the race, so they do not vote for node 1, so after 30 seconds it times out...
... and starts another round of pinging. If this pinging had succeeded then everything would have been fine, but the test only waits for 40 seconds for the cluster to start, and two rounds of pinging plus 30 seconds plus various other overheads manages to exceed the 40-second test timeout. Two simple resolutions would be to increase the 40-second timeout in the test or to reduce A further possibility would be to stagger the startup of the nodes enough to reduce the frequency with which this race occurs. |
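A rough sketch of the timing argument above, in Python. Only the 30-second election timeout and the 40-second test timeout come from this thread; the ping round duration and the startup overhead are assumed values picked purely for illustration, and `time_to_join` is a hypothetical helper, not part of the build.

```python
# Figures taken from the discussion above.
TEST_TIMEOUT_S = 40       # how long the test waits for the cluster to form
ELECTION_TIMEOUT_S = 30   # how long node 1 waits for votes before giving up

# Assumptions for illustration only (not stated in this thread).
PING_ROUND_S = 3          # assumed duration of one round of pinging
STARTUP_OVERHEAD_S = 5    # assumed JVM/node startup and other overheads


def time_to_join(ping_round_s, election_timeout_s, failed_elections, overhead_s=0.0):
    """Seconds until a node joins after losing `failed_elections` elections.

    Each lost election costs one ping round plus the full election timeout;
    the final, successful attempt costs one more ping round.
    """
    lost = failed_elections * (ping_round_s + election_timeout_s)
    return overhead_s + lost + ping_round_s


if __name__ == "__main__":
    for failures in (0, 1):
        elapsed = time_to_join(PING_ROUND_S, ELECTION_TIMEOUT_S, failures, STARTUP_OVERHEAD_S)
        verdict = "exceeds" if elapsed > TEST_TIMEOUT_S else "fits within"
        print(f"{failures} lost election(s): {elapsed}s, {verdict} the {TEST_TIMEOUT_S}s test timeout")
```

Under these assumed numbers a single lost election is enough to push node 1 past the test timeout, which is why either a longer test timeout or a shorter election timeout would hide the race.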
@albertzaharovits I'm not sure about https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+multijob-windows-compatibility/322/console - that just looks like a complete failure to form a cluster. Nodes 0 and 1 managed to find each other, but 2 and 3 never got in touch with anyone else. Could also be timeout related, if the CI machine was a slow one. |
PR #26911 set minimum_master_nodes from number_of_nodes to (number_of_nodes / 2) + 1 in our REST tests. This has led to test failures (see #27233) as the REST tests only configure the first node in its unicast.hosts pinging list (see explanation here: #27233 (comment)). Until we have a proper fix for this, I'm reverting the change in #26911.
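To make the effect of that change concrete, here is a minimal Python sketch of the two minimum_master_nodes values for the 4-node cluster in this issue. It assumes all four nodes are master-eligible; `quorum` and `can_form_without` are hypothetical helper names used only for this illustration.

```python
def quorum(number_of_nodes):
    """The (number_of_nodes / 2) + 1 value from PR #26911, using integer division."""
    return number_of_nodes // 2 + 1


def can_form_without(nodes, excluded, minimum_master_nodes):
    """True if the remaining nodes are numerous enough to elect a master on their own."""
    remaining = [n for n in nodes if n != excluded]
    return len(remaining) >= minimum_master_nodes


if __name__ == "__main__":
    nodes = [0, 1, 2, 3]  # assumption: every node is master-eligible

    # Before the PR: minimum_master_nodes == number_of_nodes, so no subset can go ahead alone.
    print(can_form_without(nodes, excluded=1, minimum_master_nodes=len(nodes)))

    # After the PR: a quorum of 3 lets nodes 0, 2 and 3 form a cluster and leave node 1 out.
    print(quorum(len(nodes)), can_form_without(nodes, excluded=1, minimum_master_nodes=quorum(len(nodes))))
```

With the quorum setting, three nodes electing a master without the fourth is a legitimate outcome, so the race described earlier in the thread becomes possible; with minimum_master_nodes equal to the node count it cannot happen.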
I've pushed #27344 which should fix the test failures seen here. I would prefer to set |
@ywelsch it looks like it failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+intake/131/console in a very similar way. Could you take a look? |
That's a completely different test suite. Also, just because cluster formation did not work does not mean it's the same failure; there could be any number of reasons :) |
There currently are many instances of build failures where the Mixed Cluster Tests are failing because the test is waiting on a connection to check the cluster health but this fails with:
Example for failures: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+java9-periodic/1185/consoleFull
The above failure also has:
In one of the node logs.