
[CI] MixedClusterTest are failing while waiting for a 4-node cluster to form #27233

Closed
cbuescher opened this issue Nov 2, 2017 · 16 comments
Assignees
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI

Comments

@cbuescher
Member

There are currently many build failures in which the mixed-cluster tests fail: the test polls the cluster health endpoint while waiting for a 4-node cluster to form, but the request fails with:

12:48:36 [ant:get] Error opening connection java.io.IOException: Server returned HTTP response code: 503 for URL: http://[::1]:40959/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow

Example for failures: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+java9-periodic/1185/consoleFull

12:47:52   Task has not declared any outputs.
12:48:06 [ant:echo] ==> [Thu Nov 02 12:48:06 UTC 2017] checking health: http://[::1]:40959/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow
12:48:06 [ant:get] Getting: http://[::1]:40959/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow
12:48:06 [ant:get] To: /var/lib/jenkins/workspace/elastic+elasticsearch+6.0+java9-periodic/qa/mixed-cluster/build/cluster/v5.6.4-SNAPSHOT#mixedClusterTestCluster node0/cwd/wait.success
12:48:36 [ant:get] Error opening connection java.io.IOException: Server returned HTTP response code: 503 for URL: http://[::1]:40959/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow
[... the same 503 error repeated many times ...]
12:48:36 Node 0 output:
12:48:36 |-----------------------------------------
12:48:36 |  failure marker exists: false
12:48:36 |  pid file exists: true
12:48:36 |  http ports file exists: true
12:48:36 [ant:get] Can't get http://[::1]:40959/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow to /var/lib/jenkins/workspace/elastic+elasticsearch+6.0+java9-periodic/qa/mixed-cluster/build/cluster/v5.6.4-SNAPSHOT#mixedClusterTestCluster node0/cwd/wait.success

One of the node logs in the failure above also contains the following:

12:48:36 |    [2017-11-02T12:48:36,421][WARN ][r.suppressed             ] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=>=4}
12:48:36 |    org.elasticsearch.discovery.MasterNotDiscoveredException: null
12:48:36 |    	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) [elasticsearch-5.6.4-SNAPSHOT.jar:5.6.4-SNAPSHOT]
12:48:36 |    	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.6.4-SNAPSHOT.jar:5.6.4-SNAPSHOT]
12:48:36 |    	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.6.4-SNAPSHOT.jar:5.6.4-SNAPSHOT]
12:48:36 |    	at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1056) [elasticsearch-5.6.4-SNAPSHOT.jar:5.6.4-SNAPSHOT]
12:48:36 |    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.4-SNAPSHOT.jar:5.6.4-SNAPSHOT]
12:48:36 |    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) [?:?]
12:48:36 |    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) [?:?]
12:48:36 |    	at java.lang.Thread.run(Thread.java:844) [?:?]

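For context, the `wait` task keeps fetching the health URL until it gets a 2xx response or the build's timeout expires; non-2xx responses (the 503s above) just mean "not ready yet". The retry loop can be sketched as follows — a hypothetical illustration with a stubbed health check standing in for the real HTTP request, not the actual build logic:

```python
# Sketch of the cluster-health wait loop the build performs via ant:get.
# health_check is a stub for the real request to
# /_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow.

def wait_for_cluster(health_check, max_attempts=6):
    """Poll until the health endpoint returns 200, or give up."""
    for attempt in range(1, max_attempts + 1):
        status = health_check()
        if status == 200:
            return attempt  # cluster formed: wait.success gets written
        # non-200 (e.g. 503 while no master is elected): retry
    raise TimeoutError("cluster never reached >=4 nodes / yellow status")

# A cluster that becomes healthy on the third poll:
responses = iter([503, 503, 200])
assert wait_for_cluster(lambda: next(responses)) == 3

# A wedged cluster, as in this issue, returns 503 until the build fails:
try:
    wait_for_cluster(lambda: 503, max_attempts=4)
except TimeoutError as e:
    print("build fails:", e)
```

The 30-second gap between "Getting:" and the first error in the log above matches the health API's own request timeout, after which the wait task retries.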

@cbuescher cbuescher added >test-failure Triaged test failures from CI >test Issues or PRs that are addressing/adding tests labels Nov 2, 2017
@cbuescher
Member Author

This one on master looks very similar:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/1301/consoleFull

13:33:58 Executing task ':qa:mixed-cluster:v6.1.0-SNAPSHOT#mixedClusterTestCluster#wait' (up-to-date check took 0.0 secs) due to:
13:33:58   Task has not declared any outputs.
13:34:12 [ant:echo] ==> [Thu Nov 02 06:34:12 PDT 2017] checking health: http://[::1]:54323/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow
13:34:12 [ant:get] Getting: http://[::1]:54323/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow
13:34:12 [ant:get] To: /private/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob-darwin-compatibility/qa/mixed-cluster/build/cluster/v6.1.0-SNAPSHOT#mixedClusterTestCluster node0/cwd/wait.success
13:34:42 [ant:get] Error opening connection java.io.IOException: Server returned HTTP response code: 503 for URL: http://[::1]:54323/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow
[... the same 503 error repeated ...]
...
13:34:42 Node 0 output:
[...]
13:34:42 |-----------------------------------------
13:34:42 |  failure marker exists: false
13:34:42 |  pid file exists: true
13:34:42 |  http ports file exists: true
13:34:42 |  transport ports file exists: true
13:34:42 |
13:34:42 |  [ant output]
13:34:42 |
13:34:42 |  [log]
13:34:42 |    warning: ignoring JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
13:34:42 |    [2017-11-02T06:33:52,707][INFO ][o.e.n.Node               ] [node-0] initializing ...
13:34:42 |    [2017-11-02T06:33:52,785][INFO ][o.e.e.NodeEnvironment    ] [node-0] using [1] data paths, mounts [[/ (/dev/disk0s2)]], net usable_space [857.4gb], net total_space [930.5gb], types [hfs]
13:34:42 |    [2017-11-02T06:33:52,785][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]
13:34:42 |    [2017-11-02T06:33:52,787][INFO ][o.e.n.Node               ] [node-0] node name [node-0], node ID [MVV26-YjRoeo3tgRJ-vBTA]
13:34:42 |    [2017-11-02T06:33:52,788][INFO ][o.e.n.Node               ] [node-0] version[6.1.0-SNAPSHOT], pid[29965], build[f928d61/2017-11-02T13:31:46.401Z], OS[Mac OS X/10.12.6/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_141/25.141-b15]
13:34:42 |    [2017-11-02T06:33:52,788][INFO ][o.e.n.Node               ] [node-0] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -XX:+HeapDumpOnOutOfMemoryError, -Xms512m, -Xmx512m, -ea, -esa, -Des.path.home=/private/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob-darwin-compatibility/qa/mixed-cluster/build/cluster/v6.1.0-SNAPSHOT#mixedClusterTestCluster node0/elasticsearch-6.1.0-SNAPSHOT, -Des.path.conf=/private/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob-darwin-compatibility/qa/mixed-cluster/build/cluster/v6.1.0-SNAPSHOT#mixedClusterTestCluster node0/elasticsearch-6.1.0-SNAPSHOT/config]
13:34:42 |    [2017-11-02T06:33:52,788][WARN ][o.e.n.Node               ] [node-0] version [6.1.0-SNAPSHOT] is a pre-release version of Elasticsearch and is not suitable for production
13:34:42 |    [2017-11-02T06:33:53,847][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [aggs-matrix-stats]
13:34:42 |    [2017-11-02T06:33:53,847][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [analysis-common]
13:34:42 |    [2017-11-02T06:33:53,847][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [ingest-common]
13:34:42 |    [2017-11-02T06:33:53,847][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [lang-expression]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [lang-mustache]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [lang-painless]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [mapper-extras]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [parent-join]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [percolator]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [reindex]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [repository-url]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [transport-netty4]
13:34:42 |    [2017-11-02T06:33:53,848][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [tribe]
13:34:42 |    [2017-11-02T06:33:53,849][INFO ][o.e.p.PluginsService     ] [node-0] no plugins loaded
13:34:42 |    [2017-11-02T06:33:55,503][INFO ][o.e.d.DiscoveryModule    ] [node-0] using discovery type [zen]
13:34:42 |    [2017-11-02T06:33:56,141][INFO ][o.e.n.Node               ] [node-0] initialized
13:34:42 |    [2017-11-02T06:33:56,141][INFO ][o.e.n.Node               ] [node-0] starting ...
13:34:42 |    [2017-11-02T06:33:56,316][INFO ][o.e.t.TransportService   ] [node-0] publish_address {127.0.0.1:54321}, bound_addresses {[fe80::1]:54319}, {[::1]:54320}, {127.0.0.1:54321}
13:34:42 |    [2017-11-02T06:33:56,347][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-0] publish_address {127.0.0.1:54324}, bound_addresses {[fe80::1]:54322}, {[::1]:54323}, {127.0.0.1:54324}
13:34:42 |    [2017-11-02T06:33:56,350][INFO ][o.e.n.Node               ] [node-0] started
13:34:42 |    [2017-11-02T06:33:59,358][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] not enough master nodes discovered during pinging (found [[Candidate{node={node-0}{MVV26-YjRoeo3tgRJ-vBTA}{EfKwddXlRR20DNZwVtHdpA}{127.0.0.1}{127.0.0.1:54321}{testattr=test}, clusterStateVersion=-1}]], but needed [3]), pinging again
13:34:42 |    [2017-11-02T06:34:02,423][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] not enough master nodes discovered during pinging (found [[Candidate{node={node-0}{MVV26-YjRoeo3tgRJ-vBTA}{EfKwddXlRR20DNZwVtHdpA}{127.0.0.1}{127.0.0.1:54321}{testattr=test}, clusterStateVersion=-1}]], but needed [3]), pinging again
13:34:42 |    [2017-11-02T06:34:05,441][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] not enough master nodes discovered during pinging (found [[Candidate{node={node-0}{MVV26-YjRoeo3tgRJ-vBTA}{EfKwddXlRR20DNZwVtHdpA}{127.0.0.1}{127.0.0.1:54321}{testattr=test}, clusterStateVersion=-1}]], but needed [3]), pinging again
13:34:42 |    [2017-11-02T06:34:08,499][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] not enough master nodes discovered during pinging (found [[Candidate{node={node-0}{MVV26-YjRoeo3tgRJ-vBTA}{EfKwddXlRR20DNZwVtHdpA}{127.0.0.1}{127.0.0.1:54321}{testattr=test}, clusterStateVersion=-1}]], but needed [3]), pinging again
13:34:42 |    [2017-11-02T06:34:12,341][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node-0] no known master node, scheduling a retry
13:34:42 |    [2017-11-02T06:34:42,403][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node-0] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
13:34:42 |    [2017-11-02T06:34:42,418][WARN ][r.suppressed             ] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=>=4}
13:34:42 |    org.elasticsearch.discovery.MasterNotDiscoveredException: null
13:34:42 |    	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) [elasticsearch-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
13:34:42 |    	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
13:34:42 |    	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
13:34:42 |    	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
13:34:42 |    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
13:34:42 |    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
13:34:42 |    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
13:34:42 |    	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]

@cbuescher
Member Author

And another one on master: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu/1665/console

This one shows HTTP 408 errors instead of the 503 above, and I cannot find a MasterNotDiscoveredException, but the rest looks very similar.

13:26:08 [ant:get] Error opening connection java.io.IOException: Server returned HTTP response code: 408 for URL: http://[::1]:35632/_cluster/health?wait_for_nodes=>=4&wait_for_status=yellow

@cbuescher
Member Author

@ywelsch
Contributor

ywelsch commented Nov 2, 2017

This has been failing for quite a while now.
The issue here is that we have a 4-node cluster, minimum_master_nodes is 3, and the first node of the cluster (node-0) is the only host on the unicast.hosts list.
I'm not sure why, but node-1, node-2, and node-3, learning about each other through node-0, form a cluster among themselves. Once they've formed a cluster, they stop pinging node-0, which now has no way of learning about the other nodes. The test fails while waiting for a 4-node cluster to form.
We could either set minimum_master_nodes to 4, or extend the unicast hosts list to include more than just the first node.
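The quorum arithmetic behind that diagnosis and the first proposed fix can be sketched as follows — a hypothetical model of the zen-discovery election rule, not Elasticsearch code; node names mirror the scenario above:

```python
# Model: a group of mutually-discovered master-eligible nodes can elect a
# master iff it contains at least minimum_master_nodes of them.
from itertools import combinations

NODES = ["node-0", "node-1", "node-2", "node-3"]

def can_elect(subset, minimum_master_nodes):
    """True if this group of nodes reaches the election quorum."""
    return len(subset) >= minimum_master_nodes

# With minimum_master_nodes=3, {node-1, node-2, node-3} is a valid quorum
# that excludes node-0 -- the only unicast seed -- so node-0 can be left out
# once the others stop pinging it, and the 4-node cluster never forms.
assert can_elect({"node-1", "node-2", "node-3"}, 3)

# With minimum_master_nodes=4, no proper subset can elect a master;
# only the full 4-node cluster can form.
proper_subsets = [set(c) for r in range(1, 4) for c in combinations(NODES, r)]
assert not any(can_elect(s, 4) for s in proper_subsets)
assert can_elect(set(NODES), 4)
```

The second fix attacks the other side of the problem: with every node on the unicast hosts list, node-0 would keep being pinged even after the others form a quorum without it.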

@cbuescher cbuescher added the :Delivery/Build Build or test infrastructure label Nov 2, 2017
@cbuescher
Member Author

@rjernst Would you be able to look at this and see if we can change something in the build, or reassign to anybody better placed to move this forward?

@cbuescher cbuescher changed the title [CI] MixedClusterTest are failing waiting on cluster connection [CI] MixedClusterTest are failing while waiting for a 4-node cluster to form Nov 2, 2017
@dimitris-athanasiou
Contributor

This seems to be related as well:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/604/console

@dimitris-athanasiou
Contributor

As well as this:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+release-tests/176/console

@albertzaharovits
Contributor

@DaveCTurner
Contributor

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/753/ looks related, but different. In that run, nodes 0, 2 and 3 formed a cluster and node 1 was left out in the cold. Discussed this with @ywelsch and I think we'll try increasing the logging level, as at the moment there's not much to go on.

@ywelsch
Contributor

ywelsch commented Nov 9, 2017

I've pushed e04e5ab

@DaveCTurner
Contributor

Meanwhile, I managed to reproduce this locally using:

while gradle :qa:mixed-cluster:v6.1.0-SNAPSHOT#mixedClusterTestRunner -Dtests.seed=516D102959ABE203 -Dtests.class=org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT -Dtests.method="test {p0=indices.shrink/20_source_mapping/Shrink index ignores target template mapping}" -Dtests.security.manager=true -Dtests.locale=et -Dtests.timezone=Asia/Beirut ; do :; done

The failure, like in #27233 (comment), was that nodes 0, 2 and 3 formed a cluster and node 1 was left out. Here is the tail of node 1's log:

[2017-11-09T14:09:44,243][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {#zen_unicast_[::1]:63689_0#}{j39CE0vISIeTHh2XbSU3xw}{0:0:0:0:0:0:0:1}{[::1]:63689}
[2017-11-09T14:09:44,251][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {#zen_unicast_[::1]:63689_0#}{j39CE0vISIeTHh2XbSU3xw}{0:0:0:0:0:0:0:1}{[::1]:63689}: [ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[3], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[5], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[8], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[15], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node 
[{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[17], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:45,077][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {#zen_unicast_[::1]:63689_0#}{j39CE0vISIeTHh2XbSU3xw}{0:0:0:0:0:0:0:1}{[::1]:63689}
[2017-11-09T14:09:45,077][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] opening connection to [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}]
[2017-11-09T14:09:45,077][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}
[2017-11-09T14:09:45,080][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[18], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[5], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:45,083][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {#zen_unicast_[::1]:63689_0#}{j39CE0vISIeTHh2XbSU3xw}{0:0:0:0:0:0:0:1}{[::1]:63689}: [ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[5], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[8], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[15], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[18], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node 
[{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[20], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:45,131][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}
[2017-11-09T14:09:45,133][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}: [ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[5], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[8], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[15], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[18], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node 
[{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[21], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:46,078][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}
[2017-11-09T14:09:46,078][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {#zen_unicast_[::1]:63689_0#}{j39CE0vISIeTHh2XbSU3xw}{0:0:0:0:0:0:0:1}{[::1]:63689}
[2017-11-09T14:09:46,079][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] sending to {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}
[2017-11-09T14:09:46,079][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[18], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[23], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[7], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[8], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:46,093][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {#zen_unicast_[::1]:63689_0#}{j39CE0vISIeTHh2XbSU3xw}{0:0:0:0:0:0:0:1}{[::1]:63689}: [ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[8], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[15], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[18], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node 
[{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-2}{OJm_zDfrTueCYDF95mGUKw}{fisftQAhQb-qeVStt3vfMQ}{127.0.0.1}{127.0.0.1:63721}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[23], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[7], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[7], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[26], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:46,093][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [1] received response from {node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}: [ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[8], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[15], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[18], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node 
[{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[4], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-2}{OJm_zDfrTueCYDF95mGUKw}{fisftQAhQb-qeVStt3vfMQ}{127.0.0.1}{127.0.0.1:63721}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[23], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[7], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[25], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:09:47,079][TRACE][o.e.d.z.ZenDiscovery     ] [node-1] full ping responses:
        --> ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[26], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}
        --> ping_response{node [{node-2}{OJm_zDfrTueCYDF95mGUKw}{fisftQAhQb-qeVStt3vfMQ}{127.0.0.1}{127.0.0.1:63721}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}
[2017-11-09T14:09:47,081][DEBUG][o.e.d.z.ZenDiscovery     ] [node-1] filtered ping responses: (ignore_non_masters [false])
        --> ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[26], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}
        --> ping_response{node [{node-2}{OJm_zDfrTueCYDF95mGUKw}{fisftQAhQb-qeVStt3vfMQ}{127.0.0.1}{127.0.0.1:63721}{testattr=test}], id[1], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}
        --> ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[9], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}
[2017-11-09T14:09:47,082][TRACE][o.e.d.z.ZenDiscovery     ] [node-1] candidate Candidate{node={node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}, clusterStateVersion=-1} won election
[2017-11-09T14:09:47,082][DEBUG][o.e.d.z.ZenDiscovery     ] [node-1] elected as master, waiting for incoming joins ([2] needed)
[2017-11-09T14:09:47,083][TRACE][o.e.d.z.NodeJoinController] [node-1] not enough joins for election. Got [0], required [2]
[2017-11-09T14:10:17,086][TRACE][o.e.d.z.NodeJoinController] [node-1] timed out waiting to be elected. waited [30s]. pending master node joins [0]
[2017-11-09T14:10:17,092][TRACE][o.e.d.z.NodeJoinController] [node-1] stopping election ([timed out waiting to be elected])
[2017-11-09T14:10:17,099][DEBUG][o.e.c.s.MasterService    ] [node-1] processing [zen-disco-election-stop [timed out waiting to be elected]]: execute
[2017-11-09T14:10:17,102][TRACE][o.e.d.z.ZenDiscovery     ] [node-1] failed while waiting for nodes to join, rejoining
org.elasticsearch.cluster.NotMasterException: election stopped [zen-disco-election-stop [timed out waiting to be elected]]
[2017-11-09T14:10:17,104][DEBUG][o.e.c.s.MasterService    ] [node-1] processing [zen-disco-election-stop [timed out waiting to be elected]]: took [4ms] no change in cluster state
[2017-11-09T14:10:17,104][TRACE][o.e.d.z.NodeJoinController] [node-1] starting an election context, will accumulate joins
[2017-11-09T14:10:17,104][TRACE][o.e.d.z.ZenDiscovery     ] [node-1] starting to ping
[2017-11-09T14:10:17,105][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] resolved host [[::1]:63689] to [[::1]:63689]
[2017-11-09T14:10:17,105][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] sending to {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}
[2017-11-09T14:10:17,105][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] opening connection to [{#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}]
[2017-11-09T14:10:17,105][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] received response from {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[11], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:10:17,120][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] sending to {#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}
[2017-11-09T14:10:17,121][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] received response from {#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[35], master [{node-3}{0DHD1bJzSpKIJTWbKoyO-g}{-vvW1WIfTyC5L-QO2p_uWQ}{127.0.0.1}{127.0.0.1:63731}{testattr=test}],cluster_state_version [2], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:10:18,106][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] sending to {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}
[2017-11-09T14:10:18,106][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] sending to {#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}
[2017-11-09T14:10:18,106][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] received response from {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[13], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:10:18,108][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] received response from {#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[36], master [{node-3}{0DHD1bJzSpKIJTWbKoyO-g}{-vvW1WIfTyC5L-QO2p_uWQ}{127.0.0.1}{127.0.0.1:63731}{testattr=test}],cluster_state_version [2], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:10:19,105][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] sending to {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}
[2017-11-09T14:10:19,105][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] sending to {#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}
[2017-11-09T14:10:19,106][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] received response from {node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[14], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[15], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]
[2017-11-09T14:10:19,107][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [2] received response from {#zen_unicast_[::1]:63689_0#}{ZyrLye1YRvKyE6Yhlj0cgg}{0:0:0:0:0:0:0:1}{[::1]:63689}: [ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[10], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[12], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-1}{3pVseLUbRUe1q83D8CUIjg}{U_xJZxXER229ZSWkoPc-Mg}{127.0.0.1}{127.0.0.1:63710}{testattr=test}], id[14], master [null],cluster_state_version [-1], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}, ping_response{node [{node-0}{lKGBv8BlQc2Xxf0ByfEXgw}{5Y6tZSVaQOOGiPGe_FEZ-A}{127.0.0.1}{127.0.0.1:63690}{testattr=test}], id[37], master [{node-3}{0DHD1bJzSpKIJTWbKoyO-g}{-vvW1WIfTyC5L-QO2p_uWQ}{127.0.0.1}{127.0.0.1:63731}{testattr=test}],cluster_state_version [2], cluster_name[qa_mixed-cluster_v6.1.0-SNAPSHOT#mixedClusterTestCluster]}]

So node 1 received pings from nodes 0 and 2 and therefore started an election. Meanwhile node 3 also started an election involving nodes 0 and 2 and won the race, so those two nodes never voted for node 1, whose election timed out after 30 seconds...
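The split election above can be sketched as a toy model. The node names and the "[2] needed" join quorum come from the log lines; the voting logic is a deliberate simplification for illustration, not Zen discovery's actual algorithm.

```python
# Toy model of the split election seen in the logs. Assumption: each
# candidate needs REQUIRED_JOINS incoming joins to become master, per the
# "elected as master, waiting for incoming joins ([2] needed)" log line.
REQUIRED_JOINS = 2

# candidate -> set of nodes that sent it a join request
joins = {"node-1": set(), "node-3": set()}

# Nodes 0 and 2 happened to join node-3's election, not node-1's.
for voter in ("node-0", "node-2"):
    joins["node-3"].add(voter)

elected = sorted(c for c, v in joins.items() if len(v) >= REQUIRED_JOINS)
stalled = sorted(c for c, v in joins.items() if len(v) < REQUIRED_JOINS)
print(elected)  # ['node-3'] wins the race
print(stalled)  # ['node-1'] waits 30s for joins that never arrive, then times out
```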

[2017-11-09T14:10:17,086][TRACE][o.e.d.z.NodeJoinController] [node-1] timed out waiting to be elected. waited [30s]. pending master node joins [0]

... and started another round of pinging. If that round had succeeded then everything would have been fine, but the test only waits 40 seconds for the cluster to start, and two rounds of pinging plus the 30-second join wait plus various other overheads manage to exceed the 40-second test timeout.
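A back-of-the-envelope timing budget makes the squeeze concrete. Only the 30-second join timeout and the 40-second test timeout come from this discussion; the per-round ping duration is an assumed illustrative value.

```python
# Rough timing budget for the losing candidate. PING_ROUND_S is an
# assumption for illustration; the 30 s and 40 s values come from the issue.
PING_ROUND_S = 3        # assumed: one round of unicast pinging
WAIT_FOR_JOINS_S = 30   # discovery.zen.master_election.wait_for_joins_timeout
TEST_TIMEOUT_S = 40     # how long the build waits for wait_for_nodes=>=4

# Losing candidate: ping, wait 30 s for joins that never come, ping again.
worst_case = PING_ROUND_S + WAIT_FOR_JOINS_S + PING_ROUND_S
headroom = TEST_TIMEOUT_S - worst_case
print(worst_case, headroom)  # 36 4 -> only ~4 s left for all other startup overhead
```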

Two simple resolutions would be to increase the 40-second timeout in the test or to reduce discovery.zen.master_election.wait_for_joins_timeout to something much less than 30 seconds. I don't especially like either: bumping the 40 seconds will increase the test runtime accordingly, and reducing the timeout takes us further away from a "standard" configuration, but of the two I think I prefer the latter. @ywelsch just opened #27332 to do this.
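The second option would amount to a node-settings override along these lines; the concrete value here is an assumption for illustration, and the actual change lives in #27332:

```yaml
# Hypothetical test-cluster override; the value chosen in #27332 may differ.
discovery.zen.master_election.wait_for_joins_timeout: 5s
```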

A further possibility would be to stagger the startup of the nodes enough to reduce the frequency with which this race occurs.

@DaveCTurner
Contributor

@albertzaharovits I'm not sure about https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.0+multijob-windows-compatibility/322/console - that just looks like a complete failure to form a cluster. Nodes 0 and 1 managed to find each other, but nodes 2 and 3 never got in touch with anyone else. It could also be timeout-related if the CI machine was a slow one.

ywelsch added a commit that referenced this issue Nov 10, 2017
PR #26911 set minimum_master_nodes from number_of_nodes to (number_of_nodes / 2) + 1 in our REST tests. This has led to test failures (see #27233) as the REST tests only configure the first node in its unicast.hosts pinging list (see explanation here: #27233 (comment)). Until we have a proper fix for this, I'm reverting the change in #26911.
@ywelsch
Contributor

ywelsch commented Nov 10, 2017

I've pushed #27344 which should fix the test failures seen here. I would prefer to set minimum_master_nodes to a more realistic value in the REST tests, but that's going to require more work.
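For reference, the quorum formula that #26911 applied to the REST tests (and that the revert rolls back) is just:

```python
# minimum_master_nodes as set by #26911, per the commit message above:
def minimum_master_nodes(number_of_nodes: int) -> int:
    return number_of_nodes // 2 + 1

# For the 4-node mixed cluster in this issue:
print(minimum_master_nodes(4))  # 3 (the tests previously used all 4 nodes)
```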

ywelsch closed this as completed Nov 10, 2017
@imotov
Contributor

imotov commented May 30, 2018

@ywelsch it looks like it failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+intake/131/console in a very similar way. Could you take a look?

imotov reopened this May 30, 2018
@ywelsch
Contributor

ywelsch commented May 31, 2018

That's a completely different test suite. Also, just because cluster formation did not work does not mean it's the same failure; there could be many reasons :)
Can you open a new issue and assign @nik9000? He worked on this last.

ywelsch closed this as completed Jun 6, 2018
mark-vieira added the Team:Delivery label Nov 11, 2020