
[CI] RemoteClusterClientTests testEnsureWeReconnect failing with NoSuchRemoteClusterException #52029

Closed
mark-vieira opened this issue Feb 7, 2020 · 10 comments · Fixed by #52823
Labels: :Distributed Coordination/Network, Team:Distributed (Obsolete), >test-failure

Comments

@mark-vieira (Contributor) commented Feb 7, 2020

This test has failed 4 times in the past three days after passing basically 100% of the time for the past month. Looks suspicious. Happening on both master and 7.x.

:server:test » org.elasticsearch.transport.RemoteClusterClientTests » testEnsureWeReconnect (1.662s)
org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
java.util.concurrent.ExecutionException: org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
Caused by: org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
[2020-02-07T11:10:26,252][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] before test
[2020-02-07T11:10:26,808][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:14000}, bound_addresses {[::1]:14000}, {127.0.0.1:14000}
[2020-02-07T11:10:27,048][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:14001}, bound_addresses {[::1]:14001}, {127.0.0.1:14001}
[2020-02-07T11:10:27,519][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] after test
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.transport.RemoteClusterClientTests.testEnsureWeReconnect" -Dtests.seed=EE4961D2D47604F4 -Dtests.security.manager=true -Dtests.locale=zh-Hans-SG -Dtests.timezone=Asia/Pontianak -Dcompiler.java=13

This same time period has also seen a rather significant jump in average test execution times, so perhaps there is something going on here.

https://gradle-enterprise.elastic.co/scans/tests?failures.failureClassification=non_verification&list.offset=0&list.size=50&list.sortColumn=startTime&list.sortOrder=desc&search.buildToolType=gradle&search.buildToolType=maven&search.startTimeMax=1581055741639&search.startTimeMin=1580450941631&search.tags=CI&search.tags=not:nested&search.tags=not:pull-request&tests.container=org.elasticsearch.transport.RemoteClusterClientTests&tests.sortField=FAILED&tests.test=testEnsureWeReconnect&tests.unstableOnly&trends.section=overview&trends.timeResolution=day&viewer.tzOffset=-480

@mark-vieira added the :Distributed Coordination/Network and >test-failure labels Feb 7, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Network)

@albertzaharovits (Contributor)

Only muted in master d4c609b

@pugnascotia (Contributor)

Failed on 7.6 too - https://gradle-enterprise.elastic.co/s/p6yxu3p22ni52

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Feb 26, 2020
Currently the remote connection manager delegates the size() call to the
underlying cluster connection manager. This introduces the possibility that
the call will return 1 before the nodeConnection method has been triggered
to add the connection to the remote connection list. This can cause issues,
as the ensureConnected method checks the connection manager's size and
executes synchronously if the size is > 0. This leads to a potential
cluster-not-connected exception while we are still waiting for the
connection-opened callback to be triggered.

This commit fixes the issue by having the remote connection manager report
the size of its own connection list instead of delegating to the underlying
connection manager.

Fixes elastic#52029.
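
For anyone trying to follow the race from the commit message alone, here is a minimal, hypothetical sketch of the timing window it describes. The class and method names (ClusterConnectionManager, RemoteConnectionManager, sizeDelegating, sizeOwnList) are simplified stand-ins rather than the real Elasticsearch code, and the asynchronous node-connected callback is modelled as an explicitly deferred Runnable:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real classes; illustrative only, not the
// actual Elasticsearch implementation.
public class RemoteSizeRaceSketch {

    static class ClusterConnectionManager {
        private int openConnections = 0;

        // The underlying manager counts the connection as soon as it is
        // opened and hands back the callback that would normally run later,
        // once the node-connected listener fires.
        Runnable openConnection(String node, Runnable nodeConnectedCallback) {
            openConnections++;
            return nodeConnectedCallback;
        }

        int size() {
            return openConnections;
        }
    }

    static class RemoteConnectionManager {
        final ClusterConnectionManager delegate = new ClusterConnectionManager();
        final List<String> connectedNodes = new ArrayList<>();

        Runnable connect(String node) {
            return delegate.openConnection(node, () -> connectedNodes.add(node));
        }

        // Buggy variant: delegating size() counts the connection before the
        // callback has added the node to connectedNodes.
        int sizeDelegating() {
            return delegate.size();
        }

        // Fixed variant: report the remote manager's own list, which is only
        // populated once the node-connected callback has run.
        int sizeOwnList() {
            return connectedNodes.size();
        }
    }

    public static void main(String[] args) {
        RemoteConnectionManager remote = new RemoteConnectionManager();
        Runnable pendingCallback = remote.connect("remote-node");

        // ensureConnected() treats size() > 0 as "already connected" and
        // proceeds synchronously; with the delegating size this can happen
        // before the node is visible, which is the failure mode above.
        System.out.println("delegating size: " + remote.sizeDelegating()); // 1
        System.out.println("own-list size:   " + remote.sizeOwnList());    // 0

        pendingCallback.run(); // the deferred callback finally registers the node
        System.out.println("own-list size after callback: " + remote.sizeOwnList()); // 1
    }
}
```

Running it prints a delegating size of 1 while the remote manager's own list is still empty, which is exactly the window in which ensureConnected can proceed synchronously and the caller ends up with the no-such-remote-cluster error seen in the test.
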
Tim-Brooks added a commit that referenced this issue Feb 26, 2020
Tim-Brooks added a commit that referenced this issue Mar 4, 2020
@mayya-sharipova (Contributor)

Another failure on master today. Reopening the issue.

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part1/4097/console
Build Scans: https://gradle-enterprise.elastic.co/s/hijlawbrzpxnu

Stack trace:

09:23:34 org.elasticsearch.transport.RemoteClusterClientTests > testEnsureWeReconnect FAILED
09:23:34     java.util.concurrent.ExecutionException: org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
09:23:34         at __randomizedtesting.SeedInfo.seed([C01B5CF26B407FDA:1498B3C6DC018873]:0)
09:23:34         at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
09:23:34         at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
09:23:34         at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
09:23:34         at org.elasticsearch.transport.RemoteClusterClientTests.testEnsureWeReconnect(RemoteClusterClientTests.java:111)

@Tim-Brooks (Contributor)

I have merged #54934. That PR may not completely fix this issue, but it will help expose the underlying cause if the test fails again. For now we are in wait-and-see mode.
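
For context on what "exposing the underlying cause" can look like: judging from the stack trace @tlrx posted below, the change appears to make the sniff handler fail its listener with a descriptive IllegalStateException when it cannot open any connection to the remote cluster. The snippet below is only a rough, hypothetical sketch of that pattern; the names are simplified and it is not the actual #54934 diff:

```java
import java.util.List;

// Rough, hypothetical sketch of the diagnostic pattern: fail the caller's
// listener with a descriptive exception when a sniff round cannot open a
// single connection, instead of letting the problem surface later as a
// generic "no such remote cluster" error.
public class SniffDiagnosticsSketch {

    interface ConnectListener {
        void onResponse();
        void onFailure(Exception e);
    }

    static void handleSniffedNodes(String clusterAlias, List<String> reachableNodes, ConnectListener listener) {
        if (reachableNodes.isEmpty()) {
            // Message mirrors the one seen in the later CI failure.
            listener.onFailure(new IllegalStateException(
                "Unable to open any connections to remote cluster [" + clusterAlias + "]"));
            return;
        }
        listener.onResponse();
    }

    public static void main(String[] args) {
        handleSniffedNodes("test", List.of(), new ConnectListener() {
            @Override public void onResponse() { System.out.println("connected"); }
            @Override public void onFailure(Exception e) { System.out.println("failed: " + e.getMessage()); }
        });
    }
}
```
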

@tlrx (Member) commented Apr 17, 2020

@tbrooks8 The test failed today on CI but with the exception you added in #54934:

java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Unable to open any connections to remote cluster [test]
        at __randomizedtesting.SeedInfo.seed([7E33386F2D392BD:D360DCB245926514]:0)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
        at org.elasticsearch.transport.RemoteClusterClientTests.testEnsureWeReconnect(RemoteClusterClientTests.java:113)

        Caused by:
        java.lang.IllegalStateException: Unable to open any connections to remote cluster [test]
            at org.elasticsearch.transport.SniffConnectionStrategy$SniffClusterStateResponseHandler.handleNodes(SniffConnectionStrategy.java:367)
            at org.elasticsearch.transport.SniffConnectionStrategy$SniffClusterStateResponseHandler.handleResponse(SniffConnectionStrategy.java:329)
            at org.elasticsearch.transport.SniffConnectionStrategy$SniffClusterStateResponseHandler.handleResponse(SniffConnectionStrategy.java:309)
            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101)
            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101)
            at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:206)
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:691)
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base/java.lang.Thread.run(Thread.java:834)

I can't reproduce the issue locally.
Build scan: https://gradle-enterprise.elastic.co/s/nni5qkxjegck6

@iverase (Contributor) commented Apr 21, 2020

@rjernst added the Team:Distributed (Obsolete) label May 4, 2020
@markharwood (Contributor)

A case on 7.7 this morning
https://gradle-enterprise.elastic.co/s/5rxifwbiqaudq

@Tim-Brooks (Contributor)

I believe this has been fixed by #56654. Closing.
