
[CI] RemoteClusterClientTests testEnsureWeReconnect failing with NoSuchRemoteClusterException #52029

Closed
mark-vieira opened this issue Feb 7, 2020 · 10 comments · Fixed by #52823
Labels: :Distributed Coordination/Network, Team:Distributed (Obsolete), >test-failure

Comments

@mark-vieira (Contributor) commented Feb 7, 2020

This test has failed 4 times in the past three days after passing basically 100% of the time for the past month. Looks suspicious. Happening on both master and 7.x.

:server:test » org.elasticsearch.transport.RemoteClusterClientTests » testEnsureWeReconnect (1.662s)
org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
java.util.concurrent.ExecutionException: org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
Caused by: org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
[2020-02-07T11:10:26,252][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] before test
[2020-02-07T11:10:26,808][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:14000}, bound_addresses {[::1]:14000}, {127.0.0.1:14000}
[2020-02-07T11:10:27,048][INFO ][o.e.t.TransportService   ] [testEnsureWeReconnect] publish_address {127.0.0.1:14001}, bound_addresses {[::1]:14001}, {127.0.0.1:14001}
[2020-02-07T11:10:27,519][INFO ][o.e.t.RemoteClusterClientTests] [testEnsureWeReconnect] after test
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.transport.RemoteClusterClientTests.testEnsureWeReconnect" -Dtests.seed=EE4961D2D47604F4 -Dtests.security.manager=true -Dtests.locale=zh-Hans-SG -Dtests.timezone=Asia/Pontianak -Dcompiler.java=13

This same time period has also seen a rather significant jump in average test execution times, so perhaps there is something going on here.

https://gradle-enterprise.elastic.co/scans/tests?failures.failureClassification=non_verification&list.offset=0&list.size=50&list.sortColumn=startTime&list.sortOrder=desc&search.buildToolType=gradle&search.buildToolType=maven&search.startTimeMax=1581055741639&search.startTimeMin=1580450941631&search.tags=CI&search.tags=not:nested&search.tags=not:pull-request&tests.container=org.elasticsearch.transport.RemoteClusterClientTests&tests.sortField=FAILED&tests.test=testEnsureWeReconnect&tests.unstableOnly&trends.section=overview&trends.timeResolution=day&viewer.tzOffset=-480

@mark-vieira added the :Distributed Coordination/Network and >test-failure labels Feb 7, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Network)

@albertzaharovits (Contributor)

Only muted in master d4c609b

@pugnascotia (Contributor)

Failed on 7.6 too - https://gradle-enterprise.elastic.co/s/p6yxu3p22ni52

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Feb 26, 2020
Currently the remote connection manager delegates the size() call to the
underlying cluster connection manager. This introduces the possibility that
the call will return 1 before the nodeConnection method has been triggered
to add the connection to the remote connection list. This can cause issues,
as the ensureConnected method checks the connection manager's size and
executes synchronously if the size is > 0. This leads to a potential
cluster-not-connected exception while we are still waiting for the
connection-opened callback to be triggered.

This commit fixes the issue by having the remote connection manager report
the size of its own connection list instead of delegating to the underlying
connection manager.

Fixes elastic#52029.
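
For anyone trying to follow the race from the commit message alone, here is a minimal, hypothetical sketch of the timing window it describes. The class and method names (ClusterConnectionManager, RemoteConnectionManager, sizeDelegating, sizeOwnList) are simplified stand-ins rather than the real Elasticsearch code, and the asynchronous node-connected callback is modelled as an explicitly deferred Runnable:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real classes; illustrative only, not the
// actual Elasticsearch implementation.
public class RemoteSizeRaceSketch {

    static class ClusterConnectionManager {
        private int openConnections = 0;

        // The underlying manager counts the connection as soon as it is
        // opened and hands back the callback that would normally run later,
        // once the node-connected listener fires.
        Runnable openConnection(String node, Runnable nodeConnectedCallback) {
            openConnections++;
            return nodeConnectedCallback;
        }

        int size() {
            return openConnections;
        }
    }

    static class RemoteConnectionManager {
        final ClusterConnectionManager delegate = new ClusterConnectionManager();
        final List<String> connectedNodes = new ArrayList<>();

        Runnable connect(String node) {
            return delegate.openConnection(node, () -> connectedNodes.add(node));
        }

        // Buggy variant: delegating size() counts the connection before the
        // callback has added the node to connectedNodes.
        int sizeDelegating() {
            return delegate.size();
        }

        // Fixed variant: report the remote manager's own list, which is only
        // populated once the node-connected callback has run.
        int sizeOwnList() {
            return connectedNodes.size();
        }
    }

    public static void main(String[] args) {
        RemoteConnectionManager remote = new RemoteConnectionManager();
        Runnable pendingCallback = remote.connect("remote-node");

        // ensureConnected() treats size() > 0 as "already connected" and
        // proceeds synchronously; with the delegating size this can happen
        // before the node is visible, which is the failure mode above.
        System.out.println("delegating size: " + remote.sizeDelegating()); // 1
        System.out.println("own-list size:   " + remote.sizeOwnList());    // 0

        pendingCallback.run(); // the deferred callback finally registers the node
        System.out.println("own-list size after callback: " + remote.sizeOwnList()); // 1
    }
}
```

Running it prints a delegating size of 1 while the remote manager's own list is still empty, which is exactly the window in which ensureConnected can proceed synchronously and the caller ends up with the no-such-remote-cluster error seen in the test.
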
Tim-Brooks added a commit that referenced this issue Feb 26, 2020
Tim-Brooks added a commit that referenced this issue Mar 4, 2020
@mayya-sharipova (Contributor)

Another failure on master today. Reopening the issue.

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part1/4097/console
Build Scans: https://gradle-enterprise.elastic.co/s/hijlawbrzpxnu

Stack trace:

09:23:34 org.elasticsearch.transport.RemoteClusterClientTests > testEnsureWeReconnect FAILED
09:23:34     java.util.concurrent.ExecutionException: org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [test]
09:23:34         at __randomizedtesting.SeedInfo.seed([C01B5CF26B407FDA:1498B3C6DC018873]:0)
09:23:34         at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
09:23:34         at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
09:23:34         at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
09:23:34         at org.elasticsearch.transport.RemoteClusterClientTests.testEnsureWeReconnect(RemoteClusterClientTests.java:111)

@Tim-Brooks (Contributor)

I have merged #54934. That PR may not completely fix this issue, but it will help expose the underlying cause if the test fails again. For now we are in wait-and-see mode.
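
For context on what "exposing the underlying cause" can look like: judging from the stack trace @tlrx posted below, the change appears to make the sniff handler fail its listener with a descriptive IllegalStateException when it cannot open any connection to the remote cluster. The snippet below is only a rough, hypothetical sketch of that pattern; the names are simplified and it is not the actual #54934 diff:

```java
import java.util.List;

// Rough, hypothetical sketch of the diagnostic pattern: fail the caller's
// listener with a descriptive exception when a sniff round cannot open a
// single connection, instead of letting the problem surface later as a
// generic "no such remote cluster" error.
public class SniffDiagnosticsSketch {

    interface ConnectListener {
        void onResponse();
        void onFailure(Exception e);
    }

    static void handleSniffedNodes(String clusterAlias, List<String> reachableNodes, ConnectListener listener) {
        if (reachableNodes.isEmpty()) {
            // Message mirrors the one seen in the later CI failure.
            listener.onFailure(new IllegalStateException(
                "Unable to open any connections to remote cluster [" + clusterAlias + "]"));
            return;
        }
        listener.onResponse();
    }

    public static void main(String[] args) {
        handleSniffedNodes("test", List.of(), new ConnectListener() {
            @Override public void onResponse() { System.out.println("connected"); }
            @Override public void onFailure(Exception e) { System.out.println("failed: " + e.getMessage()); }
        });
    }
}
```
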

@tlrx (Member) commented Apr 17, 2020

@tbrooks8 The test failed today on CI but with the exception you added in #54934:

java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Unable to open any connections to remote cluster [test]
        at __randomizedtesting.SeedInfo.seed([7E33386F2D392BD:D360DCB245926514]:0)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
        at org.elasticsearch.transport.RemoteClusterClientTests.testEnsureWeReconnect(RemoteClusterClientTests.java:113)

        Caused by:
        java.lang.IllegalStateException: Unable to open any connections to remote cluster [test]
            at org.elasticsearch.transport.SniffConnectionStrategy$SniffClusterStateResponseHandler.handleNodes(SniffConnectionStrategy.java:367)
            at org.elasticsearch.transport.SniffConnectionStrategy$SniffClusterStateResponseHandler.handleResponse(SniffConnectionStrategy.java:329)
            at org.elasticsearch.transport.SniffConnectionStrategy$SniffClusterStateResponseHandler.handleResponse(SniffConnectionStrategy.java:309)
            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101)
            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101)
            at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:206)
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:691)
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base/java.lang.Thread.run(Thread.java:834)

I can't reproduce the issue locally.
Build scan: https://gradle-enterprise.elastic.co/s/nni5qkxjegck6

@iverase (Contributor) commented Apr 21, 2020

@rjernst added the Team:Distributed (Obsolete) label May 4, 2020
@markharwood (Contributor)

A case on 7.7 this morning
https://gradle-enterprise.elastic.co/s/5rxifwbiqaudq

@Tim-Brooks (Contributor)

I believe this has been fixed by #56654. Closing.
