[CI] TestClustersPluginIT.testMultiNode failure #41256

Closed
henningandersen opened this issue Apr 16, 2019 · 13 comments · Fixed by #41340 or #44056
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@henningandersen
Contributor

This test failed on my master PR build. I have not been able to reproduce it locally. Unsure if it relates to #37889.

This seems to be the primary problem:

[2019-04-16T09:36:36,076][WARN ][o.e.c.NodeConnectionsService] [multiNode-3] failed to connect to {multiNode-1}{76UnrzNcSQCGO2byg8l0-g}{pWe_gZtjS_22EyAtKnzMDA}{127.0.0.1}{127.0.0.1:33513}{testattr=test, ml.machine_memory=63323983872, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)
  org.elasticsearch.transport.ConnectTransportException: [multiNode-1][127.0.0.1:33513] connect_exception
  	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:956) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

leading to:

`cluster{::multiNode}` failed to wait for cluster health yellow after 40 SECONDS

testMultiNode.zip
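
For context, the "failed to wait for cluster health yellow after 40 SECONDS" message is the kind of error a poll-until-deadline check produces. Below is a minimal Python sketch of that pattern, not the testclusters plugin's actual implementation (the plugin is Java/Gradle code). The names `wait_for` and `cluster_is_at_least_yellow` are hypothetical. The only Elasticsearch API assumed is `GET /_cluster/health`, whose JSON response contains a `status` field. Connection errors are treated as "not ready yet", which is why a node that never becomes reachable surfaces as a timeout rather than as an immediate failure.

```python
import json
import time
import urllib.request
from typing import Callable


def wait_for(condition: Callable[[], bool],
             timeout_s: float,
             interval_s: float = 0.5,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout_s` has elapsed.

    `clock` and `sleep` are injectable so the helper can be tested
    without real waiting (hypothetical design choice for this sketch).
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if condition():
                return True
        except OSError:
            # Connection refused, URL errors, etc.: the node is not up yet.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


def cluster_is_at_least_yellow(base_url: str) -> bool:
    """Return True if the cluster reports yellow or green health.

    `base_url` is an assumption, e.g. "http://127.0.0.1:9200".
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        status = json.load(resp)["status"]
    return status in ("yellow", "green")
```

A caller would combine the two as `wait_for(lambda: cluster_is_at_least_yellow("http://127.0.0.1:9200"), timeout_s=40)`, where the address is a placeholder. A slow worker only needs to exceed the deadline once for the whole test to fail, which fits the intermittent nature of the failures reported here.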

@henningandersen henningandersen added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Apr 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Apr 16, 2019
@alpar-t
Contributor

alpar-t commented Apr 18, 2019

I tried to reproduce this both locally and on an identical worker, but it doesn't reproduce.
There's not much to go on in the logs; I don't see why the nodes wouldn't be able to talk to each other. I will try some more to reproduce it with the entire suite.

@henningandersen
Contributor Author

This test failed again on my PR build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/12570/testReport/org.elasticsearch.gradle.testclusters/TestClustersPluginIT/testMultiProject/

Looks like a similar but not identical error:

> `cluster{:bravo:myTestCluster}` failed to wait for cluster health yellow after 40 SECONDS

Will leave test unmuted for now and retry PR.

@alpar-t alpar-t self-assigned this Apr 24, 2019
@henningandersen
Contributor Author

henningandersen commented Apr 25, 2019

@henningandersen
Contributor Author

Turns out the last failures were for testMultiProject. testMultiNode is still muted.

Muted testMultiProject in 40216b4

akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this issue May 2, 2019
@droberts195
Contributor

droberts195 commented May 8, 2019

Both the affected tests are failing again:

In both cases the error is:

> `cluster{:alpha:myTestCluster}` failed to wait for cluster health yellow after 40 SECONDS

I guess some CI workers are slower than others, so startup could take longer than expected on a slow one. In case it's useful, the 13551 failure was on elasticsearch-ci-immutable-debian-8-1557309753632904860 and the 13545 failure was on elasticsearch-ci-immutable-debian-9-1557304813981342182.

@droberts195 droberts195 reopened this May 8, 2019
@alpar-t
Contributor

alpar-t commented May 8, 2019

@droberts195 can you merge master in and try again, please? I added additional logging to see what the issue is.

@droberts195
Contributor

My PR is merged now, as I managed to get a green build after rerunning the tests - they got assigned to an Ubuntu worker on the run that succeeded. I see that both the workers that had problems for my PR have been replaced in Jenkins. So maybe there was something dodgy about those Debian workers.

@droberts195
Contributor

@atorok was fa98c5e the commit with the extra logging? If so, a failure of TestClustersPluginIT.testMultiNode that includes that commit is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/13576/testReport/

That build also ran on a Debian worker...

@alpar-t
Contributor

alpar-t commented May 9, 2019

That's the right commit, but I'm not seeing the additional logs I was expecting.

@alpar-t
Contributor

alpar-t commented May 9, 2019

muted in 54f7113b2e8

alpar-t added a commit that referenced this issue May 9, 2019
alpar-t added a commit that referenced this issue May 10, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
henningandersen pushed a commit that referenced this issue Jun 10, 2019
@henningandersen
Contributor Author

henningandersen commented Jun 10, 2019

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 8, 2019
- Moves the example project builds into their own project
   - switch to doing an assemble only rather than a full check
- Remove most of the testclusters IT. This is a central piece of test
  infrastructure that is now widely used, so we would surely catch any
  fallout in all the other tests.

Closes elastic#41256 elastic#41256
alpar-t added a commit that referenced this issue Sep 30, 2019
…#44056)

Remove heavy build-tool integ test. 
Add a unit test for the plugin builder plugin.

Closes #41256 #41256
alpar-t added a commit that referenced this issue Sep 30, 2019
…#44056)

Remove heavy build-tool integ test.
Add a unit test for the plugin builder plugin.

Closes #41256 #41256
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020
5 participants