
TooManyJobsIT#testMultipleNodes fails: Had to resort to force-closing job #48511

Closed · DaveCTurner opened this issue Oct 25, 2019 · 2 comments · Fixed by #50142
Labels: :ml Machine learning · >test-failure Triaged test failures from CI

@DaveCTurner (Contributor) commented:

I see a couple of builds failed like this:

org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testMultipleNodes FAILED
    java.lang.RuntimeException: Had to resort to force-closing job, something went wrong?

        Caused by:
        java.util.concurrent.ExecutionException: org.elasticsearch.transport.RemoteTransportException: [node_t1][127.0.0.1:44709][cluster:admin/xpack/ml/job/close]

            Caused by:
            org.elasticsearch.transport.RemoteTransportException: [node_t1][127.0.0.1:44709][cluster:admin/xpack/ml/job/close]

                Caused by:
                java.lang.IllegalStateException: Timed out when waiting for persistent tasks after 30s
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testMultipleNodes" -Dtests.seed=62CDCB396D54BF7A -Dtests.security.manager=true -Dtests.locale=is -Dtests.timezone=Europe/Sofia -Dcompiler.java=12 -Druntime.java=11

Possibly this is #30300 again, or maybe this is just the CI machine running slowly?
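
For context, the "Had to resort to force-closing job" message comes from the ML integration test cleanup, which first asks jobs to close gracefully and only force-closes as a last resort. Below is a minimal sketch of that pattern in Java; the class, field, and method names (MlJobCleanupSketch, CLOSE_TIMEOUT, closeAllJobs, forceCloseAllJobs) are hypothetical stand-ins for illustration, not the actual Elasticsearch test code.

    // Hedged sketch, not the actual Elasticsearch test code: class, field and
    // method names here are assumptions for illustration only.
    import org.elasticsearch.common.unit.TimeValue;

    abstract class MlJobCleanupSketch {
        // Jobs get this long to close gracefully before the fallback kicks in.
        private static final TimeValue CLOSE_TIMEOUT = TimeValue.timeValueSeconds(30);

        // Hypothetical hooks standing in for the real close / force-close calls.
        abstract void closeAllJobs(TimeValue timeout) throws Exception;

        abstract void forceCloseAllJobs();

        void cleanUpJobs() {
            try {
                // Ask every open job to close gracefully within the timeout.
                closeAllJobs(CLOSE_TIMEOUT);
            } catch (Exception e) {
                // The graceful close timed out (here: "Timed out when waiting for
                // persistent tasks after 30s"), so force-close as a last resort and
                // fail the test, since a forced close can mask real problems.
                forceCloseAllJobs();
                throw new RuntimeException("Had to resort to force-closing job, something went wrong?", e);
            }
        }
    }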

DaveCTurner added the >test-failure Triaged test failures from CI and :ml Machine learning labels on Oct 25, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (:ml)

@droberts195 (Contributor) commented:

> or maybe this is just the CI machine running slowly?

Yes. We allow 30 seconds to close all the jobs. All the jobs did in fact close successfully, but it took 79 seconds in the first build and 48 seconds in the second. The log from the first failure shows:

[2019-08-09T02:02:48,163][INFO ][o.e.x.m.i.TooManyJobsIT  ] [testMultipleNodes] Closing jobs using [_all]
...
[2019-08-09T02:04:07,320][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node_t2] [max-number-of-jobs-limit-job-39] job closed

and the log from the second failure shows:

[2019-10-25T07:48:17,023][INFO ][o.e.x.m.i.TooManyJobsIT  ] [testMultipleNodes] Closing jobs using [_all]
...
[2019-10-25T07:49:05,701][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node_t3] [max-number-of-jobs-limit-job-37] job closed

This is probably due to running multiple test tasks in parallel. If this test happens to run at the same time as some other resource-intensive test suite, it takes too long.

I will bump the timeout for ML cleanup in the internal cluster tests up to 90 seconds.
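
A minimal sketch of that kind of bump, assuming the timeout is held in a TimeValue constant in the test base class (the field name is hypothetical; the actual change is in #50142):

    // Hypothetical field name; the real change is in PR #50142.
    // 30 seconds was not enough under parallel CI load, so allow 90.
    private static final TimeValue ML_CLEANUP_TIMEOUT = TimeValue.timeValueSeconds(90);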

droberts195 self-assigned this on Dec 12, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Dec 12, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020