Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MlMappingsUpgradeIT.testMappingsUpgrade fails with testclusters #46262

Closed
alpar-t opened this issue Sep 3, 2019 · 5 comments · Fixed by #50175
Closed

MlMappingsUpgradeIT.testMappingsUpgrade fails with testclusters #46262

alpar-t opened this issue Sep 3, 2019 · 5 comments · Fixed by #50175
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments

@alpar-t
Copy link
Contributor

alpar-t commented Sep 3, 2019

I'm working in porting bwc tests to run with testclusters and can't get this one to pass:
https://gradle-enterprise.elastic.co/s/oqrlvw6evjqa4/tests/q3vpzt4orcdaq-vgix7kzieu2xs

I'm going to mute this since it's the only one that fails with this conversion and kindly ask someone from the team to take a look.

@alpar-t alpar-t added :ml Machine learning v8.0.0 labels Sep 3, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@jasontedor jasontedor added the >test-failure Triaged test failures from CI label Sep 3, 2019
@alpar-t
Copy link
Contributor Author

alpar-t commented Sep 3, 2019

The PR that implements the conversion is #46265

@droberts195
Copy link
Contributor

For the record, this test is muted in 7.5 through to master.

@droberts195
Copy link
Contributor

The problem is this:

[2019-12-12T15:40:11,596][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [v7.6.0-0] [controller/4715] [CDetachedProcessSpawner.cc@184] Child process with PID 4863 was terminated by signal 9
[2019-12-12T15:40:11,733][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [v7.6.0-0] Successfully set job state to [failed] for job [ml-mappings-upgrade-job]

The test clusters code is killing the ML jobs before it kills the ES JVM.

@droberts195
Copy link
Contributor

The code that causes the problem is

// Stop all children first, ES could actually be a child when there's some wrapper process like on Windows.
processHandle.children().forEach(each -> stopHandle(each, forcibly));

It needs to kill the ES JVM before the ML processes. Otherwise the ES JVM thinks the ML processes have crashed, records them as failed and then they don't start again automatically during the rolling upgrade.

@droberts195 droberts195 self-assigned this Dec 13, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Dec 13, 2019
The testclusters shutdown code was killing child processes
of the ES JVM before the ES JVM.  This causes any running
ML jobs to be recorded as failed, as the ES JVM notices that
they have disconnected from it without being told to stop,
as they would if they crashed.  In many test suites this
doesn't matter because the test cluster will never be
restarted, but in the case of upgrade tests it makes it
impossible to test what happens when an ML job is running
at the time of the upgrade.

This change reverses the order of killing the ES process
tree such that the parent processes are killed before their
children.  A list of children is stored before killing the
parent so that they can subsequently be killed (if they
don't exit by themselves as a side effect of the parent
dying).

Fixes elastic#46262
droberts195 added a commit that referenced this issue Dec 15, 2019
The testclusters shutdown code was killing child processes
of the ES JVM before the ES JVM.  This causes any running
ML jobs to be recorded as failed, as the ES JVM notices that
they have disconnected from it without being told to stop,
as they would if they crashed.  In many test suites this
doesn't matter because the test cluster will never be
restarted, but in the case of upgrade tests it makes it
impossible to test what happens when an ML job is running
at the time of the upgrade.

This change reverses the order of killing the ES process
tree such that the parent processes are killed before their
children.  A list of children is stored before killing the
parent so that they can subsequently be killed (if they
don't exit by themselves as a side effect of the parent
dying).

Fixes #46262
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
The testclusters shutdown code was killing child processes
of the ES JVM before the ES JVM.  This causes any running
ML jobs to be recorded as failed, as the ES JVM notices that
they have disconnected from it without being told to stop,
as they would if they crashed.  In many test suites this
doesn't matter because the test cluster will never be
restarted, but in the case of upgrade tests it makes it
impossible to test what happens when an ML job is running
at the time of the upgrade.

This change reverses the order of killing the ES process
tree such that the parent processes are killed before their
children.  A list of children is stored before killing the
parent so that they can subsequently be killed (if they
don't exit by themselves as a side effect of the parent
dying).

Fixes elastic#46262
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI v8.0.0-alpha1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants