MlMappingsUpgradeIT.testMappingsUpgrade fails with testclusters #46262

alpar-t · 2019-09-03T11:50:37Z

I'm working on porting the bwc tests to run with testclusters and can't get this one to pass:
https://gradle-enterprise.elastic.co/s/oqrlvw6evjqa4/tests/q3vpzt4orcdaq-vgix7kzieu2xs

I'm going to mute this test, since it's the only one that fails with this conversion, and kindly ask someone from the team to take a look.

elasticmachine · 2019-09-03T11:50:38Z

Pinging @elastic/ml-core

alpar-t · 2019-09-03T12:24:39Z

The PR that implements the conversion is #46265

droberts195 · 2019-12-12T14:51:37Z

For the record, this test is muted in 7.5 through to master.

droberts195 · 2019-12-12T16:21:03Z

The problem is this:

[2019-12-12T15:40:11,596][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [v7.6.0-0] [controller/4715] [CDetachedProcessSpawner.cc@184] Child process with PID 4863 was terminated by signal 9
[2019-12-12T15:40:11,733][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [v7.6.0-0] Successfully set job state to [failed] for job [ml-mappings-upgrade-job]

The test clusters code is killing the ML jobs before it kills the ES JVM.

droberts195 · 2019-12-12T16:56:48Z

The code that causes the problem is in elasticsearch/buildSrc/src/main/java/org/elasticsearch/gradle/testclusters/ElasticsearchNode.java, lines 840 to 841 in d442ff9:

    // Stop all children first, ES could actually be a child when there's some wrapper process like on Windows.
    processHandle.children().forEach(each -> stopHandle(each, forcibly));

The testclusters code needs to kill the ES JVM before the ML processes. Otherwise the ES JVM thinks the ML processes have crashed and records the jobs as failed, and then they don't start again automatically during the rolling upgrade.

The testclusters shutdown code was killing child processes of the ES JVM before the ES JVM. This causes any running ML jobs to be recorded as failed, as the ES JVM notices that they have disconnected from it without being told to stop, as they would if they crashed. In many test suites this doesn't matter because the test cluster will never be restarted, but in the case of upgrade tests it makes it impossible to test what happens when an ML job is running at the time of the upgrade. This change reverses the order of killing the ES process tree such that the parent processes are killed before their children. A list of children is stored before killing the parent so that they can subsequently be killed (if they don't exit by themselves as a side effect of the parent dying). Fixes elastic#46262
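
For illustration only, the parent-first ordering described above could look roughly like the sketch below. This is not the code merged in #50175: the class name `ParentFirstShutdown` and the 10 second wait are assumptions made for the example, and only the standard `java.lang.ProcessHandle` API is used.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.stream.Collectors;

// Hypothetical helper, not the actual ElasticsearchNode code.
final class ParentFirstShutdown {

    // Assumed wait time for a stopped process to exit; chosen for the example only.
    private static final long EXIT_TIMEOUT_SECONDS = 10;

    static void stopHandle(ProcessHandle handle, boolean forcibly) {
        if (!handle.isAlive()) {
            return;
        }
        // Snapshot the children BEFORE stopping the parent: once the parent
        // has exited, children() can no longer be used to find them.
        List<ProcessHandle> children = handle.children().collect(Collectors.toList());

        // Stop the parent first, so it never observes its children
        // disappearing while it is still running.
        if (forcibly) {
            handle.destroyForcibly();
        } else {
            handle.destroy();
        }
        try {
            handle.onExit().get(EXIT_TIMEOUT_SECONDS, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException | TimeoutException e) {
            // The parent did not exit in time; still try to stop the children.
        }

        // Stop any children that did not exit as a side effect of the parent dying.
        for (ProcessHandle child : children) {
            stopHandle(child, forcibly);
        }
    }
}
```

A caller would pass the handle of the top-level process, for example `ParentFirstShutdown.stopHandle(process.toHandle(), false)`. The recursion handles the wrapper-process case mentioned in the original comment, because each level is stopped before the level below it.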

alpar-t added :ml Machine learning v8.0.0 labels Sep 3, 2019

jasontedor added the >test-failure Triaged test failures from CI label Sep 3, 2019

droberts195 self-assigned this Dec 13, 2019

This was referenced Dec 13, 2019

Change process kill order for testclusters shutdown #50175

Merged

Make Docker images suitable for ESS/ECE #49926

Closed

droberts195 closed this as completed in #50175 Dec 15, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
