Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Timeout in ml.integration.RegressionIT.testStopAndRestart on Windows 2019 #50177

Closed
dliappis opened this issue Dec 13, 2019 · 5 comments
Closed
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@dliappis
Copy link
Contributor

Example build scan: https://gradle-enterprise.elastic.co/s/ssxafsqxsb6sy

Happened on Windows 2019.

Reproduction line:

gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart" \
  -Dtests.seed=23440FBF46C6B468 \
  -Dtests.security.manager=true \
  -Dtests.locale=pt-PT \
  -Dtests.timezone=BST \
  -Dcompiler.java=13

Doesn't reproduce locally.

Given it happens only on Windows workers, is it just a matter of worker slowness hence requiring increase in suite timeout (currently it's 30m)? (AFAIK Windows workers aren't using a ram disk, hence expected to be slower).

Other issues related to the same test are tracked in #47612

@dliappis dliappis added >test-failure Triaged test failures from CI :ml Machine learning labels Dec 13, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@przemekwitek
Copy link
Contributor

przemekwitek commented Dec 13, 2019

This is the first time I'm seeing this exact issue related to testStopAndRestart test (the test which I'm working for some time already).
Here is the excerpt from the server logs downloaded from Jenkins:

3371 [2019-12-13T09:37:55,153][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-1] [regression_stop_and_restart] Starting data frame analytics^M
3372 [2019-12-13T09:37:55,807][INFO ][o.e.x.m.d.DataFrameAnalyticsManager] [integTest-1] [regression_stop_and_restart] Creating destination index [regression_stop_and_restart_source_index_results]^M
3373 [2019-12-13T09:37:56,382][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [integTest-1] [regression_stop_and_restart] Stopping task with force [false]^M
3374 [2019-12-13T09:37:56,779][WARN ][o.e.x.m.p.AbstractNativeProcess] [integTest-1] [regression_stop_and_restart] Failed to get PID of analytics process to kill^M
3375 [2019-12-13T09:37:56,785][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-1] [regression_stop_and_restart] Closing process^M
3376 [2019-12-13T09:37:56,880][FATAL][o.e.x.m.p.l.CppLogMessageHandler] [integTest-1] [regression_stop_and_restart] [data_frame_analyzer/13148] [Main.cc@49] Environment error: failed to open file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ml\qa\native-multi-node-tests\build\testclusters\integTest-1\tmp\analysis12988701410012102091.conf'.^M
3377 [2019-12-13T09:37:56,880][FATAL][o.e.x.m.p.l.CppLogMessageHandler] [integTest-1] [regression_stop_and_restart] [data_frame_analyzer/13148] [Main.cc@148] Failed to read config file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ml\qa\native-multi-node-tests\build\testclusters\integTest-1\tmp\analysis12988701410012102091.conf'^M
3378 [2019-12-13T09:37:56,883][ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-1] [regression_stop_and_restart] Error closing data frame analyzer process^M
3379 org.elasticsearch.ElasticsearchException: Environment error: failed to open file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ml\qa\native-multi-node-tests\build\testclusters\integTest-1\tmp\analysis12988701410012102091.conf'.
3380 Failed to read config file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ml\qa\native-multi-node-tests\build\testclusters\integTest-1\tmp\analysis12988701410012102091.conf'
3381 ^M
3382     at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:51) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]^M
3383     at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.close(AbstractNativeProcess.java:182) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]^M
3384     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.closeProcess(AnalyticsProcessManager.java:297) [x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]^M
3385     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.processData(AnalyticsProcessManager.java:177) [x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]^M
3386     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.lambda$runJob$1(AnalyticsProcessManager.java:119) [x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]^M
3387     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:629) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]^M
3388     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]^M
3389     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]^M
3390     at java.lang.Thread.run(Thread.java:834) [?:?]^M
3391 [2019-12-13T09:37:56,885][ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-1] [regression_stop_and_restart] Marking task failed; [regression_stop_and_restart] Error closing data frame analyzer process [Environment error: failed to open file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ml\qa\native-multi-node-tests\build\testclusters\integTest-1\tmp\analysis12988701410012102091.conf'.
3392 Failed to read config file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ml\qa\native-multi-node-tests\build\testclusters\integTest-1\tmp\analysis12988701410012102091.conf'
3393 ]^M
3394 [2019-12-13T09:37:56,938][INFO ][o.e.x.m.d.DataFrameAnalyticsTask] [integTest-1] [regression_stop_and_restart] Successfully update task state to [failed]^M

@droberts195
Copy link
Contributor

@przemekwitek to me that sequence of log messages suggests a race between start and stop.

  1. Job is started
  2. Job is stopped while the start sequence is still in progress
  3. Stop sequence tries to kill the process and fails because it hasn't been running long enough to report its PID to the JVM
  4. Stop sequence deletes the native process's config file
  5. Native process tries to read its config file and fails

It implies this logic is flawed:

// The PID comes via the processes log stream. We don't wait for it to arrive here,
// but if the wait times out it implies the process has only just started, in which
// case it should die very quickly when we close its input stream.
nativeController.killProcess(cppLogHandler.getPid(Duration.ZERO));

Anomaly detection jobs treat a closing of their input stream as meaning they've been asked to gracefully shut down. Not sure about data frame analytics jobs. But even for anomaly detection jobs there's a race where the process kill could be attempted in between startup and reading the config files, fail, and then the config files get deleted by the JVM before the native process tries to read them.

So I think it would be worth trying a wait of Duration.ofMillis(processConnectTimeout.getMillis()); instead of Duration.ZERO where processConnectTimeout comes from the MachineLearning.PROCESS_CONNECT_TIMEOUT dynamic setting.

Please give it a try and if it doesn't cause any easily reproducible problems locally that suggest it's a bad idea then add it to the code and we can see if it stops these related intermittent failures of the various "stop and restart" tests.

@przemekwitek
Copy link
Contributor

Please give it a try and if it doesn't cause any easily reproducible problems locally that suggest it's a bad idea then add it to the code and we can see if it stops these related intermittent failures of the various "stop and restart" tests.

Thanks for the analysis, @droberts195. I'll try it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

4 participants