[CI] Failure in ml.integration.RegressionIT.testStopAndRestart #47612

talevy · 2019-10-04T22:26:42Z

Something has triggered this test to fail 14 times in CI, starting on October 2.

Reproduction step:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart" -Dtests.seed=6C78F050FC5C2A2

Failed with stacktrace:

java.lang.AssertionError: 

Expected: <stopped>
     but: was <failed>

[2019-10-04T16:15:56,253][ERROR][o.e.x.m.i.RegressionIT   ] [testStopAndRestart] Failed to stop data frame analytics jobs; trying force
org.elasticsearch.ElasticsearchStatusException: cannot close data frame analytics [regression_stop_and_restart] because it failed, use force stop instead
	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:59) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.findAnalyticsToStop(TransportStopDataFrameAnalyticsAction.java:127) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$doExecute$1(TransportStopDataFrameAnalyticsAction.java:102) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$expandIds$4(TransportStopDataFrameAnalyticsAction.java:173) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

elasticmachine · 2019-10-04T22:27:52Z

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou · 2019-10-05T09:53:52Z

@talevy Could you please add links to CI and the build scans?

droberts195 · 2019-10-05T16:04:57Z

Here’s one example:

Jenkins: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/362/console
Scan: https://gradle-enterprise.elastic.co/s/bdbipsekpfijw

droberts195 · 2019-10-05T16:12:42Z

Looks like it’s due to:

18:45:24 » ERROR][o.e.x.m.p.AbstractNativeProcess] [integTest-1] [regression_stop_and_restart] analytics process stopped unexpectedly: Input error: expected no more than '232' rows but got '350' rows. Please report this problem.

dimitris-athanasiou · 2019-10-05T17:19:41Z

This is really unexpected. I'll look into how this could happen.

droberts195 · 2019-10-05T17:36:24Z

It just failed again, but on a different assertion and with no message about the C++ process exiting early (unless I missed it, which is possible as I was reading on a phone).

Details are:

dimitris-athanasiou · 2019-10-05T20:20:17Z

I'll mute it for now.

dimitris-athanasiou · 2019-10-15T21:12:03Z

I now understand why these failures can happen.

If the job is stopped right after reindexing has finished but before we refresh the destination index, the refresh never happens. The job is then started again right away and jumps straight into the analyzing state, while some of the data is still not searchable. This is why we start the process expecting X rows (where X is lower than the actual number of docs in the destination index) and end up receiving more than X.

The fix is actually quite simple. There is no reason to perform the refresh at that point in the code. We can instead do it in AnalyticsProcessManager before starting the process, which ensures the dest index is fully searchable before the job starts running. I'm preparing a PR for it.
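To make the race concrete, here is a minimal, self-contained Java sketch. It is a toy model, not the Elasticsearch code: `ToyIndex`, `reindexedButNotRefreshed` and `startProcess` are hypothetical names, and the extra refresh between counting and streaming the rows is an assumption standing in for a periodic background refresh. The 350 and 232 figures are borrowed from the error message quoted earlier in this thread.

```java
import java.util.ArrayList;
import java.util.List;

class RefreshRaceSketch {

    /** Toy stand-in for an index: documents become searchable only after refresh(). */
    static final class ToyIndex {
        private final List<Integer> docs = new ArrayList<>();
        private int searchable = 0;

        void index(int doc) { docs.add(doc); }
        void refresh() { searchable = docs.size(); }
        int searchableCount() { return searchable; }
    }

    /** Builds a destination index in which only the first searchableDocs docs are visible. */
    static ToyIndex reindexedButNotRefreshed(int totalDocs, int searchableDocs) {
        ToyIndex dest = new ToyIndex();
        for (int i = 0; i < totalDocs; i++) {
            dest.index(i);
            if (i + 1 == searchableDocs) {
                dest.refresh(); // the last refresh that happened before the job was stopped
            }
        }
        return dest;
    }

    /**
     * Starts the toy "process": counts the rows to announce, then streams them.
     * Returns {expectedRows, actualRows}.
     */
    static int[] startProcess(ToyIndex dest, boolean refreshFirst) {
        if (refreshFirst) {
            dest.refresh(); // the proposed fix: always refresh right before starting
        }
        int expectedRows = dest.searchableCount();
        dest.refresh(); // assumption: a background refresh lands before the rows are streamed
        int actualRows = dest.searchableCount();
        return new int[] {expectedRows, actualRows};
    }

    public static void main(String[] args) {
        int[] old = startProcess(reindexedButNotRefreshed(350, 232), false);
        int[] fixed = startProcess(reindexedButNotRefreshed(350, 232), true);
        System.out.println("without refresh: expected " + old[0] + ", got " + old[1]);
        System.out.println("with refresh:    expected " + fixed[0] + ", got " + fixed[1]);
    }
}
```

In this model the restart without a refresh announces 232 rows and then streams 350, while refreshing first makes both numbers 350. The point of doing the refresh at process start rather than at the end of reindexing is that it no longer matters where a stop request interrupted the previous run.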

If a job stops right after reindexing has finished but before we refresh the destination index, we never refresh it. If the job is then started again right away, it jumps straight into the analyzing state while the data is still not searchable. This is why tests were failing: we started the process expecting X rows (where X is lower than the actual number of docs) and ended up getting more than X. We fix this by moving the refresh of the dest index to right before we start the process, so the data is always searchable. Closes elastic#47612

DaveCTurner · 2019-10-25T10:10:16Z

This still seems to be failing, in master, 7.x and 7.5:

dimitris-athanasiou · 2019-10-25T10:42:06Z

This seems to be a different failure:

org.elasticsearch.ElasticsearchException: Failed to launch data frame analytics memory usage estimation process for job regression_stop_and_restart
Caused by: java.io.FileNotFoundException: /dev/shm/elastic+elasticsearch+7.5+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/corretto11/nodes/general-purpose/x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/integTest-1/tmp/data_frame_analyzer_regression_stop_and_restart_output_360144 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:219)
at java.io.FileInputStream.<init>(FileInputStream.java:157)
at java.io.FileInputStream.<init>(FileInputStream.java:112)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97)
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectStreams(ProcessPipes.java:141)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createNativeProcess(NativeMemoryUsageEstimationProcessFactory.java:100)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:67)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:34)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.runJob(MemoryUsageEstimationProcessManager.java:77)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.lambda$runJobAsync$0(MemoryUsageEstimationProcessManager.java:47)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)

@przemekwitek Could you take a look at this?

przemekwitek · 2019-10-25T11:01:51Z

> @przemekwitek Could you take a look at this?

Sure

dnhatn · 2019-11-18T17:09:00Z

Another instance https://gradle-enterprise.elastic.co/s/dakhdcri2celg/tests/zcquf3hc3eoda-q3ovxh6ju37ic

przemekwitek · 2019-11-19T09:32:40Z

> Another instance https://gradle-enterprise.elastic.co/s/dakhdcri2celg/tests/zcquf3hc3eoda-q3ovxh6ju37ic

This failure is likely unrelated as it affects outlier detection analysis, not regression analysis.

przemekwitek · 2020-01-08T08:23:03Z

A number of fixes were applied and the last CI failure of this test was a week ago.
https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:e58bf320-7efd-11e8-bf69-63c8ef516157,key:branch,negate:!t,params:(query:move-jobs,type:phrase),type:phrase,value:move-jobs),query:(match:(branch:(query:move-jobs,type:phrase))))),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22RegressionIT%20testStopAndRestart%22'),sort:!(time,desc))

Closing this issue for now.

talevy added :ml Machine learning >test-failure Triaged test failures from CI labels Oct 4, 2019

talevy added 7x v8.0.0 and removed 7x v8.0.0 labels Oct 4, 2019

dimitris-athanasiou self-assigned this Oct 5, 2019

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 5, 2019

[ML] Mute RegressionIT.testStopAndRestart

7582fc2

Relates elastic#47612

dimitris-athanasiou mentioned this issue Oct 5, 2019

[ML] Mute RegressionIT.testStopAndRestart #47624

Merged

dimitris-athanasiou added a commit that referenced this issue Oct 5, 2019

[ML] Mute RegressionIT.testStopAndRestart (#47624)

9945d5c

Relates #47612

dimitris-athanasiou mentioned this issue Oct 5, 2019

[7.x] [ML] Mute RegressionIT.testStopAndRestart #47625

Merged

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 5, 2019

[7.x][ML] Mute RegressionIT.testStopAndRestart (elastic#47624)

f3cb19d

Relates elastic#47612

dimitris-athanasiou added a commit that referenced this issue Oct 5, 2019

[7.x][ML] Mute RegressionIT.testStopAndRestart (#47624) (#47625)

ffacfc6

Relates #47612

dimitris-athanasiou mentioned this issue Oct 10, 2019

[CI] RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart fails periodically #47862

Closed

imotov added a commit that referenced this issue Oct 10, 2019

Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart

f966fa1

Tracked by #47612

imotov added a commit that referenced this issue Oct 10, 2019

Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart

17433e7

Tracked by #47612

imotov added a commit that referenced this issue Oct 10, 2019

Fix Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart

f4a2b52

Tracked by #47612

imotov added a commit that referenced this issue Oct 10, 2019

Fix Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart

b5afa95

Tracked by #47612

dimitris-athanasiou mentioned this issue Oct 15, 2019

[ML] Always refresh dest index before starting analytics process #48090

Merged

dimitris-athanasiou closed this as completed in #48090 Oct 17, 2019

dimitris-athanasiou mentioned this issue Oct 17, 2019

[7.x][ML] Always refresh dest index before starting analytics process… #48196

Merged

dimitris-athanasiou mentioned this issue Oct 17, 2019

[7.5][ML] Always refresh dest index before starting analytics process… #48197

Merged

DaveCTurner reopened this Oct 25, 2019

droberts195 mentioned this issue Oct 25, 2019

ML tests failing due to unfinished tasks #48512

Closed

przemekwitek self-assigned this Oct 30, 2019

przemekwitek mentioned this issue Oct 30, 2019

Remove duplicated code. elastic/ml-cpp#799

Merged

przemekwitek mentioned this issue Nov 14, 2019

Unmute testStopAndRestart test #48985

Merged

droberts195 changed the title ~~[CI] Failure in ml.integration.RegressionIT.testStopAndRestore~~ [CI] Failure in ml.integration.RegressionIT.testStopAndRestart Nov 28, 2019

dliappis mentioned this issue Dec 13, 2019

[CI] Timeout in ml.integration.RegressionIT.testStopAndRestart on Windows 2019 #50177

Closed

przemekwitek closed this as completed Jan 8, 2020

ChrisHegarty unassigned przemekwitek Oct 29, 2024
