[CI] Failure in ml.integration.RegressionIT.testStopAndRestart #47612

Closed
talevy opened this issue Oct 4, 2019 · 14 comments · Fixed by #48090
Labels: :ml (Machine learning), >test-failure (Triaged test failures from CI)

Comments

@talevy
Contributor

talevy commented Oct 4, 2019

Something has triggered this test to fail 14 times in CI, starting on October 2.

Reproduction step:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart" -Dtests.seed=6C78F050FC5C2A2

Failed with stacktrace:

java.lang.AssertionError: 

Expected: <stopped>
     but: was <failed>

[2019-10-04T16:15:56,253][ERROR][o.e.x.m.i.RegressionIT   ] [testStopAndRestart] Failed to stop data frame analytics jobs; trying force
org.elasticsearch.ElasticsearchStatusException: cannot close data frame analytics [regression_stop_and_restart] because it failed, use force stop instead
	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:59) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.findAnalyticsToStop(TransportStopDataFrameAnalyticsAction.java:127) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$doExecute$1(TransportStopDataFrameAnalyticsAction.java:102) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$expandIds$4(TransportStopDataFrameAnalyticsAction.java:173) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
@talevy talevy added :ml Machine learning >test-failure Triaged test failures from CI labels Oct 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@dimitris-athanasiou
Contributor

@talevy Could you please add links to CI and the build scans?

@droberts195
Contributor

Looks like it’s due to:

18:45:24 » ERROR][o.e.x.m.p.AbstractNativeProcess] [integTest-1] [regression_stop_and_restart] analytics process stopped unexpectedly: Input error: expected no more than '232' rows but got '350' rows. Please report this problem.

@dimitris-athanasiou
Contributor

This is really unexpected. I'll look into how this could happen.

@droberts195
Contributor

It just failed again, but on a different assertion and no message about the C++ process exiting early (unless I missed it - possible as I looked on a phone).

Details are:

@dimitris-athanasiou
Contributor

I'll mute it for now.

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 5, 2019
@dimitris-athanasiou
Contributor

I now understand why these failures can happen.

If the job is stopped right after reindexing finishes but before we refresh the destination index, the refresh never happens. When the job is started again right after, it jumps straight into the analyzing state while some of the data is still not searchable. That is why we start the process expecting X rows (where X is lower than the expected number of docs) and end up sending it more than X.

The fix is quite simple. There is no reason to perform the refresh at that point in the code. We can instead do it in AnalyticsProcessManager right before starting the process, which ensures the dest index is fully searchable before the job runs. I'm preparing a PR for it.
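
To make the race easier to picture, here is a minimal toy model. It is not Elasticsearch code: `ToyIndex`, `restartJob`, and `reindexedButPartlyRefreshed` are hypothetical names invented for this sketch, and the second `refresh()` call merely stands in for whatever refresh happens between counting the rows and streaming them. The 232 and 350 figures are borrowed from the log line quoted earlier, purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race; all names here are hypothetical, not real Elasticsearch APIs. */
final class RefreshRaceSketch {

    /** Stand-in for an index: docs only become searchable after refresh(). */
    static final class ToyIndex {
        private final List<String> searchable = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        void index(String doc) { pending.add(doc); }

        void refresh() {
            searchable.addAll(pending);
            pending.clear();
        }

        int count() { return searchable.size(); }

        List<String> search() { return new ArrayList<>(searchable); }
    }

    /**
     * Simulates restarting a job whose reindexing has already finished.
     * Returns {rows the process was told to expect, rows actually sent}.
     */
    static int[] restartJob(ToyIndex dest, boolean refreshBeforeStart) {
        if (refreshBeforeStart) {
            dest.refresh(); // the proposed fix: refresh right before starting the process
        }
        int expectedRows = dest.count();     // row count handed to the process at startup
        dest.refresh();                      // assumption: a refresh happens in the meantime
        int sentRows = dest.search().size(); // rows later streamed to the process
        return new int[] {expectedRows, sentRows};
    }

    /** Destination index where only part of the reindexed data is searchable yet. */
    static ToyIndex reindexedButPartlyRefreshed() {
        ToyIndex dest = new ToyIndex();
        for (int i = 0; i < 232; i++) dest.index("doc-" + i);
        dest.refresh();
        for (int i = 232; i < 350; i++) dest.index("doc-" + i);
        return dest;
    }

    public static void main(String[] args) {
        int[] buggy = restartJob(reindexedButPartlyRefreshed(), false);
        int[] fixed = restartJob(reindexedButPartlyRefreshed(), true);
        System.out.println("without refresh: expected " + buggy[0] + ", got " + buggy[1]);
        System.out.println("with refresh:    expected " + fixed[0] + ", got " + fixed[1]);
    }
}
```

In this model the run without the up-front refresh reports fewer expected rows than it later sends, while the run with it reports matching numbers, which is the property the fix relies on.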

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 15, 2019
If a job stops right after reindexing is finished but before
we refresh the destination index, we don't refresh at all.
If the job is started again right after, it jumps into the analyzing state.
However, the data is still not searchable.
This is why we were seeing test failures that we start the process
expecting X rows (where X is lower than the expected number of docs)
and we end up getting X+.

We fix this by moving the refresh of the dest index right before
we start the process so it always ensures the data is searchable.

Closes elastic#47612
dimitris-athanasiou added a commit that referenced this issue Oct 17, 2019
)

Closes #47612
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 17, 2019
…elastic#48090)

Closes elastic#47612

Backport of elastic#48090
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 17, 2019
…elastic#48090)

Closes elastic#47612

Backport of elastic#48090
dimitris-athanasiou added a commit that referenced this issue Oct 17, 2019
…#48090) (#48196)

Closes #47612

Backport of #48090
dimitris-athanasiou added a commit that referenced this issue Oct 17, 2019
…#48090) (#48197)

Closes #47612

Backport of #48090
@DaveCTurner DaveCTurner reopened this Oct 25, 2019
@dimitris-athanasiou
Contributor

dimitris-athanasiou commented Oct 25, 2019

This seems to be a different failure:

org.elasticsearch.ElasticsearchException: Failed to launch data frame analytics memory usage estimation process for job regression_stop_and_restart
Caused by: java.io.FileNotFoundException: /dev/shm/elastic+elasticsearch+7.5+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/corretto11/nodes/general-purpose/x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/integTest-1/tmp/data_frame_analyzer_regression_stop_and_restart_output_360144 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:219)
at java.io.FileInputStream.<init>(FileInputStream.java:157)
at java.io.FileInputStream.<init>(FileInputStream.java:112)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97)
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectStreams(ProcessPipes.java:141)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createNativeProcess(NativeMemoryUsageEstimationProcessFactory.java:100)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:67)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:34)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.runJob(MemoryUsageEstimationProcessManager.java:77)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.lambda$runJobAsync$0(MemoryUsageEstimationProcessManager.java:47)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)

@przemekwitek Could you take a look at this?

@przemekwitek
Contributor

> @przemekwitek Could you take a look at this?

Sure

@dnhatn
Member

dnhatn commented Nov 18, 2019

@przemekwitek
Contributor

> Another instance https://gradle-enterprise.elastic.co/s/dakhdcri2celg/tests/zcquf3hc3eoda-q3ovxh6ju37ic

This failure is likely unrelated as it affects outlier detection analysis, not regression analysis.

@droberts195 droberts195 changed the title [CI] Failure in ml.integration.RegressionIT.testStopAndRestore [CI] Failure in ml.integration.RegressionIT.testStopAndRestart Nov 28, 2019