
[ML] Datafeed assignment should cancel datafeeds with failed jobs #48934

Closed
droberts195 opened this issue Nov 11, 2019 · 4 comments · Fixed by #80665
Labels
>bug :ml Machine learning

Comments

@droberts195
Contributor

Currently if an ML job fails then we expect the datafeed to stop when it tries to send data to that job and receives an exception response. However, this may not happen for a while after the job has failed, as the datafeed only sends data to it periodically.

If the node that the datafeed is running on is removed from the cluster while the job is in the failed state then there is a problem: the datafeed node assignment will refuse to assign it while its job is failed. This leads to a datafeed that is in limbo, unable to be assigned to a new node and (due to #48931) unable to be force-stopped.

It would make more sense if the datafeed node assignment cancelled datafeeds whose jobs were in states where the datafeed would not work after reassignment, such as closing or failed.

There may be some subtlety here, as the same node assignment code is used to generate error messages for the start_datafeed endpoint. So we may need a way of telling the node assignment code whether it's being called for an initial assignment or a reassignment, and have it behave slightly differently in the two cases:

  1. For an initial assignment, return an error message if the datafeed's job is in an unacceptable state.
  2. For a reassignment, log an error and cancel the datafeed task if the datafeed's job is in an unacceptable state.
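A minimal sketch of that two-case decision, in Java. All names here are invented for illustration and are not the actual Elasticsearch classes; the point is only that the assignment code would need a flag distinguishing an initial assignment from a reassignment:

```java
// Hypothetical sketch; not the real Elasticsearch ML assignment code.
public class DatafeedAssignmentSketch {

    enum JobState { OPENING, OPENED, CLOSING, FAILED }

    enum Outcome { ASSIGN, RETURN_ERROR, LOG_AND_CANCEL }

    // States in which the datafeed could not usefully run against its job.
    static boolean unacceptable(JobState jobState) {
        return jobState == JobState.CLOSING || jobState == JobState.FAILED;
    }

    static Outcome decide(JobState jobState, boolean isReassignment) {
        if (unacceptable(jobState) == false) {
            return Outcome.ASSIGN;
        }
        // Case 1: initial start_datafeed call -> surface an error to the caller.
        // Case 2: reassignment -> log and cancel the orphaned datafeed task.
        return isReassignment ? Outcome.LOG_AND_CANCEL : Outcome.RETURN_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(decide(JobState.FAILED, false)); // RETURN_ERROR
        System.out.println(decide(JobState.FAILED, true));  // LOG_AND_CANCEL
        System.out.println(decide(JobState.OPENED, true));  // ASSIGN
    }
}
```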
@droberts195 droberts195 added >bug :ml Machine learning labels Nov 11, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor Author

We should probably also do this for closing jobs that need to relocate - see #30006 (comment)

@droberts195
Contributor Author

We should probably also do this for closing jobs that need to relocate

This was done in #55315

@droberts195
Contributor Author

droberts195 commented Nov 11, 2021

After looking at the code, it seems that the best way to do this without risking a flood of cluster state updates is to force-stop the datafeed whenever we set the job state to failed in the code that runs on assignment to a new node.
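A minimal sketch of that approach (all names invented, not the actual Elasticsearch classes): the handler that marks the job as failed during startup also issues a single force-stop for the job's datafeed, if one is started.

```java
import java.util.Optional;

// Hypothetical sketch; not the real Elasticsearch ML failure-handling code.
public class JobFailureHandlerSketch {

    interface DatafeedStopper {
        void forceStop(String datafeedId);
    }

    // Called from the code path that sets the job state to failed while the
    // job is starting up on a newly assigned node. Issues at most one
    // force-stop, so there is no risk of a flood of cluster state updates.
    // Returns true if a force-stop was issued.
    static boolean onJobFailedDuringStartup(Optional<String> startedDatafeedId,
                                            DatafeedStopper stopper) {
        if (startedDatafeedId.isPresent()) {
            stopper.forceStop(startedDatafeedId.get());
            return true;
        }
        return false; // no started datafeed, nothing to clean up
    }

    public static void main(String[] args) {
        DatafeedStopper logging = id -> System.out.println("force-stopping " + id);
        System.out.println(onJobFailedDuringStartup(Optional.of("datafeed-1"), logging)); // true
        System.out.println(onJobFailedDuringStartup(Optional.empty(), logging));          // false
    }
}
```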

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
The anomaly detection code contained an assumption dating back
to 2016 that if a job failed then its datafeed would notice and
stop itself. That works if the job fails on a node after it has
successfully started up. But it doesn't work if the job fails
during the startup sequence. If the job is being started for the
first time then the datafeed won't be running, so there's no
problem, but if the job fails when it's being reassigned to a
new node then it breaks down, because the datafeed is started
but not assigned to any node at that instant.

This PR addresses this by making the job force-stop its own
datafeed if it fails during its startup sequence and the datafeed
is started.

Fixes elastic#48934

Additionally, auditing of job failures during the startup
sequence is moved so that it happens for all failure scenarios
instead of just one.

Fixes elastic#80621
droberts195 added a commit that referenced this issue Nov 11, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
elasticsearchmachine pushed a commit that referenced this issue Nov 11, 2021
elasticsearchmachine pushed a commit that referenced this issue Nov 12, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665) (#80678)

* Removing 8.0+ test that won't work on 7.16