
[ML] anomaly detection Job failed without reason audit message when model snapshot is too old #80621

Closed
wwang500 opened this issue Nov 10, 2021 · 3 comments · Fixed by #80665
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@wwang500

Steps to reproduce (easier to do this from a cloud env):

  • Deploy a 7.8.1 cluster and start a real-time anomaly detection job,
  • Then create an 8.0 cluster by restoring a snapshot of that 7.8.1 cluster,
  • After the 8.0 cluster is successfully created, log in and check the AD job status:

The AD job shows as failed in the UI, and no failure reason is given, either in the UI or in the job messages. (At 11:56 in the screenshot below.)

[Screenshot: Screen Shot 2021-11-10 at 9.50.03 AM]

After stopping the datafeed and force-closing the job (at 16:01 on the screenshot above), restarting the job surfaced the root cause of the failure, an overly old model snapshot:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
      }
    ],
    "type": "exception",
    "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
  },
  "status": 500
}
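
For reference, the stop/close/restart sequence described above maps onto the following requests (a sketch using this issue's job name; the datafeed ID datafeed-response_code_rates is an assumption, since the real ID is not shown here):

POST _ml/datafeeds/datafeed-response_code_rates/_stop?force=true
POST _ml/anomaly_detectors/response_code_rates/_close?force=true
POST _ml/anomaly_detectors/response_code_rates/_open

The two remediations suggested by the error message would then be either reverting to a newer model snapshot or resetting the job:

POST _ml/anomaly_detectors/response_code_rates/model_snapshots/<snapshot_id>/_revert
POST _ml/anomaly_detectors/response_code_rates/_reset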

Expected behaviour:
A clear failure message in the UI and in the job messages

@wwang500 wwang500 added >bug :ml Machine learning labels Nov 10, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Nov 10, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

droberts195 commented Nov 10, 2021

Some extra information about this:

  1. The 7.8.1 snapshot contained job model states dating back to 6.x, which is why the 8.0 job (correctly) refuses to load them.
  2. The job goes into the failed state at 11:56, but the datafeed remains started (see the stats sketch below). This highlights that [ML] Datafeed assignment should cancel datafeeds with failed jobs #48934 is still a problem, and the fact that it remains unfixed is why we see support cases where datafeeds are started while the corresponding jobs are not opened.

So we should add an audit message that records the failure when an open job goes into the failed state while being assigned to a new node. And we should also fix #48934 (which would have benefits beyond upgrade use cases).
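
A quick way to observe the inconsistent state from point 2 is to compare the job and datafeed stats (a sketch; the datafeed ID is again an assumption):

GET _ml/anomaly_detectors/response_code_rates/_stats
GET _ml/datafeeds/datafeed-response_code_rates/_stats

While the job is in this state, the first request reports "state": "failed" but the second still reports "state": "started".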

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (elastic#80665)
The anomaly detection code contained an assumption dating back
to 2016 that if a job failed then its datafeed would notice and
stop itself. That works if the job fails on a node after it has
successfully started up. But it doesn't work if the job fails
during the startup sequence. If the job is being started for the
first time then the datafeed won't be running, so there's no
problem, but if the job fails when it's being reassigned to a
new node then it breaks down, because the datafeed is started
but not assigned to any node at that instant.

This PR addresses this by making the job force-stop its own
datafeed if it fails during its startup sequence and the datafeed
is started.

Fixes elastic#48934

Additionally, auditing of job failures during the startup
sequence is moved so that it happens for all failure scenarios
instead of just one.

Fixes elastic#80621
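
In REST terms, the new behaviour is equivalent to the failing job issuing a force-stop for its own datafeed (a sketch of the equivalent request; internally the fix acts through node-level actions rather than the public API):

POST _ml/datafeeds/datafeed-response_code_rates/_stop?force=true
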
@droberts195
Contributor

This is actually one particular case of a broader problem. There are a number of reasons why a job can fail when it is assigned to a new node, for example:

  1. Failure to update mappings
  2. Failure to revert to current snapshot
  3. Failure to search for associated datafeed

None of these created an audit message, so the failures would have been invisible to an end user viewing the UI. #80665 should fix those 3 (which existed in 7.x) plus the one this issue was opened for (which is new in 8.0).
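
Until a cluster has the fix, any audit messages that do get written can be queried directly from the ML notifications index (a sketch; the concrete index name .ml-notifications-000002 is an assumption and varies by version):

GET .ml-notifications-000002/_search
{
  "query": {
    "term": { "job_id": "response_code_rates" }
  },
  "sort": [ { "timestamp": { "order": "desc" } } ]
}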

droberts195 added a commit that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665)
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (elastic#80665)
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (elastic#80665)
elasticsearchmachine pushed a commit that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665) (#80677)
elasticsearchmachine pushed a commit that referenced this issue Nov 12, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665) (#80678)

* Removing 8.0+ test that won't work on 7.16