
[ML] anomaly detection Job failed without reason audit message when model snapshot is too old #80621

Closed
wwang500 opened this issue Nov 10, 2021 · 3 comments · Fixed by #80665
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@wwang500

Steps to reproduce (easier to do this from a cloud env):

  • Deploy a 7.8.1 cluster and start a real-time anomaly detection job,
  • Then create an 8.0 cluster by restoring a snapshot of that 7.8.1 cluster,
  • After the 8.0 cluster is successfully created, log in and check the AD job status:

The AD job shows as failed in the UI, and no failure reason is given, either in the UI or in the job messages. (At 11:56 in the screenshot below.)

[Screenshot: Screen Shot 2021-11-10 at 9.50.03 AM]

After stopping the datafeed and force-closing the job (at 16:01 on the screenshot above), restarting the job surfaced the root cause of the failure, an overly old model snapshot:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
      }
    ],
    "type": "exception",
    "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
  },
  "status": 500
}
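
For reference, the stop/close/restart sequence described above maps onto the following requests (a sketch using this issue's job name; the datafeed ID datafeed-response_code_rates is an assumption, since the real ID is not shown here):

POST _ml/datafeeds/datafeed-response_code_rates/_stop?force=true
POST _ml/anomaly_detectors/response_code_rates/_close?force=true
POST _ml/anomaly_detectors/response_code_rates/_open

The two remediations suggested by the error message would then be either reverting to a newer model snapshot or resetting the job:

POST _ml/anomaly_detectors/response_code_rates/model_snapshots/<snapshot_id>/_revert
POST _ml/anomaly_detectors/response_code_rates/_reset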

Expected behaviour:
A clear failure message in the UI and in the job messages

@wwang500 wwang500 added >bug :ml Machine learning labels Nov 10, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Nov 10, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

droberts195 commented Nov 10, 2021

Some extra information about this:

  1. The 7.8.1 snapshot contained job model states dating back to 6.x, which is why the 8.0 job (correctly) refuses to load them.
  2. The job goes into the failed state at 11:56, but the datafeed remains started (see the stats sketch below). This highlights that [ML] Datafeed assignment should cancel datafeeds with failed jobs #48934 is still a problem, and the fact that it remains unfixed is why we see support cases where datafeeds are started while the corresponding jobs are not opened.

So we should add an audit message that records the failure when an open job goes into the failed state while being assigned to a new node. And we should also fix #48934 (which would have benefits beyond upgrade use cases).
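
A quick way to observe the inconsistent state from point 2 is to compare the job and datafeed stats (a sketch; the datafeed ID is again an assumption):

GET _ml/anomaly_detectors/response_code_rates/_stats
GET _ml/datafeeds/datafeed-response_code_rates/_stats

While the job is in this state, the first request reports "state": "failed" but the second still reports "state": "started".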

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (elastic#80665)
The anomaly detection code contained an assumption dating back
to 2016 that if a job failed then its datafeed would notice and
stop itself. That works if the job fails on a node after it has
successfully started up. But it doesn't work if the job fails
during the startup sequence. If the job is being started for the
first time then the datafeed won't be running, so there's no
problem, but if the job fails when it's being reassigned to a
new node then it breaks down, because the datafeed is started
but not assigned to any node at that instant.

This PR addresses this by making the job force-stop its own
datafeed if it fails during its startup sequence and the datafeed
is started.

Fixes elastic#48934

Additionally, auditing of job failures during the startup
sequence is moved so that it happens for all failure scenarios
instead of just one.

Fixes elastic#80621
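
In REST terms, the new behaviour is equivalent to the failing job issuing a force-stop for its own datafeed (a sketch of the equivalent request; internally the fix acts through node-level actions rather than the public API):

POST _ml/datafeeds/datafeed-response_code_rates/_stop?force=true
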
@droberts195
Contributor

This is actually one particular case of a broader problem. There are a number of reasons why a job can fail when it is assigned to a new node, for example:

  1. Failure to update mappings
  2. Failure to revert to current snapshot
  3. Failure to search for associated datafeed

None of these created an audit message, so the failures would have been invisible to an end user viewing the UI. #80665 should fix those 3 (which existed in 7.x) plus the one this issue was opened for (which is new in 8.0).
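
Until a cluster has the fix, any audit messages that do get written can be queried directly from the ML notifications index (a sketch; the concrete index name .ml-notifications-000002 is an assumption and varies by version):

GET .ml-notifications-000002/_search
{
  "query": {
    "term": { "job_id": "response_code_rates" }
  },
  "sort": [ { "timestamp": { "order": "desc" } } ]
}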

droberts195 added a commit that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665)
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (elastic#80665)
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (elastic#80665)
elasticsearchmachine pushed a commit that referenced this issue Nov 11, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665) (#80677)
elasticsearchmachine pushed a commit that referenced this issue Nov 12, 2021
[ML] Audit job open failures and stop any corresponding datafeed (#80665) (#80678)

* Removing 8.0+ test that won't work on 7.16