[ML] Check for error messages in the Anomaly Detection jobs health rule type #108701

Merged
darnautov merged 14 commits into elastic:master from ml-101028-errors on Aug 17, 2021

Conversation

darnautov
Contributor

@darnautov darnautov commented Aug 16, 2021

Summary

Part of #101028

Adds a check for errors in the job messages to the Anomaly detection jobs health rule type.


@darnautov darnautov added :ml Feature:Anomaly Detection ML anomaly detection release_note:feature Makes this part of the condensed release notes auto-backport Deprecated - use backport:version if exact versions are needed v7.15.0 Feature:Alerting/RuleTypes Issues related to specific Alerting Rules Types 8.0.0 labels Aug 16, 2021
@darnautov darnautov requested a review from a team as a code owner August 16, 2021 14:03
@darnautov darnautov self-assigned this Aug 16, 2021
@elasticmachine
Contributor

Pinging @elastic/ml-ui (:ml)

Contributor

@szabosteve szabosteve left a comment


Thanks for the text change, LGTM!

@sophiec20
Contributor

sophiec20 commented Aug 16, 2021

  1. Can we please add some UI helper text to explain that these operational alerts are best suited for your mission critical or important jobs. For example, the "datafeed is not started" alert is only useful if applied to a datafeed that is operationally critical (i.e. that is a real-time job for which you probably already have an alert running on the anomaly detection results).

  2. Re errors in job messages - The other alerts can all be solved. e.g. a datafeed can be started, and job memory amended. How are we expecting the job message errors to be resolved? Does it take the "Clear job messages" option into account? or is there a time frame over which to look back for errors in which case it will age out? -- the helper text should explain.

  3. "There are errors in the job messages" - this wording does not seem in keeping with the rest of the operational alerts.

  4. How do we tell which jobs are experiencing which problems? - Until an integrated alerting UI is available, we are relying on the alert action (e.g. email message) to describe which jobs are experiencing which problem. Therefore, we rely on easy (ish) access to this context info and well written documentation to describe how to do it. Is this part of this PR or will it be a follow up?

@szabosteve
Contributor

@darnautov As that part of the text is not edited in this PR, I cannot add a suggestion to the "There are errors in job messages" text that Sophie mentioned, so I leave some options here as a comment:

  • Errors in job messages (I'd prefer this one.)
  • Job messages contain errors

@darnautov
Contributor Author

thanks for the feedback @sophiec20!

Can we please add some UI helper text to explain that these operational alerts are best suited for your mission critical or important jobs. For example, the "datafeed is not started" alert is only useful if applied to a datafeed that is operationally critical (i.e. that is a real-time job for which you probably already have an alert running on the anomaly detection results).

Do you suggest updating the rule type helper text and the health check description as well?

Re errors in job messages - The other alerts can all be solved. e.g. a datafeed can be started, and job memory amended. How are we expecting the job message errors to be resolved? Does it take the "Clear job messages" option into account? or is there a time frame over which to look back for errors in which case it will age out? -- the helper text should explain.

@droberts195 suggested notifying about errors only once, and I think that makes sense. During the initial check we query for any existing error messages in the specified jobs; on subsequent executions we apply a time range that starts at the previous execution time.

"There are errors in the job messages" - this wording does not seem in keeping with the rest of the operational alerts.

@szabosteve @lcawl do you have any suggestions?

How do we tell which jobs are experiencing which problems? - Until an integrated alerting UI is available, we are relying on the alert action (e.g. email message) to describe which jobs are experiencing which problem. Therefore, we rely on easy (ish) access to this context info and well written documentation to describe how to do it. Is this part of this PR or will it be a follow up?

This is a limitation of the Alerts and actions framework. Our alerting context contains a collection, i.e. for each health check we provide a set of results (an array of objects), and it is currently not possible to describe context variables of that kind. I created an enhancement request but don't have an estimate yet. The best we can do so far is:

  • Provide a predefined default message that contains a mustache template with all possible fields. It's already in place.
  • Describe the alerting context in the documentation, similar to context.hits in the Elasticsearch query rule type
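
To make the collection-shaped context more concrete, here is a minimal sketch of how an array of per-job results could be flattened into a plain-text action message. The `HealthCheckResult` shape, its field names, and `formatResults` are hypothetical and only illustrate the idea; they are not the rule type's actual context variables.

```typescript
// Hypothetical shape of one health check result; field names are illustrative only.
interface HealthCheckResult {
  job_id: string;
  errors: Array<{ timestamp: number; message: string }>;
}

// Flatten a collection of results into a plain-text message body,
// similar in spirit to iterating over a list section in a mustache template.
function formatResults(results: HealthCheckResult[]): string {
  return results
    .map((r) => {
      const lines = r.errors.map(
        (e) => `  ${new Date(e.timestamp).toISOString()}: ${e.message}`
      );
      return [`Job ID: ${r.job_id}`, ...lines].join('\n');
    })
    .join('\n');
}

const sampleMessage = formatResults([
  { job_id: 'job-a', errors: [{ timestamp: 0, message: 'boom' }] },
]);
console.log(sampleMessage);
```

Each job gets a header line followed by one indented line per error, so the recipient of the action can tell which job has which problem.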

Contributor

@alvarezmelissa87 alvarezmelissa87 left a comment


LGTM ⚡

@sophiec20
Contributor

Do you suggest updating the rule type helper text and the health check description as well?

Rule type helper text. Please work with our docs team for suitable wording.

notifying about errors only once ... So during the initial check, we query for any existing error messages in specified jobs, and for consecutive executions applying a time range according to the previous execution time.

I think we need to think this through a little more. On the first invocation, it would not be ideal to search for any error since the beginning of time, because for a very long-running job that could be last year. Or the error could date from before the job was reset, as we do not clear out job messages. Perhaps it should only ever check since the previous execution time, and use the first invocation to set the execution time.
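
The proposal above (skip the error check on the first invocation, then look back only to the previous execution time) could be sketched roughly as follows. `getErrorCheckWindow` and `ErrorCheckWindow` are hypothetical names used for illustration, and how the rule executor actually obtains the previous execution time is an assumption here.

```typescript
// Hypothetical helper: names and shapes are illustrative, not taken from the PR.
interface ErrorCheckWindow {
  // Whether the rule should query job messages on this execution.
  shouldCheck: boolean;
  // Lower bound (epoch ms) for message timestamps, set when shouldCheck is true.
  earliestMs?: number;
}

/**
 * Decide the lookback window for the job message error check.
 * @param previousStartedAt start time of the previous rule execution, or null on the first run
 */
function getErrorCheckWindow(previousStartedAt: Date | null): ErrorCheckWindow {
  if (previousStartedAt === null) {
    // First invocation: skip the query so that old errors are never reported.
    return { shouldCheck: false };
  }
  // Later invocations: only consider messages since the previous execution.
  return { shouldCheck: true, earliestMs: previousStartedAt.getTime() };
}

const firstRun = getErrorCheckWindow(null);
const secondRun = getErrorCheckWindow(new Date('2021-08-17T14:00:00Z'));
console.log(firstRun, secondRun);
```

With this shape, an error is reported at most once, because each execution only sees messages newer than the previous one.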

@lukasolson lukasolson removed the 8.0.0 label Aug 17, 2021
latest_errors: Pick<estypes.SearchResponse<JobMessage>, 'hits'>;
}>;

const result = errors.buckets.map((bucket) => {
Member

total nit, result isn't needed

Contributor Author

Changed in 54cc87a

* Retrieve list of errors per job.
* @param jobIds
*/
async function getJobsErrors(jobIds: string[], earliestMs?: number): Promise<JobsErrorsResponse> {
Member

@jgowdyelastic jgowdyelastic Aug 17, 2021


This function could take the message level as a parameter, possibly defaulting to MESSAGE_LEVEL.ERROR, to make it more reusable.

Contributor Author

yeah, I was thinking about it but not sure about the use case, i.e. if we ever want to retrieve warnings or info messages

Member

yeah, I can't see a use case at the moment; it would just make it potentially more reusable at no extra cost, especially if the message level defaulted to error.
If not, I think the function should be renamed to getJobsErrorMessages, to conform to the general naming convention in the file.
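
A rough sketch of the suggestion, with the message level as an optional parameter that defaults to error. `buildJobMessagesFilter` is a hypothetical helper, and the `MESSAGE_LEVEL` object below is a local stand-in whose values are assumed for illustration; only the filter clauses mirror the snippet under review.

```typescript
// Local stand-in for the MESSAGE_LEVEL constant mentioned in the review;
// the values here are assumptions for illustration.
const MESSAGE_LEVEL = {
  ERROR: 'error',
  WARNING: 'warning',
  INFO: 'info',
} as const;

type MessageLevel = (typeof MESSAGE_LEVEL)[keyof typeof MESSAGE_LEVEL];

// Hypothetical builder for the filter clauses of a job messages search.
function buildJobMessagesFilter(
  jobIds: string[],
  earliestMs?: number,
  level: MessageLevel = MESSAGE_LEVEL.ERROR
): Array<Record<string, unknown>> {
  return [
    // Only add the time range when a lower bound is known.
    ...(earliestMs !== undefined ? [{ range: { timestamp: { gte: earliestMs } } }] : []),
    { terms: { job_id: jobIds } },
    { term: { level: { value: level } } },
  ];
}

// Default level: callers that only care about errors do not change.
const defaultFilter = buildJobMessagesFilter(['job-a', 'job-b'], 1629208800000);
// Explicit level: the same helper can fetch warnings if that is ever needed.
const warningFilter = buildJobMessagesFilter(['job-a'], undefined, MESSAGE_LEVEL.WARNING);
console.log(JSON.stringify(defaultFilter), JSON.stringify(warningFilter));
```

Note that this sketch tests `earliestMs !== undefined` rather than truthiness, so a lower bound of `0` would still produce a range clause; that is a deliberate deviation from the reviewed snippet.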

...(earliestMs ? [{ range: { timestamp: { gte: earliestMs } } }] : []),
{ terms: { job_id: jobIds } },
{
term: { level: { value: 'error' } },
Member

If the comment above about making message level a param isn't added, this should be MESSAGE_LEVEL.ERROR

Contributor Author

Changed in 54cc87a

@szabosteve
Contributor

szabosteve commented Aug 17, 2021

@darnautov I suggest the following alternative for the rule type helper text:

Alert when anomaly detection jobs experience operational issues. Enable suitable alerts for critically important jobs.

And then the link to the documentation as the screenshot above shows.

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
ml 6.0MB 6.0MB +146.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @darnautov

@darnautov darnautov merged commit f243b05 into elastic:master Aug 17, 2021
@darnautov darnautov deleted the ml-101028-errors branch August 17, 2021 14:21
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Aug 17, 2021
…le type (elastic#108701)

* [ML] retrieve job errors

* [ML] account for previous execution time

* [ML] update default message

* [ML] update description

* [ML] update unit tests

* [ML] update unit tests

* [ML] update action name

* [ML] update errorMessages name

* [ML] update a default message to avoid line breaks

* [ML] update rule helper text

* [ML] refactor getJobsErrors

* [ML] perform errors check starting from the second execution
@kibanamachine
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Aug 17, 2021
…le type (#108701) (#108918)

* [ML] retrieve job errors

* [ML] account for previous execution time

* [ML] update default message

* [ML] update description

* [ML] update unit tests

* [ML] update unit tests

* [ML] update action name

* [ML] update errorMessages name

* [ML] update a default message to avoid line breaks

* [ML] update rule helper text

* [ML] refactor getJobsErrors

* [ML] perform errors check starting from the second execution

Co-authored-by: Dima Arnautov <[email protected]>
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:Alerting/RuleTypes Issues related to specific Alerting Rules Types Feature:Anomaly Detection ML anomaly detection :ml release_note:feature Makes this part of the condensed release notes v7.15.0 v8.0.0

8 participants