-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Check for error messages in the Anomaly Detection jobs health rule type #108701
Conversation
Pinging @elastic/ml-ui (:ml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the text change, LGTM!
|
@darnautov As that part of the text is not edited in this PR, I cannot add a suggestion to the "There are errors in job messages" text that Sophie mentioned, so I leave some options here as a comment:
|
thanks for the feedback @sophiec20!
Do you suggest updating the rule type helper text and the health check description as well?
@droberts195 suggested notifying about errors only once and I think it makes sense. So during the initial check, we query for any existing error messages in specified jobs, and for consecutive executions applying a time range according to the previous execution time.
@szabosteve @lcawl do you have any suggestions?
There is a limitation of the Alerts and actions framework. Our alerting context contains a collection, i.e. for each health check we provide a set of results (array of objects) and it is not possible to describe such context variables. I created an enhancement request but I haven't got an estimation yet. The best we can do so far is:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ⚡
Rule type helper text. Please work with our docs team for suitable wording.
I think we need to think through this a little more. On the first invocation, it would not be ideal to search for any error since the beginning of time, because this could be last year for a very long running job. Or it could be from a time since before the job got reset as we do not clear out job messages. Perhaps it should only ever check since the prev execution time, and use the first invocation to set the execution time. |
latest_errors: Pick<estypes.SearchResponse<JobMessage>, 'hits'>; | ||
}>; | ||
|
||
const result = errors.buckets.map((bucket) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total nit, result
isn't needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 54cc87a
* Retrieve list of errors per job. | ||
* @param jobIds | ||
*/ | ||
async function getJobsErrors(jobIds: string[], earliestMs?: number): Promise<JobsErrorsResponse> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function could take the message level as a parameter, possibly defaulting to MESSAGE_LEVEL.ERROR
, to make it more reusable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I was thinking about it but not sure about the use case, i.e. if we ever want to retrieve warnings or info messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I can't see a use case at the moment, it would just make it potentially more reusable for no extra cost, especially if it had a default message level set to error.
If not i think the function should be renamed to getJobsErrorMessages
, to conform to the general naming convention in the file.
...(earliestMs ? [{ range: { timestamp: { gte: earliestMs } } }] : []), | ||
{ terms: { job_id: jobIds } }, | ||
{ | ||
term: { level: { value: 'error' } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the comment above about making message level a param isn't added, this should be MESSAGE_LEVEL.ERROR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 54cc87a
@darnautov I suggest the following alternative for the rule type helper text:
And then the link to the documentation as the screenshot above shows. |
💚 Build SucceededMetrics [docs]Async chunks
History
To update your PR or re-run it, just comment with: cc @darnautov |
…le type (elastic#108701) * [ML] retrieve job errors * [ML] account for previous execution time * [ML] update default message * [ML] update description * [ML] update unit tests * [ML] update unit tests * [ML] update action name * [ML] update errorMessages name * [ML] update a default message to avoid line breaks * [ML] update rule helper text * [ML] refactor getJobsErrors * [ML] perform errors check starting from the second execution
💚 Backport successful
This backport PR will be merged automatically after passing CI. |
…le type (#108701) (#108918) * [ML] retrieve job errors * [ML] account for previous execution time * [ML] update default message * [ML] update description * [ML] update unit tests * [ML] update unit tests * [ML] update action name * [ML] update errorMessages name * [ML] update a default message to avoid line breaks * [ML] update rule helper text * [ML] refactor getJobsErrors * [ML] perform errors check starting from the second execution Co-authored-by: Dima Arnautov <[email protected]>
Summary
Part of #101028
Adds a test for errors in the jobs messages for the Anomaly detection jobs health rule type.
Checklist