
[Metrics Alerts] Add No Data alert #64080

Closed
Zacqary opened this issue Apr 21, 2020 · 10 comments · Fixed by #64365
Labels
Feature:Alerting, Feature:Metrics UI (Metrics UI feature), Team:Infra Monitoring UI - DEPRECATED (DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services)

Comments

@Zacqary
Contributor

Zacqary commented Apr 21, 2020

Branched off discussion in #61825

Add an [ ] Alert me if there's no data checkbox to the alert creation form. If this is checked, the alert should go into a No Data state and fire an action to send a No Data message.

By default, this action should send a message on all of the same channels the user has configured for a regular alert. For example, if the user receives Slack messages and emails when the alert fires, the No Data notification should also be sent over Slack and email.

@Zacqary Zacqary added the Feature:Alerting, Feature:Metrics UI, and Team:Infra Monitoring UI - DEPRECATED labels Apr 21, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@Zacqary
Contributor Author

Zacqary commented Apr 21, 2020

Continuing discussion from the previous issue:

The question is when should we fire such an alert? I see two options:

  1. Let the user worry about it and make it a parameter to configure:
    [ ] Alert me if there's no data for X minutes.
  2. Give the user fewer parameters to worry about. Think of a good default value and use it. A hardcoded number obviously won't work, but something proportional to the check interval could do.

I'd go with option 2, and it should send the alert the first time it checks for data and can't find any. If the alert interval is set to X minutes, then X minutes is what the user considers a reasonable amount of time to wait for data, and any deviation from this should be reported.
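To make option 2 concrete, here is a rough sketch of a per-check evaluation that reports No Data on the first empty check. All names (AlertState, Condition, evaluate, alertOnNoData) are hypothetical and are not the alerting plugin's API. Returning OK when the checkbox is unchecked is also an assumption, since the thread doesn't specify that case.

```typescript
// Hypothetical sketch; none of these names come from the Kibana alerting plugin.
type AlertState = "OK" | "ALERT" | "NO DATA";

interface Condition {
  metric: string;
  threshold: number;
}

// `value` is null when the query over the last check interval returned nothing.
function evaluate(
  condition: Condition,
  value: number | null,
  alertOnNoData: boolean
): AlertState {
  if (value === null) {
    // The first empty check is reported right away; there is no separate
    // "for X minutes" parameter. Falling back to OK is an assumption.
    return alertOnNoData ? "NO DATA" : "OK";
  }
  return value > condition.threshold ? "ALERT" : "OK";
}

const load: Condition = { metric: "system.load.1", threshold: 1.0 };
console.log(evaluate(load, 2.3, true));
console.log(evaluate(load, null, true));
```

Because the check runs once per alert interval, the interval itself acts as the "no data for X minutes" window.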

@Zacqary
Contributor Author

Zacqary commented Apr 22, 2020

Okay so the Alerting plugin's UI doesn't make this very easy.

Issue as spec'd is to just add a checkbox to alert when there's no data. It's not that simple though, because we're now getting into the weeds of action messages.

Here's the UI for setting up an action:
[Screenshot: Screen Shot 2020-04-22 at 2 44 35 PM]

We can customize one message. The UI doesn't allow us to define a second type of message to send.

The only way to do this without changing the alerting plugin is the following:

  • Generate a standard message from within the alert type executor, and have it be either "This alert fired, its current value is x" or "This alert had no data".
  • Send it to the context
  • Only allow the user to customize the alert by adding information before or after the message
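The three steps above could look roughly like the sketch below. It is only an illustration: ActionContext, buildContext, and renderAction are made-up names, not the alerting plugin's API, and the real executor would pass far more through the context.

```typescript
// Hypothetical sketch; names are illustrative, not the alerting plugin's API.
interface ActionContext {
  message: string;
}

// Step 1 and 2: the executor picks the standard message and puts it in the context.
function buildContext(value: number | null): ActionContext {
  const message =
    value === null
      ? "This alert had no data"
      : `This alert fired, its current value is ${value}`;
  return { message };
}

// Step 3: the user may only add text before or after the generated message.
function renderAction(before: string, after: string, context: ActionContext): string {
  return [before, context.message, after]
    .filter((part) => part.length > 0)
    .join("\n");
}

console.log(renderAction("Heads up:", "", buildContext(null)));
console.log(renderAction("", "Check the runbook.", buildContext(2.3)));
```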

So basically our default action message now becomes:
[Screenshot: Screen Shot 2020-04-22 at 2 46 16 PM]

Which is weird.

But on the plus side, now we have complete product design control over what information is included in alert messages, and we don't have to rely on the user to include all of the important variables. Although they can still delete context.message in this case and truly shoot themselves in the foot. So that's the only reason this use case is weird: because the Alerting plugin assumes that every action is going to send a completely customized message, save for some template variables.

I think we need to rethink this assumption in order for this feature to merge without being awkward.

@Zacqary
Contributor Author

Zacqary commented Apr 22, 2020

Alternatively we could do something like this:

[Screenshot: Screen Shot 2020-04-22 at 2 59 02 PM]

I feel like this makes alerts needlessly advanced though.

@phillipb
Contributor

phillipb commented Apr 23, 2020

I don't think IS_EMPTY should be a checkbox. It feels like a totally separate alert. I should be able to create an alert that fires when storefront.checkouts has no data for the last 5 minutes. Also, that would actually solve the issues with the UX. Thoughts @sorantis?

@sorantis

{{context.metricOf.condition0}} has reported no data over the past {{context.interval}} seems like a good default message to send. Why do we need to also expose it to the user?

@Zacqary
Contributor Author

Zacqary commented Apr 23, 2020

Yeah I agree @sorantis, it would just require a much more substantial refactor of parts of the alerting plugin than I'd thought in order to achieve that design. I can proceed with that if we think it's worth it.

Should we still allow the user to customize the message that's sent when the alert fires? Or have that also be a message generated by the alert executor?

@Zacqary
Contributor Author

Zacqary commented Apr 23, 2020

After some consideration, I think we should set the default alert message to this:

{{alertName}} - {{context.group}} is in a state of {{context.alertState}}

Because {{context.reason}}

Which would produce something like

My Alert - elasticsearch-master-0 is in a state of ALERT

Because system.load.1 is greater than a threshold of 1.0 (current value 
is 2.3); system.mem.usage is greater than a threshold of 85 (current 
value is 90)

or

My Alert - elasticsearch-master-0 is in a state of NO DATA

Because system.load.1 has reported no data over the past 5 minutes

We could even use this system to send a message when the alert recovers:

My Alert - elasticsearch-master-0 is in a state of OK [Recovered]

Because system.load.1 is no longer greater than a threshold of 1.0 
(current value is 0.95); system.mem.usage is no longer greater than
a threshold of 85 (current value is 76)
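
As a quick check of how the proposed default template expands, here is a small sketch. The render and lookup helpers are a minimal stand-in for a Mustache-style renderer, written only for this example; they are hypothetical and are not how the alerting plugin renders templates. The variable names match the template proposed above.

```typescript
// Minimal stand-in for a Mustache-style renderer; hypothetical, for illustration only.
type Vars = { [key: string]: string | Vars };

// Resolve a dotted path such as "context.group"; unknown paths render as "".
function lookup(vars: Vars, path: string): string {
  let current: string | Vars | undefined = vars;
  for (const key of path.split(".")) {
    if (current === undefined || typeof current === "string") return "";
    current = current[key];
  }
  return typeof current === "string" ? current : "";
}

function render(template: string, vars: Vars): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (_match: string, path: string) => lookup(vars, path)
  );
}

const DEFAULT_MESSAGE =
  "{{alertName}} - {{context.group}} is in a state of {{context.alertState}}\n\n" +
  "Because {{context.reason}}";

const noData = render(DEFAULT_MESSAGE, {
  alertName: "My Alert",
  context: {
    group: "elasticsearch-master-0",
    alertState: "NO DATA",
    reason: "system.load.1 has reported no data over the past 5 minutes",
  },
});
console.log(noData);
```

The same template covers the ALERT, NO DATA, and recovered cases, because only context.alertState and context.reason change between them.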

This is something we could achieve without refactoring the alerting plugin, and it would reduce a lot of noise in the message box. Users wouldn't have to manually add all their conditions to the alert message, and they could just focus on providing additional context.

The only disadvantage is that this would be a breaking change, but I think we can do that since this feature's in beta?

@sorantis

sorantis commented Apr 23, 2020

I like it. I don't think it'll be a big issue to change the template. Perhaps even make it part of 7.7.1?
EDIT: scratch that. We should be ok to change it in the next minor.

@finviman

finviman commented Feb 10, 2023

Does anyone use this feature? I tested it in Kibana 8.4.2, but I didn't get any alerts when no metrics logs were reported for about 10 hours because of a network problem.
Do I need to add another action with "Run when No Data" to get an alert?
