[Observability] Inventory rule doesn't alert on "Alert me if there's no data" #165421

Open
TheRiffRafi opened this issue Aug 31, 2023 · 12 comments
Labels
bug Fixes for quality problems that affect the customer experience Team:obs-ux-management Observability Management User Experience Team

Comments

@TheRiffRafi

Kibana version: 8.9.1

Elasticsearch version: 8.9.1

Describe the bug:

When configuring an inventory rule in Observability and setting the option "Alert me if there's no data" the rule doesn't generate an alert if there is no data.
Selecting to alert on "Status Change" or "Checks interval" has no effect on whether "no data" is reported or not.

Steps to reproduce:

  1. Get environment setup (Metricbeat with system module)
  2. Go to Observability - Alerts - Create rule - Inventory Type.
  3. Configure any threshold.
  4. Configure "Alert me if there's no data"
  5. Enable email notification.
  6. Observe alerting on set threshold value.
  7. Stop metricbeat.
  8. Observe that there is no alert on "no data".

Expected behavior:

A notification should be sent when no data is received.
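
To make the expected behavior concrete, here is a minimal Python sketch of how a single rule run might be classified. It is not Kibana's implementation: `evaluate_run`, `alert_on_no_data`, and the state names are hypothetical, and the sketch only illustrates that an empty result set should produce a "no data" alert when the option is enabled.

```python
from typing import Sequence


def evaluate_run(values: Sequence[float], threshold: float, alert_on_no_data: bool) -> str:
    """Classify one rule execution (hypothetical model, not Kibana code)."""
    if len(values) == 0:
        # No documents came back for the lookback window,
        # e.g. because Metricbeat was stopped.
        return "no_data" if alert_on_no_data else "ok"
    # Any value above the threshold fires the regular threshold alert.
    return "alert" if max(values) > threshold else "ok"
```

With the option enabled, `evaluate_run([], 80.0, True)` returns `"no_data"`, which is the state that should trigger the notification in step 8.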

Any additional context:
I tested the same steps in version 7.17.9 and the issue was not reproducible: alerting notifies on "no data" when Metricbeat is stopped. For 8.x, I've only tested 8.9.1 (latest) and 8.8.2.

@TheRiffRafi TheRiffRafi added the bug Fixes for quality problems that affect the customer experience label Aug 31, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Aug 31, 2023
@ppisljar ppisljar added Team:Observability Team label for Observability Team (for things that are handled across all of observability) and removed needs-team Issues missing a team label labels Sep 11, 2023
@elasticmachine
Contributor

Pinging @elastic/unified-observability (Team:Observability)

@cauemarcondes cauemarcondes added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Feb 27, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@mgiota mgiota added the Team:obs-ux-management Observability Management User Experience Team label Feb 27, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@mgiota
Contributor

mgiota commented Feb 27, 2024

@TheRiffRafi I am gonna take a look and come back with more answers.

@jasonrhodes
Member

For reference, the inventory rule communicates this option like this:

Screenshot 2024-02-27 at 3 32 47 PM

Whereas the metric threshold and custom threshold rules communicate their similar option differently:

Screenshot 2024-02-27 at 3 38 49 PM

Based on the language used, would it be okay for the Inventory Rule to trigger this alert only when there are zero documents returned overall (I'm not sure what the "or if the alert fails to query Elasticsearch" is meant to get at)? We probably want @vinaychandrasekhar to weigh in from the product perspective on this, and on whether we need to continue offering this option in the Inventory Rule in the first place.

@renangenova
Member

Thank you for the continuation of this bug - I've created a KB article for visibility: https://support.elastic.dev/knowledge/view/f7e0ba8d

@maryam-saeidi
Member

@jasonrhodes For the metric threshold, we have 2 no data settings, one for overall, and one for missing group. I think the one that you shared for the inventory rule is similar to the overall setting in the metric threshold one:

It is important to note that

  1. We removed the overall setting in the custom threshold UI to simplify the UI.
  2. If I remember correctly, in the metric and custom threshold rules the overall setting does not work if we add a group; in that case, the missing group setting is the one that applies.
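
The interaction between the two settings can be sketched roughly as follows. This is a simplified Python model based only on the description in this comment, not the actual rule code; the function and parameter names (`no_data_alerts`, `alert_on_no_data`, `alert_on_group_disappear`, `known_groups`) are hypothetical.

```python
from typing import Dict, List, Optional, Set


def no_data_alerts(
    results: Dict[str, List[float]],
    group_by: Optional[str],
    known_groups: Set[str],
    alert_on_no_data: bool,
    alert_on_group_disappear: bool,
) -> List[str]:
    """Return the groups that should get a "no data" alert (hypothetical model)."""
    if group_by is None:
        # Ungrouped rule: only the overall setting matters.
        has_data = any(len(v) > 0 for v in results.values())
        if not has_data and alert_on_no_data:
            return ["*"]  # "*" stands for the single, ungrouped alert
        return []
    # Grouped rule: the overall setting is ignored and
    # only the missing-group setting applies.
    if not alert_on_group_disappear:
        return []
    missing = {g for g in known_groups if not results.get(g)}
    return sorted(missing)
```

In this model, a grouped rule with only the overall setting enabled never reports "no data", which matches point 2 above.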

@jasonrhodes
Member

Makes sense, @maryam-saeidi, thanks for those explanations.

@jasonrhodes
Member

What's the level of effort involved in making this work roughly as expected for the inventory rule?

@vinaychandrasekhar we should talk about options re: this no data scenario for the inventory rule (and possibly for the other rules).

@maryam-saeidi
Member

@jasonrhodes I think this functionality is not the best way to solve the underlying issue (the availability of a service or its data), and we need to solve it at a different level (meaning the rule level). We previously had a discussion with ResponseOps about having similar functionality for all rules, not only the infra-related ones (inventory/metric threshold/custom threshold). Here is the outcome of the previous discussion. This will also cause an issue when we send notifications, as we don't have a separate recovery notification for each group of triggering alerts (alert/no data/warning).

My suggestion is to focus on introducing this functionality for all rules: when there is no data, the rule goes into a warning state, since nothing about the condition behind the alert is wrong; we simply don't have any data to draw that conclusion. This is relevant for any rule, not only infra-related ones, so we would then remove or deprecate the per-rule logic.
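
A rough Python sketch of that proposal, assuming the rule carries a status of its own that is separate from the alerts it raises. `RuleRunResult`, `run_rule`, and the status values are hypothetical names for illustration and are not an existing Kibana API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RuleRunResult:
    rule_status: str  # "ok" or "warning" (hypothetical values)
    alerts: List[str] = field(default_factory=list)
    message: str = ""


def run_rule(values: Sequence[float], condition: Callable[[float], bool]) -> RuleRunResult:
    """Evaluate any rule type; missing data is reported on the rule, not as an alert."""
    if not values:
        # With no data the condition is neither met nor unmet,
        # so no alert is raised and the rule itself is flagged.
        return RuleRunResult(rule_status="warning", message="no data for this execution")
    alerts = [f"value {v} met the condition" for v in values if condition(v)]
    return RuleRunResult(rule_status="ok", alerts=alerts)
```

Because the warning lives on the rule rather than on an alert, there is no extra "no data" alert group that would need its own recovery notification.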

Also, I can imagine 2 different teams being responsible for handling this issue:

  1. a team responsible for monitoring data ingestion and infrastructure.
  2. an app-level team that knows about the services and how to monitor them.

@jasonrhodes
Member

That sounds very reasonable as a general way forward, but I'd like to hear from product about how comfortable they are with just removing the checkbox from existing rules. If we can't remove it and can only deprecate it, I think we should fix it so that it at least "works" as best as it can in the current context.

@vinaychandrasekhar

I discussed this with @jasonrhodes. Do we know what the level of effort is for fixing this on the Inventory threshold rule? If it's small(ish), we should discuss and get this on our team backlog to fix. If it's a large effort, let's chat live.

I agree with Maryam above that the longer-term fix is to treat the need to "alert on no data" as its own use case: related to, but separate from, the day-to-day monitoring and alerting needs around thresholds and inventory monitoring. In addition, the (separate) alert will help SREs plan for and manage the lack of data with things like automated baselining, analytics, and visualizations, beyond "just" alerting.

@smith smith removed Team:Observability Team label for Observability Team (for things that are handled across all of observability) Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels Mar 8, 2024