Summary
We've had issues for a while with the CPU Usage rule causing false negatives. An attempt was made to correct this, but that did three things:
a) Introduced new error cases that generate noise
b) Highlighted areas where the data model makes it hard to evaluate the rule state
c) Showed that the domain is more complex than the rule currently accounts for (such as autoscaling, or non-container-based cgroup environments)
The formula itself is simple, but to be precise it relies on the limit being fixed, which isn't always the case. If we apply a single limit across two segments where the limit has actually changed, we're likely to either underreport or overreport the usage, which, depending on the range the rule is looking at, could be very bad.
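To make the skew concrete (made-up numbers): say the limit drops from 4 cores to 2 cores halfway through the window while the workload consumes a steady 2 cores. Per segment that's 50% and then 100%, roughly 75% on average, but applying only the latest limit (2 cores) to the whole window reports 100%, while applying only the old limit (4 cores) reports 50%.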
This means we need to be able to query for all the "limit segments" within the lookback window, calculate the average per segment, and then take the average across segments.
Perhaps even more sophisticated would be to have the app mark in the UI where these limits change.
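A rough sketch of the segment-aware averaging, reusing the made-up scenario above (the names and shapes are illustrative, not the actual rule executor code):

```ts
interface LimitSegment {
  from: number; // segment start (ms since epoch)
  to: number; // segment end (ms since epoch)
  limitCores: number; // CPU limit in effect during this segment
  usageCores: number; // average CPU cores consumed during this segment
}

// Average the usage percentage per limit segment, then average the segments,
// instead of applying a single limit across the whole lookback window.
function averageCpuUsagePct(segments: LimitSegment[]): number {
  if (segments.length === 0) {
    throw new Error('No limit segments found in the lookback window');
  }
  const perSegmentPct = segments.map((s) => (s.usageCores / s.limitCores) * 100);
  return perSegmentPct.reduce((sum, pct) => sum + pct, 0) / perSegmentPct.length;
}

// Limit drops from 4 to 2 cores halfway through a 10 minute window,
// while the workload keeps using 2 cores.
const segments: LimitSegment[] = [
  { from: 0, to: 300_000, limitCores: 4, usageCores: 2 }, // 50%
  { from: 300_000, to: 600_000, limitCores: 2, usageCores: 2 }, // 100%
];
console.log(averageCpuUsagePct(segments)); // 75, vs. 100 if only the latest limit were applied
```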
As for the noise: we made some assumptions about where cgroups are used, namely only in containerized environments (that's how the code is currently worded), but this isn't true. cgroups are widely used in other setups, so the rule needs to accommodate that. Ideally we would also be able to alert on both cgroup and non-cgroup based setups with the same rule, so that people can have mixed environments.
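One possible shape for handling mixed environments in a single rule (the field names here are assumptions for illustration, not the exact fields the rule queries today):

```ts
// Pick the CPU usage metric per host depending on whether cgroup metrics are
// present, instead of assuming that cgroups imply a containerized environment.
interface HostCpuDoc {
  cgroupUsagePct?: number; // derived from cgroup usage vs. cgroup limit, when a limit is set
  systemCpuPct?: number; // plain system-level CPU usage
}

function resolveCpuUsagePct(doc: HostCpuDoc): number | null {
  if (doc.cgroupUsagePct !== undefined) {
    return doc.cgroupUsagePct; // cgroup-based setup, containerized or not
  }
  if (doc.systemCpuPct !== undefined) {
    return doc.systemCpuPct; // non-cgroup setup
  }
  return null; // neither metric available: surface this instead of silently failing
}
```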
In addition to changing how we resolve the results for the query, we also need better tools for when the rule executor faces issues it cannot work around and the rule is left "broken". Ideally these actions would not trigger on the first failure but would kick in after a few repeated failures. Further, they need to be separate from the normal rule actions so that users can decide whether that is something that should ping an on-call SRE or not.
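A minimal sketch of the "kick in after a few repeated failures" behaviour, assuming the executor can persist a counter between runs (the threshold and state shape are hypothetical):

```ts
interface ExecutorState {
  consecutiveFailures: number;
}

const BROKEN_RULE_THRESHOLD = 3; // assumed: how many failed runs before notifying

// Returns the next state and whether the separate "rule broken" action should fire.
// A single transient failure is tolerated; repeated failures escalate through a
// channel that is distinct from the normal threshold-breach actions.
function trackExecutorFailure(
  state: ExecutorState,
  runSucceeded: boolean
): { state: ExecutorState; notifyBrokenRule: boolean } {
  if (runSucceeded) {
    return { state: { consecutiveFailures: 0 }, notifyBrokenRule: false };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  return {
    state: { consecutiveFailures },
    notifyBrokenRule: consecutiveFailures >= BROKEN_RULE_THRESHOLD,
  };
}
```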
Links
Attempted fixes:
#159351
#167244
Revert:
#172913
Original issues:
#116128
#160905
Internal issues:
https://github.com/elastic/sdh-beats/issues/4082
https://github.com/elastic/sdh-kibana/issues/4299
https://github.com/elastic/sdh-kibana/issues/4122
https://github.com/elastic/sdh-kibana/issues/4158
https://github.com/elastic/sdh-kibana/issues/3329
https://github.com/elastic/sdh-kibana/issues/2759
https://github.com/elastic/sdh-kibana/issues/2860
https://github.com/elastic/sdh-kibana/issues/3069
https://github.com/elastic/sdh-kibana/issues/3117
https://github.com/elastic/sdh-kibana/issues/3768
https://github.com/elastic/sdh-kibana/issues/2436
https://github.com/elastic/sdh-kibana/issues/4695
https://github.com/elastic/sdh-kibana/issues/4875