[Stack Monitoring] Improve reliability of CPU Usage rule #172955

Open
miltonhultgren opened this issue Dec 8, 2023 · 1 comment

miltonhultgren commented Dec 8, 2023

Summary

We've had issues for a while with the CPU Usage rule causing false negatives. An attempt was made to correct this, but it did three things:
a) Introduced new error cases that generate noise
b) Highlighted areas where the data model makes it hard to evaluate the rule state
c) Showed that the domain is more complex than the rule currently accounts for (such as autoscaling, or non-container based cgroup environments)

The formula itself is simple, but to be precise it relies on the limit being fixed, which isn't always the case. If we use the same limit across two segments where the limit has changed, we're likely to either underreport or overreport the usage, which, depending on the range the rule is looking at, could be very bad.
This means we need to be able to query for all the "limit segments" within the lookback window, calculate the average per segment and then take the average of all segments.
Perhaps even more sophisticated would be to have the app mark out in the UI when these limits change.
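
To make the "limit segments" idea concrete, here is a minimal sketch of the per-segment averaging described above. Everything in it is an assumption for illustration: the `CpuSample` shape, its field names, and the helper functions are hypothetical and do not reflect the actual Stack Monitoring data model or query. It assumes samples are sorted by time and that each one carries cumulative CPU time and the limit in effect when it was taken.

```typescript
// Hypothetical sample shape; field names are illustrative, not the real monitoring schema.
interface CpuSample {
  timestampMs: number; // when the sample was taken
  usageNanos: number; // cumulative CPU time consumed, in nanoseconds
  limitCores: number; // CPU limit in effect at this sample, in cores
}

// Split time-sorted samples into consecutive runs that share the same limit.
function splitByLimit(samples: CpuSample[]): CpuSample[][] {
  const segments: CpuSample[][] = [];
  for (const sample of samples) {
    const current = segments[segments.length - 1];
    if (current && current[0].limitCores === sample.limitCores) {
      current.push(sample);
    } else {
      segments.push([sample]);
    }
  }
  return segments;
}

// Usage of one segment as a fraction of its limit, or undefined if it cannot be computed.
function segmentUsage(segment: CpuSample[]): number | undefined {
  if (segment.length < 2) return undefined;
  const first = segment[0];
  const last = segment[segment.length - 1];
  const elapsedNanos = (last.timestampMs - first.timestampMs) * 1e6;
  if (elapsedNanos <= 0 || first.limitCores <= 0) return undefined;
  return (last.usageNanos - first.usageNanos) / (elapsedNanos * first.limitCores);
}

// Average per segment, then the plain average of all segments that could be evaluated.
function averageUsageAcrossSegments(samples: CpuSample[]): number | undefined {
  const usages = splitByLimit(samples)
    .map(segmentUsage)
    .filter((u): u is number => u !== undefined);
  if (usages.length === 0) return undefined;
  return usages.reduce((sum, u) => sum + u, 0) / usages.length;
}
```

Note that this sketch drops the interval between the last sample of one segment and the first sample of the next, and weights every segment equally regardless of its duration. Whether a duration-weighted average would suit the rule better is an open design question.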

As for the noise, we made some assumptions about where cgroups are used, namely only in containerized environments (that's how the code is currently worded), but this isn't true: cgroups are widely used in other setups, so the rule needs to accommodate that. Ideally we would also be able to alert on both cgroup and non-cgroup based setups with the same rule, to allow people to have mixed environments.

In addition to changing how we resolve the results for the query, we also need better tools for when the rule executor faces issues it cannot work around and that leave the rule "broken". Ideally such actions would not trigger on the first failure but would kick in after a few repeated failures. Further, these actions need to be separate from the normal rule actions so that users can decide whether a broken rule is something that should ping an on-call SRE.
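
As a rough illustration of "kick in after a few repeated failures", the sketch below counts consecutive failed executions and signals a separate "rule is broken" action only when a threshold is reached. The `ExecutionHealth` type, the `recordExecution` function, the action names, and the default threshold of 3 are all hypothetical; none of them are part of the Kibana alerting framework.

```typescript
// Hypothetical health state kept between rule executions.
interface ExecutionHealth {
  consecutiveFailures: number;
}

// Kept distinct from the normal rule actions so it can be routed separately.
type HealthAction = 'none' | 'notify-rule-broken';

function recordExecution(
  state: ExecutionHealth,
  succeeded: boolean,
  failureThreshold: number = 3 // assumed default, purely illustrative
): { state: ExecutionHealth; action: HealthAction } {
  if (succeeded) {
    // Any successful execution resets the streak.
    return { state: { consecutiveFailures: 0 }, action: 'none' };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  // Signal exactly once, on the execution where the threshold is first reached.
  const action: HealthAction =
    consecutiveFailures === failureThreshold ? 'notify-rule-broken' : 'none';
  return { state: { consecutiveFailures }, action };
}
```

With the assumed threshold of 3, a single transient failure stays silent, the third failure in a row signals the separate action, and further failures in the same streak do not signal again.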

Links

Attempted fixes:
#159351
#167244

Revert:
#172913

Original issues:
#116128
#160905

Internal issues:
https://github.com/elastic/sdh-beats/issues/4082
https://github.com/elastic/sdh-kibana/issues/4299
https://github.com/elastic/sdh-kibana/issues/4122
https://github.com/elastic/sdh-kibana/issues/4158
https://github.com/elastic/sdh-kibana/issues/3329
https://github.com/elastic/sdh-kibana/issues/2759
https://github.com/elastic/sdh-kibana/issues/2860
https://github.com/elastic/sdh-kibana/issues/3069
https://github.com/elastic/sdh-kibana/issues/3117
https://github.com/elastic/sdh-kibana/issues/3768
https://github.com/elastic/sdh-kibana/issues/2436
https://github.com/elastic/sdh-kibana/issues/4695
https://github.com/elastic/sdh-kibana/issues/4875

@botelastic botelastic bot added the needs-team Issues missing a team label label Dec 8, 2023
@jsanz jsanz added the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Dec 12, 2023
@elasticmachine

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Dec 12, 2023
@smith smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Dec 12, 2023