Summary
We've had issues for a while with the CPU Usage rule causing false negatives. An attempt was made to correct this, but that did three things:
a) Introduced new error cases that generate noise
b) Highlighted areas where the data model makes it hard to evaluate the rule state
c) Showed that the domain is more complex than the rule currently accounts for (such as autoscaling, or non-container-based cgroup environments)
The formula itself is simple, but to be precise it relies on the limit being fixed, which isn't always the case. If we apply a single limit across two segments where the limit has actually changed, we're likely to either underreport or overreport the usage, which, depending on the range the rule is looking at, could be very bad.
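To make the skew concrete (made-up numbers): say the limit drops from 4 cores to 2 cores halfway through the window while the workload consumes a steady 2 cores. Per segment that's 50% and then 100%, roughly 75% on average, but applying only the latest limit (2 cores) to the whole window reports 100%, while applying only the old limit (4 cores) reports 50%.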
This means we need to be able to query for all the "limit segments" within the lookback window, calculate the average per segment, and then take the average across segments.
Perhaps even more sophisticated would be to have the app mark in the UI where these limits change.
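A rough sketch of the segment-aware averaging, reusing the made-up scenario above (the names and shapes are illustrative, not the actual rule executor code):

```ts
interface LimitSegment {
  from: number; // segment start (ms since epoch)
  to: number; // segment end (ms since epoch)
  limitCores: number; // CPU limit in effect during this segment
  usageCores: number; // average CPU cores consumed during this segment
}

// Average the usage percentage per limit segment, then average the segments,
// instead of applying a single limit across the whole lookback window.
function averageCpuUsagePct(segments: LimitSegment[]): number {
  if (segments.length === 0) {
    throw new Error('No limit segments found in the lookback window');
  }
  const perSegmentPct = segments.map((s) => (s.usageCores / s.limitCores) * 100);
  return perSegmentPct.reduce((sum, pct) => sum + pct, 0) / perSegmentPct.length;
}

// Limit drops from 4 to 2 cores halfway through a 10 minute window,
// while the workload keeps using 2 cores.
const segments: LimitSegment[] = [
  { from: 0, to: 300_000, limitCores: 4, usageCores: 2 }, // 50%
  { from: 300_000, to: 600_000, limitCores: 2, usageCores: 2 }, // 100%
];
console.log(averageCpuUsagePct(segments)); // 75, vs. 100 if only the latest limit were applied
```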
As for the noise: we made some assumptions about where cgroups are used, namely only in containerized environments (that's how the code is currently worded), but this isn't true. cgroups are widely used in other setups, so the rule needs to accommodate that. Ideally we would also be able to alert on both cgroup and non-cgroup based setups with the same rule, so that people can have mixed environments.
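One possible shape for handling mixed environments in a single rule (the field names here are assumptions for illustration, not the exact fields the rule queries today):

```ts
// Pick the CPU usage metric per host depending on whether cgroup metrics are
// present, instead of assuming that cgroups imply a containerized environment.
interface HostCpuDoc {
  cgroupUsagePct?: number; // derived from cgroup usage vs. cgroup limit, when a limit is set
  systemCpuPct?: number; // plain system-level CPU usage
}

function resolveCpuUsagePct(doc: HostCpuDoc): number | null {
  if (doc.cgroupUsagePct !== undefined) {
    return doc.cgroupUsagePct; // cgroup-based setup, containerized or not
  }
  if (doc.systemCpuPct !== undefined) {
    return doc.systemCpuPct; // non-cgroup setup
  }
  return null; // neither metric available: surface this instead of silently failing
}
```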
In addition to changing how we resolve the results for the query, we also need better tools for when the rule executor faces issues it cannot work around and the rule is left "broken". Ideally these actions would not trigger on the first failure but would kick in after a few repeated failures. Further, they need to be separate from the normal rule actions so that users can decide whether that is something that should ping an on-call SRE or not.
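A minimal sketch of the "kick in after a few repeated failures" behaviour, assuming the executor can persist a counter between runs (the threshold and state shape are hypothetical):

```ts
interface ExecutorState {
  consecutiveFailures: number;
}

const BROKEN_RULE_THRESHOLD = 3; // assumed: how many failed runs before notifying

// Returns the next state and whether the separate "rule broken" action should fire.
// A single transient failure is tolerated; repeated failures escalate through a
// channel that is distinct from the normal threshold-breach actions.
function trackExecutorFailure(
  state: ExecutorState,
  runSucceeded: boolean
): { state: ExecutorState; notifyBrokenRule: boolean } {
  if (runSucceeded) {
    return { state: { consecutiveFailures: 0 }, notifyBrokenRule: false };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  return {
    state: { consecutiveFailures },
    notifyBrokenRule: consecutiveFailures >= BROKEN_RULE_THRESHOLD,
  };
}
```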
Links
Attempted fixes:
#159351
#167244
Revert:
#172913
Original issues:
#116128
#160905
Internal issues:
https://github.com/elastic/sdh-beats/issues/4082
https://github.com/elastic/sdh-kibana/issues/4299
https://github.com/elastic/sdh-kibana/issues/4122
https://github.com/elastic/sdh-kibana/issues/4158
https://github.com/elastic/sdh-kibana/issues/3329
https://github.com/elastic/sdh-kibana/issues/2759
https://github.com/elastic/sdh-kibana/issues/2860
https://github.com/elastic/sdh-kibana/issues/3069
https://github.com/elastic/sdh-kibana/issues/3117
https://github.com/elastic/sdh-kibana/issues/3768
https://github.com/elastic/sdh-kibana/issues/2436
https://github.com/elastic/sdh-kibana/issues/4695
https://github.com/elastic/sdh-kibana/issues/4875