[monitoring] Rewrite CPU usage rule to improve accuracy #159351

Merged


Contributor

@miltonhultgren miltonhultgren commented Jun 8, 2023

Fixes #116128

Summary

This PR changes how the CPU Usage Rule calculates the usage percentage for containerized clusters.

Based on the comment here, my understanding of the issue is that using a date_histogram to grab the values can cause problems: date_histogram rounds the time range and aligns the buckets towards the start rather than the end, so the last bucket can be incomplete. This is aggravated by the fact that we set the fixed duration of the histogram to the size of the lookback window.
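As a rough illustration of the alignment problem (a sketch, not the query used by the rule), the snippet below assumes that fixed-interval buckets are keyed by rounding timestamps down to a multiple of the interval since the Unix epoch. The helper name `fixed_interval_buckets` and the example timestamps are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone


def fixed_interval_buckets(start, end, interval):
    """Return (bucket_start, covered_fraction) pairs for the window [start, end).

    Assumption: bucket keys are rounded down to multiples of `interval`
    counted from the Unix epoch.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Round the window start down to the nearest bucket boundary.
    key = epoch + ((start - epoch) // interval) * interval
    buckets = []
    while key < end:
        # Portion of this bucket that actually overlaps the queried window.
        covered = min(end, key + interval) - max(start, key)
        buckets.append((key, covered / interval))
        key += interval
    return buckets


# A 5 minute lookback window that does not start on a 5 minute boundary.
end = datetime(2023, 6, 13, 11, 42, 30, tzinfo=timezone.utc)
lookback = timedelta(minutes=5)
for key, fraction in fixed_interval_buckets(end - lookback, end, lookback):
    print(key.isoformat(), f"{fraction:.0%} of the bucket falls inside the window")
```

With the interval equal to the lookback window, the window is split across two buckets unless it happens to start exactly on a boundary, so neither bucket covers the full lookback period.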

I took a slightly different path for the rewrite: rather than using the derivative, I just look at the usage across the whole range using a simple delta.

This has a glaring flaw in that it cannot account for the limits changing within the lookback window (going higher/lower or set/unset), which we will have to try to address in #160905. The changes in this PR should improve the situation in the other cases, and they make it clear when the limits have changed by firing alerts.
#160897 outlines follow up work to align how the CPU usage is presented in other places in the UI.
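One way to picture the "limits changed" check is to compare the smallest and largest quota value seen in the lookback window. This is only a sketch under assumptions: the function name `classify_limits`, the returned labels, and the convention that a quota of -1 means "no limit" are illustrative and not taken from the PR's code.

```python
def classify_limits(quota_samples_micros):
    """Classify the CFS quota samples (microseconds) seen in a lookback window.

    Assumption: a value of -1 (or any non-positive value) means no limit is set.
    """
    if not quota_samples_micros:
        return "no_data"
    lowest = min(quota_samples_micros)
    highest = max(quota_samples_micros)
    if lowest != highest:
        # Covers higher/lower changes as well as set/unset transitions.
        return "limits_changed"
    if highest <= 0:
        return "limits_missing"
    return "limits_stable"


print(classify_limits([100_000, 100_000, 100_000]))  # same quota throughout
print(classify_limits([-1, -1, 100_000]))            # limit was set mid-window
print(classify_limits([-1, -1]))                     # never had a limit
```

A single delta over the window cannot be normalized against two different quotas, which is why a change inside the window is reported instead of a usage value.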

Screenshots

Above threshold:
above-threshold

Failed to compute usage:
failed-to-compute

Limits changed:
limits-changed

Limits missing:
missing-resource-limits

Unexpected limits:
unexpected-resource-limits

CPU usage for the Completely Fair Scheduler (CFS) for Control Groups (cgroup)

CPU usage for containers is calculated with this formula:
execution_time / (time_quota_per_schedule_period * number_of_periods)

Execution time is a counter of how many cycles the scheduler allowed the container to execute for; the quota is the limit on how many cycles are allowed per period.

The number of periods is derived from the length of the period, which can also be changed; the default is 0.1 seconds.
At the end of each period, the available cycles are refilled to time_quota_per_schedule_period. With a longer period, you're likely to be throttled more often since you have to wait longer for a refresh, so once you've used your allowance for that period you're blocked. With a shorter period, you're refilled more often, so your total available usage is higher.
Both scenarios affect your percentage CPU usage, but the number of elapsed periods is a proxy for both of these cases. If you want to know about throttling rather than only CPU usage, you might want a separate rule for that stat. In short, 100% CPU usage means you're being throttled to some degree. The number of periods is a safe proxy for the details of period length, as the period length only affects the rate at which quota is refreshed.

These fields are counters, so for any given time range we need to grab the biggest value (the latest) and subtract the lowest value (the earliest) from it to get the delta. We then plug those delta values into the formula above to get the factor, and multiply by 100 to make it a percentage. The code also does some unit conversion because the quota is in microseconds while the usage is in nanoseconds.
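The calculation described above can be sketched as follows. This is a minimal illustration rather than the rule's actual implementation; the function name, parameter names and sample numbers are assumptions chosen for the example.

```python
def cgroup_cpu_usage_percent(usage_nanos_start, usage_nanos_end,
                             periods_start, periods_end,
                             quota_micros):
    """CPU usage percentage over a window, computed from cgroup CFS counters.

    usage_*   : execution time counter at the start/end of the window (nanoseconds)
    periods_* : elapsed periods counter at the start/end of the window
    quota_*   : time quota per schedule period (microseconds)
    """
    if quota_micros <= 0:
        raise ValueError("no CPU quota set, usage cannot be computed against a limit")
    usage_delta_nanos = usage_nanos_end - usage_nanos_start
    periods_delta = periods_end - periods_start
    if periods_delta <= 0:
        raise ValueError("no elapsed periods in the window")
    # The quota is in microseconds and the usage in nanoseconds,
    # so convert the quota to nanoseconds before dividing.
    allowed_nanos = quota_micros * 1_000 * periods_delta
    return usage_delta_nanos / allowed_nanos * 100


# 60 seconds at the default 0.1 s period is 600 periods. With a quota of
# 100000 microseconds per period and 30 s of execution time in the window:
print(cgroup_cpu_usage_percent(
    usage_nanos_start=5_000_000_000,
    usage_nanos_end=35_000_000_000,
    periods_start=1_000,
    periods_end=1_600,
    quota_micros=100_000,
))  # prints 50.0
```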

How to test

There are 3 main states to test:

  • No limit set but Kibana configured to use container stats.
  • Limit changed during lookback period (to/from real value, to/from no limit).
  • Limit set and CPU usage crossing the threshold and then falling back down to recovery.

Note: Please also test the non-container use case for this rule to ensure it wasn't broken during this refactor.

1. Start Elasticsearch in a container without setting the CPU limits:

docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT

(We're using master-SNAPSHOT to include a recent fix to reporting for cgroup v2)

Make note of the generated password for the elastic user.
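To double-check that the container really runs without a CPU limit, one option is to look at the `os.cgroup` section of the node stats API response (`GET _nodes/stats/os`). The sketch below only parses an already-fetched response; the helper name `cpu_quota_by_node` and the trimmed sample document are illustrative, and treating a non-positive `cfs_quota_micros` as "no limit" is an assumption.

```python
def cpu_quota_by_node(node_stats):
    """Map node id -> CFS quota in microseconds, or None when no limit is reported.

    `node_stats` is the parsed JSON body of a node stats response.
    Assumption: a missing or non-positive cfs_quota_micros means "no limit".
    """
    quotas = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        cpu = node.get("os", {}).get("cgroup", {}).get("cpu", {})
        quota = cpu.get("cfs_quota_micros")
        quotas[node_id] = quota if quota is not None and quota > 0 else None
    return quotas


# Trimmed-down sample shaped like a node stats response (values are made up).
sample = {
    "nodes": {
        "node-without-limit": {
            "os": {"cgroup": {"cpu": {"cfs_period_micros": 100000,
                                      "cfs_quota_micros": -1}}}
        },
        "node-with-limit": {
            "os": {"cgroup": {"cpu": {"cfs_period_micros": 100000,
                                      "cfs_quota_micros": 100000}}}
        },
    }
}
print(cpu_quota_by_node(sample))
```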

2. Start another Elasticsearch instance to act as the monitoring cluster

3. Configure Kibana to connect to the monitoring cluster and start it

4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it

Run the command below from the directory containing the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster.

docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .

Use the elastic password and the CA certificate to configure the elasticsearch module:

  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"

5. Configure an alert in Kibana with a chosen threshold

OBSERVE: An alert fires to inform you that there looks to be a misconfiguration, and reports the current value of the fallback metric (warning if the fallback metric is below the threshold, danger if it is above).

6. Set limit
First stop ES using docker stop es01, then set the limit using docker update --cpus=1 es01 and start it again using docker start es01.
After a brief delay you should now see the alert change to a warning about the limits having changed during the alert lookback period and stating that the CPU usage could not be confidently calculated.
Wait for the change event to pass out of the lookback window.
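For reference, `--cpus` is expressed in CPUs while the cgroup stats report a quota per period. Assuming the default period of 0.1 seconds (100000 microseconds) mentioned earlier, the relationship can be sketched as below; the helper name `cpus_to_cfs_quota_micros` is made up for this example.

```python
DEFAULT_CFS_PERIOD_MICROS = 100_000  # 0.1 seconds, the default period


def cpus_to_cfs_quota_micros(cpus, period_micros=DEFAULT_CFS_PERIOD_MICROS):
    """Quota per period (microseconds) that corresponds to a --cpus value.

    Assumption: the quota is simply cpus * period, with the default period
    unless another one is passed in.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(round(cpus * period_micros))


for cpus in (0.5, 1, 2):
    print(f"--cpus={cpus} -> quota of {cpus_to_cfs_quota_micros(cpus)} microseconds per period")
```

So after `docker update --cpus=1 es01`, the quota seen by the rule would be expected to move from "no limit" to 100000 microseconds per period, which is the kind of transition that triggers the limits-changed warning.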

7. Generate load on the monitored cluster

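Slingshot (https://github.com/elastic/slingshot) is an option. After you clone it, you need to update the package.json to match this change (https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46) before running npm install.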

Then you can modify the value for elasticsearch in the configs/hosts.json file like this:

"elasticsearch": {
  "node": "https://localhost:9201",
  "auth": {
    "username": "elastic",
    "password": "PASSWORD"
  },
  "ssl": {
    "ca": "PATH_TO_CERT/http_ca.crt",
    "rejectUnauthorized": false
  }
}

Then you can start one or more instances of Slingshot like this:
npx ts-node bin/slingshot load --config configs/hosts.json

8. Observe the alert firing in the logs
Assuming you're using a connector for server log output, you should see a message like the one below once the threshold is breached:

`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`

The alert should also be visible in the Stack Monitoring UI overview page.

9. Stop the load and confirm that the rule recovers

Stop Slingshot and confirm that the alert recovers once CPU usage goes back down below the threshold.

A second opinion

I made a little dashboard to replicate what the graph in Stack Monitoring and the rule should see:
cpu_usage_dashboard.ndjson.zip

If you want to play with the data, I've collected an es_archive which you can load like this:
node scripts/es_archiver load PATH_TO_ARCHIVE/containerized_cpu_load --es-url http://elastic:changeme@localhost:9200 --kibana-url http://elastic:changeme@localhost:5601/__UNSAFE_bypassBasePath
containerized_cpu_load.zip

These are the timestamps to view the data:
Start: Jun 13, 2023 @ 11:40:00.000
End: Jun 13, 2023 @ 12:40:00.000
CPU average: 52.76%

@miltonhultgren miltonhultgren marked this pull request as ready for review June 13, 2023 13:24
@miltonhultgren miltonhultgren requested a review from a team as a code owner June 13, 2023 13:24
@miltonhultgren miltonhultgren added the Team:Infra Monitoring UI - DEPRECATED and Feature:Stack Monitoring labels Jun 13, 2023
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren miltonhultgren changed the title from "[monitoring] Rewrite CPU usage rule to not use histogram (#116128)" to "[monitoring] Rewrite CPU usage rule to improve accuracy (#116128)" Jun 15, 2023
@miltonhultgren miltonhultgren marked this pull request as draft June 19, 2023 15:13
@miltonhultgren
Contributor Author

@klacabane and I did some code review over Zoom, putting this back into draft while I address that feedback.

@tonyghiani
Contributor

I was able to reproduce the environment and test the different behaviours, and they all work as expected. The detection of limit changes also felt pretty accurate. Great job 👏
Screenshot 2023-07-04 at 13 53 25
Screenshot 2023-07-04 at 12 58 21
Screenshot 2023-07-04 at 12 07 07

@mohamedhamed-ahmed
Contributor

Works great 👏 , I tried to simulate most use cases but for some reason I wasn't able to reproduce the change in the resource limits.

Screenshot 2023-07-04 at 14 27 01 Screenshot 2023-07-04 at 15 22 00 Screenshot 2023-07-04 at 15 22 45

@miltonhultgren
Contributor Author

@mohamedhamed-ahmed @tonyghiani Sometimes ES has an issue when you change the CPU limits without restarting the instance.
So you might need to do:

docker stop es01
docker update --cpus=1 es01
docker start es01

If you change the limits without a restart and ES is checking its stats, it'll break because the number of cores is no longer valid (see elastic/elasticsearch#97088, which I reported as part of this work).
That in turn causes Metricbeat to break and stop collecting data, so there won't be any data to trigger the detection of things changing.

At least I'm guessing this is what happened!

@mohamedhamed-ahmed
Contributor

> @mohamedhamed-ahmed @tonyghiani Sometimes ES has an issue when you change the CPU limits without restarting the instance. So you might need to do:
>
> docker stop es01
> docker update --cpus=1 es01
> docker start es01
>
> If you change without a restart, if ES is checking its stats it'll break because the number of cores is no longer valid (see elastic/elasticsearch#97088 which I reported as part of this work). Which will cause your Metricbeat to also break and not collect data anymore so there won't be any data to trigger the detection of things changing.
>
> At least I'm guessing this is what happened!

This did happen to me, and metricbeat wouldn't want to connect again. I had to create a new docker container.
But even then, I wasn't able to reproduce the resource changed case...for some reason I kept getting the message below

image

@tonyghiani
Contributor

@mohamedhamed-ahmed To reproduce the limits change, I configured the alert with the following parameters:

  • CPU usage > 50%
  • Look at the average over: 2-3 mins
  • Check every: 10 seconds

This should give you enough time to stop the container, change the --cpus config and restart the container.
For the --cpus config, I jumped from 0 to 1, 1 to 0.5 and finally 0.5 to 2 to try different cases, always stopping the container before changing the value.

@miltonhultgren
Contributor Author

@mohamedhamed-ahmed Make sure you also change the Kibana setting to go into the container mode!
monitoring.ui.container.elasticsearch.enabled: true

Contributor

@tonyghiani tonyghiani left a comment

LGTM, it works well and I see the applied changes, thanks 👏

Contributor

@mohamedhamed-ahmed mohamedhamed-ahmed left a comment

LGTM! Great Job 👏

@miltonhultgren miltonhultgren enabled auto-merge (squash) July 6, 2023 18:23
@miltonhultgren miltonhultgren merged commit bcb1649 into elastic:main Jul 6, 2023
@kibana-ci
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #34 / Machine Learning jobs update groups "before all" hook for "returns expected list of groups after update"

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 14 | 16 | +2 |
| securitySolution | 410 | 414 | +4 |
| total | | | +6 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 15 | 17 | +2 |
| securitySolution | 489 | 493 | +4 |
| total | | | +6 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine kibanamachine added the v8.10.0 and backport:skip labels Jul 6, 2023
tonyghiani added a commit that referenced this pull request Sep 28, 2023
## 📓 Summary

When retrieving the CPU stats for containerized (or non-container)
clusters, we were not considering a scenario where the user could run in
a cgroup but without limits set.
These changes rewrite the conditions that determine whether we allow
treating limitless containers as non-containerized, covering the case
where a user runs in a cgroup and for some reason hasn't set the limit.

## Testing

> Taken from #159351 since it
reproduced the same behaviours

There are 3 main states to test:
No limit set but Kibana configured to use container stats.
Limit changed during lookback period (to/from real value, to/from no
limit).
Limit set and CPU usage crossing threshold and then falling down to
recovery

**Note: Please also test the non-container use case for this rule to
ensure that didn't get broken during this refactor**

**1. Start Elasticsearch in a container without setting the CPU
limits:**
```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```

(We're using `master-SNAPSHOT` to include a recent fix to reporting for
cgroup v2)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring
cluster**

**3. Configure Kibana to connect to the monitoring cluster and start
it**

**4. Configure Metricbeat to collect metrics from the Docker cluster and
ship them to the monitoring cluster, then start it**

Execute the below command next to the Metricbeat binary to grab the CA
certificate from the Elasticsearch cluster.

```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```

Use the `elastic` password and the CA certificate to configure the
`elasticsearch` module:
```
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold**

OBSERVE: An alert fires to inform you that there looks to be a
misconfiguration, and reports the current value of the fallback metric
(warning if the fallback metric is below the threshold, danger if it is
above).

**6. Set limit**
First stop ES using `docker stop es01`, then set the limit using `docker
update --cpus=1 es01` and start it again using `docker start es01`.
After a brief delay you should now see the alert change to a warning
about the limits having changed during the alert lookback period and
stating that the CPU usage could not be confidently calculated.
Wait for change event to pass out of lookback window.

**7. Generate load on the monitored cluster**

[Slingshot](https://github.com/elastic/slingshot) is an option. After
you clone it, you need to update the `package.json` to match [this
change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46)
before running `npm install`.

Then you can modify the value for `elasticsearch` in the
`configs/hosts.json` file like this:
```
"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }
```

Then you can start one or more instances of Slingshot like this:
`npx ts-node bin/slingshot load --config configs/hosts.json`

**8. Observe the alert firing in the logs**
Assuming you're using a connector for server log output, you should see
a message like the one below once the threshold is breached:
```
`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`
```

The alert should also be visible in the Stack Monitoring UI overview
page.

**9. Stop the load and confirm that the rule recovers**

Stop Slingshot and confirm that the alert recovers once CPU usage goes
back down below the threshold.

---------

Co-authored-by: Marco Antonio Ghiani <[email protected]>
Co-authored-by: kibanamachine <[email protected]>
tonyghiani added a commit to tonyghiani/kibana that referenced this pull request Nov 7, 2023
(cherry picked from commit 833c075)
miltonhultgren added a commit to miltonhultgren/kibana that referenced this pull request Dec 8, 2023
miltonhultgren added a commit that referenced this pull request Dec 8, 2023
Reverts #159351
Reverts #167244

Due to the many unexpected issues that these changes introduced we've
decided to revert these changes until we have better solutions for the
problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire
#160905
- Any error fires an alert (since there is no other way to inform the
user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are
wrong

To address some of these issues we also need more functionality in the
alerting framework to be able to register secondary actions so that we
may trigger non-oncall workflows for when a rule faces issues with
evaluating the stats.

Original issue #116128
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 8, 2023
(cherry picked from commit 55bc6d5)
kibanamachine referenced this pull request Dec 8, 2023
# Backport

This will backport the following commits from `main` to `8.12`:
- [[monitoring] Revert CPU Usage rule changes
(#172913)](#172913)

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

Co-authored-by: Milton Hultgren <[email protected]>
Labels
backport:skip, Feature:Stack Monitoring, release_note:fix, Team:Infra Monitoring UI - DEPRECATED, v8.10.0
Development

Successfully merging this pull request may close these issues.

[Stack Monitoring] CPU Usage duration for alerting is not correctly used in the elastic query
7 participants