[monitoring] Rewrite CPU usage rule to improve accuracy #159351

miltonhultgren · 2023-06-08T17:58:32Z

Summary

This PR changes how the CPU Usage Rule calculates the usage percentage for containerized clusters.

Based on the comment here, my understanding of the issue is that, because we were using a date_histogram to grab the values, we could run into problems with how date_histogram rounds the time range and aligns it towards the start rather than the end, which can leave the last bucket incomplete. This is aggravated by the fact that we make the fixed duration of the histogram the size of the lookback window.
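To illustrate the alignment problem, here is a minimal sketch of how fixed-size buckets that are rounded down to a multiple of the interval split a lookback window. It is a simplified model (UTC, no offset), not the actual Elasticsearch implementation, and `bucketsForRange` and `Bucket` are hypothetical names used only for this example.

```typescript
interface Bucket {
  start: number; // bucket start, epoch millis
  end: number; // bucket end (exclusive), epoch millis
  coveredMs: number; // how much of the bucket overlaps the queried range
}

// Assumption: bucket boundaries are multiples of the interval since the epoch.
function bucketsForRange(fromMs: number, toMs: number, intervalMs: number): Bucket[] {
  const buckets: Bucket[] = [];
  // Round the range start DOWN to the nearest interval boundary.
  let start = Math.floor(fromMs / intervalMs) * intervalMs;
  while (start < toMs) {
    const end = start + intervalMs;
    const coveredMs = Math.min(end, toMs) - Math.max(start, fromMs);
    buckets.push({ start, end, coveredMs });
    start = end;
  }
  return buckets;
}

const fiveMinutes = 5 * 60 * 1000;
const now = Date.UTC(2023, 5, 13, 12, 3, 0); // 12:03:00 UTC
// Lookback window of 5 minutes with a 5 minute bucket size.
const lookback = bucketsForRange(now - fiveMinutes, now, fiveMinutes);
for (const b of lookback) {
  console.log(new Date(b.start).toISOString(), `${b.coveredMs / 1000}s of data`);
}
```

In this example the 5 minute window lands in two buckets (11:55 to 12:00 and 12:00 to 12:05), neither of which is fully covered, so a per-bucket derivative is computed from partial data.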

I took a slightly different path for the rewrite: rather than using the derivative, I just look at the usage across the whole range using a simple delta.

This has a glaring flaw in that it cannot account for the limits changing within the lookback window (going higher/lower or set/unset), which we will have to try to address in #160905. The changes in this PR should make the situation better in the other cases and it makes clear when the limits have changed by firing alerts.
#160897 outlines follow up work to align how the CPU usage is presented in other places in the UI.
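As a rough sketch of the "limits changed" and "limits missing" cases (not the code in this PR), the quota values seen within the lookback window could be classified like this. The names `classifyLimits` and `LimitState` are hypothetical, and treating `null` or a negative quota as "no limit set" is an assumption based on the cgroup convention of reporting `-1` for an unlimited quota.

```typescript
type LimitState = "limits_missing" | "limits_changed" | "limits_stable";

// quotaSamplesMicros: every quota value observed in the lookback window.
function classifyLimits(quotaSamplesMicros: Array<number | null>): LimitState {
  // Assumption: null or a negative value both mean "no limit set".
  const normalized = quotaSamplesMicros.map((q) => (q === null || q < 0 ? null : q));
  const distinct = new Set(normalized);
  // More than one distinct value means the limit went higher/lower or set/unset.
  if (distinct.size > 1) return "limits_changed";
  // No samples at all, or only "no limit" samples.
  if (distinct.size === 0 || distinct.has(null)) return "limits_missing";
  return "limits_stable";
}

console.log(classifyLimits([-1, -1])); // no limit during the whole window
console.log(classifyLimits([-1, 100000])); // limit was set during the window
console.log(classifyLimits([100000, 100000])); // limit unchanged
```

Only the stable case gives a usage percentage that can be trusted; the other two are the situations where the rule fires an alert instead of reporting a number.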

Screenshots

Above threshold:

Failed to compute usage:

Limits changed:

Limits missing:

Unexpected limits:

CPU usage for the Completely Fair Scheduler (CFS) for Control Groups (cgroup)

CPU usage for containers is calculated with this formula:
execution_time / (time_quota_per_schedule_period * number_of_periods)

Execution time is a counter of how much CPU time the scheduler allowed the container to execute for; the quota is the limit on how much CPU time is allowed per period.

The number of periods is derived from the length of the period, which can also be changed, the default being 0.1 seconds.
At the end of each period, the available CPU time is refilled to time_quota_per_schedule_period. With a longer period, you're likely to be throttled more often since you'll have to wait longer for a refresh, so once you've used your allowance for that period you're blocked. With a shorter period you're getting refilled more often so your total available usage is higher.
Both scenarios have an effect on your percentage CPU usage but the number of elapsed periods is a proxy for both of these cases. If you wanted to know about throttling compared to only CPU usage then you might want a separate rule for that stat. In short, 100% CPU usage means you're being throttled to some degree. The number of periods is a safe proxy for the details of period length as the period length will only affect the rate at which quota is refreshed.
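A small worked example of the quota and period arithmetic, as a sketch. It assumes the default period of 0.1 seconds mentioned above and that a Docker `--cpus` value maps to a quota of `cpus * period`; the function names are hypothetical.

```typescript
const DEFAULT_PERIOD_MICROS = 100000; // 0.1 seconds

// Assumption: `--cpus=N` corresponds to a quota of N * period per period.
function quotaMicrosForCpus(cpus: number, periodMicros: number = DEFAULT_PERIOD_MICROS): number {
  return Math.round(cpus * periodMicros);
}

// Total CPU time the container may use over a window: quota * number of periods.
function availableCpuMicros(
  quotaMicros: number,
  windowSeconds: number,
  periodMicros: number = DEFAULT_PERIOD_MICROS
): number {
  const periods = Math.floor((windowSeconds * 1000000) / periodMicros);
  return quotaMicros * periods;
}

const quota = quotaMicrosForCpus(0.5); // half a CPU
const available = availableCpuMicros(quota, 60); // over a one minute window
console.log(quota, available);
```

With half a CPU and a one minute window, the container gets 600 periods of 50 ms each, so 30 seconds of CPU time corresponds to 100% usage.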

These fields are counters, so for any given time range we need to grab the biggest value (the latest) and subtract from it the lowest value (the earliest) to get the delta. We then plug those delta values into the formula above to get the factor, and multiply by 100 to make that a percentage. The code also has some unit conversion because the quota is in microseconds while the usage is in nanoseconds.
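The calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in `fetch_cpu_usage_node_stats.ts`: the `CfsSample` shape and the `cpuUsagePercent` name are hypothetical, and the quota is assumed to be constant across the window.

```typescript
interface CfsSample {
  usageNanos: number; // counter: CPU time used, in nanoseconds
  elapsedPeriods: number; // counter: number of scheduler periods elapsed
  quotaMicros: number; // limit per period, in microseconds
}

// Returns the usage percentage, or undefined if it cannot be computed.
function cpuUsagePercent(earliest: CfsSample, latest: CfsSample): number | undefined {
  const usageDeltaNanos = latest.usageNanos - earliest.usageNanos;
  const periodsDelta = latest.elapsedPeriods - earliest.elapsedPeriods;
  // Unit conversion: quota is in microseconds, usage is in nanoseconds.
  const quotaNanos = latest.quotaMicros * 1000;
  if (usageDeltaNanos < 0 || periodsDelta <= 0 || quotaNanos <= 0) {
    return undefined; // counter reset, no elapsed periods, or no limit set
  }
  // execution_time / (time_quota_per_schedule_period * number_of_periods)
  return (usageDeltaNanos / (quotaNanos * periodsDelta)) * 100;
}

// 30 seconds of CPU time over 600 periods with a quota of 100000 microseconds.
const first: CfsSample = { usageNanos: 5e9, elapsedPeriods: 1000, quotaMicros: 100000 };
const last: CfsSample = { usageNanos: 35e9, elapsedPeriods: 1600, quotaMicros: 100000 };
console.log(cpuUsagePercent(first, last));
```

Returning `undefined` rather than a number for the degenerate cases keeps the "failed to compute usage" situation distinct from a genuine 0% reading.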

How to test

There are 3 main states to test:
No limit set but Kibana configured to use container stats.
Limit changed during lookback period (to/from real value, to/from no limit).
Limit set and CPU usage crossing threshold and then falling down to recovery

Note: Please also test the non-container use case for this rule to ensure that it didn't get broken during this refactor.

1. Start Elasticsearch in a container without setting the CPU limits:

docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT

(We're using master-SNAPSHOT to include a recent fix to reporting for cgroup v2)

Make note of the generated password for the elastic user.

2. Start another Elasticsearch instance to act as the monitoring cluster

3. Configure Kibana to connect to the monitoring cluster and start it

4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it

Execute the below command next to the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster.

docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .

Use the elastic password and the CA certificate to configure the elasticsearch module:

  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"

5. Configure an alert in Kibana with a chosen threshold

OBSERVE: An alert is fired to inform you that there looks to be a misconfiguration, together with the current value for the fallback metric (warning if the fallback metric is below the threshold, danger if it is above).

6. Set limit
First stop ES using docker stop es01, then set the limit using docker update --cpus=1 es01 and start it again using docker start es01.
After a brief delay you should now see the alert change to a warning about the limits having changed during the alert lookback period and stating that the CPU usage could not be confidently calculated.
Wait for change event to pass out of lookback window.

7. Generate load on the monitored cluster

Slingshot is an option. After you clone it, you need to update the package.json to match this change before running npm install.

Then you can modify the value for elasticsearch in the configs/hosts.json file like this:

"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }

Then you can start one or more instances of Slingshot like this:
npx ts-node bin/slingshot load --config configs/hosts.json

8. Observe the alert firing in the logs
Assuming you're using a connector for server log output, you should see a message like the one below once the threshold is breached:

`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`

The alert should also be visible in the Stack Monitoring UI overview page.

At this point you can stop Slingshot and confirm that the alert recovers once CPU usage goes back down below the threshold.

9. Stop the load and confirm that the rule recovers.

A second opinion

I made a little dashboard to replicate what the graph in SM and the rule should see:
cpu_usage_dashboard.ndjson.zip

If you want to play with the data, I've collected an es_archive which you can load like this:
node scripts/es_archiver load PATH_TO_ARCHIVE/containerized_cpu_load --es-url http://elastic:changeme@localhost:9200 --kibana-url http://elastic:changeme@localhost:5601/__UNSAFE_bypassBasePath
containerized_cpu_load.zip

These are the timestamps to view the data:
Start: Jun 13, 2023 @ 11:40:00.000
End: Jun 13, 2023 @ 12:40:00.000
CPU average: 52.76%


apmmachine · 2023-06-08T17:58:51Z

🤖 GitHub comments


Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)


elasticmachine · 2023-06-13T13:24:47Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)


miltonhultgren · 2023-06-19T15:13:56Z

@klacabane and I did some code review over Zoom, putting this back into draft while I address that feedback.


tonyghiani · 2023-07-04T12:30:04Z

I was able to reproduce the environment and test the different behaviours and they all work as expected, also when getting the limits change it felt pretty accurate, great job 👏

mohamedhamed-ahmed · 2023-07-04T14:29:17Z

Works great 👏 , I tried to simulate most use cases but for some reason I wasn't able to reproduce the change in the resource limits.

miltonhultgren · 2023-07-04T17:00:34Z

@mohamedhamed-ahmed @tonyghiani Sometimes ES has an issue when you change the CPU limits without restarting the instance.
So you might need to do:

docker stop es01
docker update --cpus=1 es01
docker start es01

If you change the limit without a restart while ES is checking its stats, it'll break because the number of cores is no longer valid (see elastic/elasticsearch#97088 which I reported as part of this work).
That will cause your Metricbeat to also break and stop collecting data, so there won't be any data to trigger the detection of things changing.

At least I'm guessing this is what happened!

mohamedhamed-ahmed · 2023-07-04T17:21:56Z


This did happen to me, and Metricbeat wouldn't connect again. I had to create a new Docker container.
But even then, I wasn't able to reproduce the resource changed case; for some reason I kept getting the message below

tonyghiani · 2023-07-05T06:30:54Z

@mohamedhamed-ahmed To reproduce the limits change, I configured the alert with the following parameters:

CPU usage > 50%
Look at the average over: 2-3 mins
Check every: 10 seconds
This should give you enough time to stop the container, change the --cpus config and restart the container.
For the --cpus config, I jumped from 0 to 1, 1 to 0.5 and finally 0.5 to 2 to experiment with different cases, always stopping the container before changing the value.


miltonhultgren · 2023-07-05T14:56:31Z

@mohamedhamed-ahmed Make sure you also change the Kibana setting to go into the container mode!
monitoring.ui.container.elasticsearch.enabled: true

tonyghiani

LGTM, it works well and I see the applied changes, thanks 👏

mohamedhamed-ahmed

LGTM! Great Job 👏

kibana-ci · 2023-07-06T19:14:55Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: e21ef38

Failed CI Steps

FTR Configs #34

Test Failures

[job] [logs] FTR Configs #34 / Machine Learning jobs update groups "before all" hook for "returns expected list of groups after update"

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`enterpriseSearch`	14	16	+2
`securitySolution`	410	414	+4
total			+6

Total ESLint disabled count

id	before	after	diff
`enterpriseSearch`	15	17	+2
`securitySolution`	489	493	+4
total			+6

History

💔 Build #140347 failed ed87318
💚 Build #139310 succeeded 0e71fc9
💔 Build #139247 failed ccb4998
💔 Build #139163 failed ddfcfdd
💔 Build #138586 failed 98bc2a7

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

## 📓 Summary

When retrieving the CPU stats for containerized (or non-container) clusters, we were not considering a scenario where the user could run in a cgroup but without limits set.
These changes re-write the conditions to determine whether we allow treating limitless containers as non-containerized, covering the case where a user runs in a cgroup and for some reason hasn't set the limit.

## Testing

> Taken from #159351 since it reproduced the same behaviours

There are 3 main states to test:
No limit set but Kibana configured to use container stats.
Limit changed during lookback period (to/from real value, to/from no limit).
Limit set and CPU usage crossing threshold and then falling down to recovery

**Note: Please also test the non-container use case for this rule to ensure that didn't get broken during this refactor**

**1. Start Elasticsearch in a container without setting the CPU limits:**
```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```

(We're using `master-SNAPSHOT` to include a recent fix to reporting for cgroup v2)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring cluster**

**3. Configure Kibana to connect to the monitoring cluster and start it**

**4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it**

Execute the below command next to the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster.

```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```

Use the `elastic` password and the CA certificate to configure the `elasticsearch` module:
```
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold**

OBSERVE: Alert gets fired to inform you that there looks to be a misconfiguration, together with reporting the current value for the fallback metric (warning if the fallback metric is below threshold, danger if it is above).

**6. Set limit**
First stop ES using `docker stop es01`, then set the limit using `docker update --cpus=1 es01` and start it again using `docker start es01`.
After a brief delay you should now see the alert change to a warning about the limits having changed during the alert lookback period and stating that the CPU usage could not be confidently calculated.
Wait for change event to pass out of lookback window.

**7. Generate load on the monitored cluster**

[Slingshot](https://github.com/elastic/slingshot) is an option. After you clone it, you need to update the `package.json` to match [this change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46) before running `npm install`.

Then you can modify the value for `elasticsearch` in the `configs/hosts.json` file like this:
```
"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }
```

Then you can start one or more instances of Slingshot like this:
`npx ts-node bin/slingshot load --config configs/hosts.json`

**8. Observe the alert firing in the logs**
Assuming you're using a connector for server log output, you should see a message like below once the threshold is breached:

```
`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`
```

The alert should also be visible in the Stack Monitoring UI overview page.

At this point you can stop Slingshot and confirm that the alert recovers once CPU usage goes back down below the threshold.

**9. Stop the load and confirm that the rule recovers.**

---------

Co-authored-by: Marco Antonio Ghiani <[email protected]>
Co-authored-by: kibanamachine <[email protected]>

(cherry picked from commit 833c075)


Reverts #159351
Reverts #167244

Due to the many unexpected issues that these changes introduced we've decided to revert these changes until we have better solutions for the problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire #160905
- Any error fires an alert (since there is no other way to inform the user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are wrong

To address some of these issues we also need more functionality in the alerting framework to be able to register secondary actions so that we may trigger non-oncall workflows for when a rule faces issues with evaluating the stats.

Original issue #116128

(cherry picked from commit 55bc6d5)

# Backport

This will backport the following commits from `main` to `8.12`:
- [[monitoring] Revert CPU Usage rule changes (#172913)](#172913)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Milton Hultgren <[email protected]>

[monitoring] Rewrite CPU usage rule to not use histogram (elastic#116128)

1057874

kibanamachine and others added 5 commits June 8, 2023 18:05

[CI] Auto-commit changed files from 'node scripts/lint_ts_projects --fix'

7b4768f

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

4ccea19

Add archive with test data

70df469

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

4a70273

Merge branch '116128-cpu-usage-rule-rewrite' of github.com:miltonhultgren/kibana into 116128-cpu-usage-rule-rewrite

374fa6d

miltonhultgren marked this pull request as ready for review June 13, 2023 13:24

miltonhultgren requested a review from a team as a code owner June 13, 2023 13:24

miltonhultgren added the Team:Infra Monitoring UI - DEPRECATED and Feature:Stack Monitoring labels Jun 13, 2023

miltonhultgren added 3 commits June 15, 2023 06:40

Remove reference to pipeline in archive

cb09ed3

Update unit tests

eecb4c6

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

dcbdb66

miltonhultgren added the release_note:fix label Jun 15, 2023

miltonhultgren changed the title ~~[monitoring] Rewrite CPU usage rule to not use histogram (#116128)~~ [monitoring] Rewrite CPU usage rule to improve accuracy (#116128) Jun 15, 2023

Merge branch 'main' into 116128-cpu-usage-rule-rewrite

a5e8490

miltonhultgren marked this pull request as draft June 19, 2023 15:13

miltonhultgren added 4 commits June 26, 2023 09:36

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

5645756

Detect when limits change during lookback period

ef2ac03

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

e4e1bc7

Improve error handling and messaging

98bc2a7

This was referenced Jun 29, 2023

[Stack Monitoring] CPU usage stats should follow same metrics path #160897

Open

[Stack Monitoring] CPU usage rule should handle usage limit changes #160905

Open

miltonhultgren added 4 commits June 29, 2023 16:58

Alert if limits are found when not expected

fd09d84

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

ddfcfdd

Fix tests

296efb6

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

ccb4998

tonyghiani reviewed Jul 3, 2023

x-pack/plugins/monitoring/server/lib/alerts/fetch_cpu_usage_node_stats.ts

tonyghiani reviewed Jul 3, 2023

x-pack/plugins/monitoring/server/lib/alerts/fetch_cpu_usage_node_stats.ts Outdated

tonyghiani reviewed Jul 3, 2023

x-pack/plugins/monitoring/server/lib/alerts/fetch_cpu_usage_node_stats.ts Outdated

mohamedhamed-ahmed reviewed Jul 3, 2023

x-pack/plugins/monitoring/server/alerts/cpu_usage_rule.ts

x-pack/plugins/monitoring/server/alerts/cpu_usage_rule.ts

x-pack/plugins/monitoring/server/lib/alerts/fetch_cpu_usage_node_stats.ts Outdated

Merge branch 'main' of github.com:elastic/kibana into 116128-cpu-usage-rule-rewrite

1b0e9e2

Fix some review comments

ed87318

tonyghiani approved these changes Jul 5, 2023


mohamedhamed-ahmed approved these changes Jul 5, 2023


miltonhultgren added 2 commits July 6, 2023 20:06

Update tests

06a9de9

Merge branch 'main' into 116128-cpu-usage-rule-rewrite

e21ef38

miltonhultgren enabled auto-merge (squash) July 6, 2023 18:23

miltonhultgren merged commit bcb1649 into elastic:main Jul 6, 2023

kibanamachine added the v8.10.0 and backport:skip labels Jul 6, 2023

tonyghiani mentioned this pull request Sep 27, 2023

[Stack Monitoring] Update flows for cpu stats fetching #167244

Merged

miltonhultgren added a commit to miltonhultgren/kibana that referenced this pull request Dec 8, 2023

Revert "[monitoring] Rewrite CPU usage rule to improve accuracy (elastic#159351)"

a6dbc76

This reverts commit bcb1649.

This was referenced Dec 8, 2023

[monitoring] Revert CPU Usage rule changes #172913

Merged

[Stack Monitoring] Improve reliability of CPU Usage rule #172955

Open
