
[Stack Monitoring] Update flows for cpu stats fetching #167244

Merged
16 commits merged into elastic:main on Sep 28, 2023

Conversation

tonyghiani
Contributor

@tonyghiani tonyghiani commented Sep 26, 2023

📓 Summary

When retrieving the CPU stats for containerized (or non-containerized) clusters, we were not considering the scenario where the user runs in a cgroup but without limits set.
These changes rewrite the conditions that determine whether we treat limitless containers as non-containerized, covering the case where a user runs in a cgroup but, for whatever reason, hasn't set a limit.
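
For illustration, here is a rough sketch of the revised decision. It is not the merged implementation; the field names are taken from the rule's aggregation as quoted in the review below, and the cgroup-calculation helper is hypothetical:

```
// A quota of -1 means the node runs in a cgroup but has no CPU limit set,
// so we treat it like a non-containerized node and fall back to OS-level CPU usage.
const limitsNotSet =
  node.quota_micros_max.value === -1 && node.quota_micros_min.value === -1;

const cpuUsage = limitsNotSet
  ? node.average_cpu_usage_percent.value ?? undefined
  : computeCgroupCpuUsage(node); // hypothetical helper for the cgroup-based calculation
```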

Testing

Taken from #159351 since it reproduced the same behaviours

There are 3 main states to test:
- No limit set but Kibana configured to use container stats.
- Limit changed during the lookback period (to/from a real value, to/from no limit).
- Limit set and CPU usage crossing the threshold and then falling back down to recovery.

Note: Please also test the non-container use case for this rule to ensure it didn't get broken during this refactor.

1. Start Elasticsearch in a container without setting the CPU limits:

docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT

(We're using master-SNAPSHOT to include a recent fix to reporting for cgroup v2)

Make note of the generated password for the elastic user.
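
As an optional sanity check (not part of the original steps), you can confirm that no CPU limit is applied to the container:

```
# NanoCpus is 0 when --cpus was not passed; with cgroup v2 the container reports "max".
docker inspect es01 --format '{{.HostConfig.NanoCpus}}'
docker exec es01 cat /sys/fs/cgroup/cpu.max
```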

2. Start another Elasticsearch instance to act as the monitoring cluster
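
One way to do this (an assumption, the steps don't prescribe how) is to run the monitoring cluster in Docker as well, with security disabled to keep the local setup simple:

```
docker run --name es-monitoring --net elastic -p 9200:9200 \
  -e discovery.type=single-node \
  -e xpack.security.enabled=false \
  -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```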

3. Configure Kibana to connect to the monitoring cluster and start it
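
A minimal kibana.yml sketch, assuming the monitoring cluster from step 2 listens on localhost:9200 with security disabled:

```
# kibana.yml
elasticsearch.hosts: ["http://localhost:9200"]
# Treat Elasticsearch as containerized, i.e. the "Kibana configured to use
# container stats" part of the first test state.
monitoring.ui.container.elasticsearch.enabled: true
```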

4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it

Execute the below command next to the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster.

docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .

Use the elastic password and the CA certificate to configure the elasticsearch module:

  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
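
Then point the Metricbeat output at the monitoring cluster and start it. The host below assumes the monitoring cluster from step 2 is reachable on localhost:9200; adjust to your setup:

```
./metricbeat -e -E output.elasticsearch.hosts='["http://localhost:9200"]'
```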

5. Configure an alert in Kibana with a chosen threshold
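
You can do this through the Stack Monitoring UI or via the Kibana alerting API. The rule_type_id and params below are assumptions based on the Stack Monitoring CPU usage rule, so adjust them as needed:

```
curl -X POST "http://localhost:5601/api/alerting/rule" \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{
    "name": "CPU usage",
    "rule_type_id": "monitoring_alert_cpu_usage",
    "consumer": "monitoring",
    "schedule": { "interval": "1m" },
    "params": { "threshold": 85, "duration": "5m" }
  }'
```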

OBSERVE: An alert fires to inform you that there appears to be a misconfiguration, and reports the current value of the fallback metric (warning if the fallback metric is below the threshold, danger if it is above).

6. Set limit
First stop ES using docker stop es01, then set the limit using docker update --cpus=1 es01 and start it again using docker start es01.
After a brief delay you should now see the alert change to a warning about the limits having changed during the alert lookback period and stating that the CPU usage could not be confidently calculated.
Wait for the change event to pass out of the lookback window.
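
For reference, the step-6 commands in sequence:

```
docker stop es01
docker update --cpus=1 es01
docker start es01
```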

7. Generate load on the monitored cluster

Slingshot is an option. After you clone it, you need to update the package.json to match this change before running npm install.

Then you can modify the value for elasticsearch in the configs/hosts.json file like this:

"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }

Then you can start one or more instances of Slingshot like this:
npx ts-node bin/slingshot load --config configs/hosts.json

8. Observe the alert firing in the logs
Assuming you're using a connector for server log output, you should see a message like the one below once the threshold is breached:

`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`

The alert should also be visible in the Stack Monitoring UI overview page.

At this point you can stop Slingshot and confirm that the alert recovers once CPU usage goes back down below the threshold.

9. Stop the load and confirm that the rule recovers.

@tonyghiani tonyghiani added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services release_note:skip Skip the PR/issue when compiling release notes labels Sep 26, 2023
@apmmachine
Contributor

🤖 GitHub comments


Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@tonyghiani tonyghiani marked this pull request as ready for review September 27, 2023 08:32
@tonyghiani tonyghiani requested a review from a team as a code owner September 27, 2023 08:32
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

Contributor

@miltonhultgren miltonhultgren left a comment


Thanks for working on these changes!

I think we should go just a little bit further to clean up some noise that I introduced in the last PR which just wasn't needed.

Comment on lines 185 to 188
logger.warn(
  `CPU usage rule: Node "${node.key}" does not have resource limits configured. Fallback metric reports usage of ${cpuUsage}%.`
);
I'm having real doubts about whether we should log something or not.

On the one hand, this looks to me like a missed configuration, but maybe it's more common than I think?
And having this logged every time the rule executes, possibly for many nodes, feels like a spam risk: it makes the Kibana log grow, which means more cost to store that log for no real gain, and the customer might be fine with their setup anyway.

So maybe we should make it a debug level instead? But if we do, is it really useful? Will it help customers notice something is wrong?
I don't know if users really look at the Kibana logs anyway; the same goes for the log we emit below about missing values.

Maybe both of those cases should just rely on the fallback metric silently?

I just feel like in the last PR I included more changes than were needed for the bug fix.
I think the limits-change part is good, because that's a real cause of bugs in the calculation.

But the other two cases should probably just silently return the base CPU metric, since I'm not sure either really helps the user find issues.

So my vote is to remove the logging here, combine this check with the check below for whether the values are null, and not report any special status about that. What do you think?
That would bring the total set of changes closer to fixing only the original problem and remove a bunch of noise from the logs.


const limitsNotSet = node.quota_micros_max.value === -1 && node.quota_micros_min.value === -1;
// This check was mostly for the compiler to be happy about the types anyway
const statsMissing =
  node.max_usage_nanos.value === null &&
  node.min_usage_nanos.value === null &&
  node.max_periods.value === null &&
  node.min_periods.value === null &&
  node.quota_micros_max.value === null;

if (limitsNotSet || statsMissing) {
  const cpuUsage = node.average_cpu_usage_percent.value ?? undefined;
  return {
    clusterUuid: cluster.key as string,
    nodeId: node.key as string,
    cpuUsage,
    nodeName,
    ccs,
  };
}

Then, after that, we check whether the values differ or not.
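
A rough sketch of that follow-up check, using the same aggregation fields (assumed shape, not the merged code; the returned flag name is illustrative):

```
// If the max and min quota observed in the lookback window differ, the limit changed
// during the period and the usage cannot be confidently calculated.
const limitsChanged = node.quota_micros_max.value !== node.quota_micros_min.value;

if (limitsChanged) {
  return {
    clusterUuid: cluster.key as string,
    nodeId: node.key as string,
    limitsChanged: true, // illustrative flag name
    nodeName,
    ccs,
  };
}
```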

@tonyghiani tonyghiani requested a review from a team as a code owner September 27, 2023 11:10
@tonyghiani tonyghiani removed the request for review from a team September 27, 2023 12:31
Contributor

@miltonhultgren miltonhultgren left a comment


LGTM 🚀

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@tonyghiani tonyghiani merged commit 833c075 into elastic:main Sep 28, 2023
@tonyghiani tonyghiani deleted the fix-cgroup-limits-computation branch September 28, 2023 12:54
@kibanamachine kibanamachine added v8.11.0 backport:skip This commit does not require backporting labels Sep 28, 2023
@bdols

bdols commented Oct 19, 2023

Hello, I see a question posed about how common it is to run ES without container limits. The docs here https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html#k8s-elasticsearch-cpu almost encourage users (like myself) not to set limits, because of an old Kubernetes bug relating to overly aggressive throttling. And all the examples do not set container CPU limits.
Can I request that this get backported? It's desirable to see the CPU usage in the node listing overview for those who don't have access to the Kubernetes cluster, or for those who prefer to work in Kibana, I think, and restarting the cluster just to set the CPU limits across all nodes doesn't seem worth it on a few fronts.
Thank you for your consideration.

@smith
Contributor

smith commented Oct 30, 2023

@bdols this will be in 8.11 I think and I'll send a note to our docs team about the container limits. What version are you running?

@bdols

bdols commented Oct 31, 2023

running 8.8.2 here, hopefully we'll be able to catch up on the release cycles soon. thank you

@tonyghiani
Contributor Author

💚 All backports created successfully

| Status | Branch | Result |
|--------|--------|--------|
| ✅ | 8.10 | |

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

tonyghiani added a commit to tonyghiani/kibana that referenced this pull request Nov 7, 2023
Co-authored-by: Marco Antonio Ghiani <[email protected]>
Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 833c075)
miltonhultgren added a commit to miltonhultgren/kibana that referenced this pull request Dec 8, 2023
miltonhultgren added a commit that referenced this pull request Dec 8, 2023
Reverts #159351
Reverts #167244

Due to the many unexpected issues that these changes introduced, we've decided to revert them until we have better solutions for the problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire
#160905
- Any error fires an alert (since there is no other way to inform the
user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are
wrong

To address some of these issues we also need more functionality in the
alerting framework to be able to register secondary actions so that we
may trigger non-oncall workflows for when a rule faces issues with
evaluating the stats.

Original issue #116128
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 8, 2023
(cherry picked from commit 55bc6d5)
kibanamachine referenced this pull request Dec 8, 2023
# Backport

This will backport the following commits from `main` to `8.12`:
- [monitoring] Revert CPU Usage rule changes (#172913)


### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Milton
Hultgren","email":"[email protected]"},"sourceCommit":{"committedDate":"2023-12-08T15:25:23Z","message":"[monitoring]
Revert CPU Usage rule changes (#172913)\n\nReverts
https://github.com/elastic/kibana/pull/159351\r\nReverts
https://github.com/elastic/kibana/pull/167244\r\n\r\nDue to the many
unexpected issues that these changes introduced we've\r\ndecided to
revert these changes until we have better solutions for the\r\nproblems
we've learnt about.\r\n\r\nProblems:\r\n- Gaps in data cause alerts to
fire (see next point)\r\n- Normal CPU rescaling causes alerts to
fire\r\nhttps://github.com//issues/160905\r\n- Any error
fires an alert (since there is no other way to inform the\r\nuser about
the problems faced by the rule executor)\r\n- Many assumptions about
cgroups only being for container users are\r\nwrong\r\n\r\nTo address
some of these issues we also need more functionality in the\r\nalerting
framework to be able to register secondary actions so that we\r\nmay
trigger non-oncall workflows for when a rule faces issues
with\r\nevaluating the stats.\r\n\r\nOriginal issue
https://github.com/elastic/kibana/issues/116128","sha":"55bc6d505977e8831633cc76e0f46b2ca66ef559","branchLabelMapping":{"^v8.13.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","backport:prev-minor","v8.12.0","v8.13.0"],"number":172913,"url":"https://github.com/elastic/kibana/pull/172913","mergeCommit":{"message":"[monitoring]
Revert CPU Usage rule changes (#172913)\n\nReverts
https://github.com/elastic/kibana/pull/159351\r\nReverts
https://github.com/elastic/kibana/pull/167244\r\n\r\nDue to the many
unexpected issues that these changes introduced we've\r\ndecided to
revert these changes until we have better solutions for the\r\nproblems
we've learnt about.\r\n\r\nProblems:\r\n- Gaps in data cause alerts to
fire (see next point)\r\n- Normal CPU rescaling causes alerts to
fire\r\nhttps://github.com//issues/160905\r\n- Any error
fires an alert (since there is no other way to inform the\r\nuser about
the problems faced by the rule executor)\r\n- Many assumptions about
cgroups only being for container users are\r\nwrong\r\n\r\nTo address
some of these issues we also need more functionality in the\r\nalerting
framework to be able to register secondary actions so that we\r\nmay
trigger non-oncall workflows for when a rule faces issues
with\r\nevaluating the stats.\r\n\r\nOriginal issue
https://github.com/elastic/kibana/issues/116128","sha":"55bc6d505977e8831633cc76e0f46b2ca66ef559"}},"sourceBranch":"main","suggestedTargetBranches":["8.12"],"targetPullRequestStates":[{"branch":"8.12","label":"v8.12.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.13.0","labelRegex":"^v8.13.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/172913","number":172913,"mergeCommit":{"message":"[monitoring]
Revert CPU Usage rule changes (#172913)\n\nReverts
https://github.com/elastic/kibana/pull/159351\r\nReverts
https://github.com/elastic/kibana/pull/167244\r\n\r\nDue to the many
unexpected issues that these changes introduced we've\r\ndecided to
revert these changes until we have better solutions for the\r\nproblems
we've learnt about.\r\n\r\nProblems:\r\n- Gaps in data cause alerts to
fire (see next point)\r\n- Normal CPU rescaling causes alerts to
fire\r\nhttps://github.com//issues/160905\r\n- Any error
fires an alert (since there is no other way to inform the\r\nuser about
the problems faced by the rule executor)\r\n- Many assumptions about
cgroups only being for container users are\r\nwrong\r\n\r\nTo address
some of these issues we also need more functionality in the\r\nalerting
framework to be able to register secondary actions so that we\r\nmay
trigger non-oncall workflows for when a rule faces issues
with\r\nevaluating the stats.\r\n\r\nOriginal issue
https://github.com/elastic/kibana/issues/116128","sha":"55bc6d505977e8831633cc76e0f46b2ca66ef559"}}]}]
BACKPORT-->

Co-authored-by: Milton Hultgren <[email protected]>