KEP-2371: Add cgroup metrics + CRI implementation plan #3559

danielye11 · 2022-09-27T21:27:51Z

One-line PR description: Update the KEP with the cgroup stats for cAdvisor metrics and add CRI implementation details

Issue link: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2371-cri-pod-container-stats

Other comments:

k8s-ci-robot · 2022-09-27T21:27:59Z

Welcome @danielye11!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

linux-foundation-easycla · 2022-09-27T21:27:59Z

The committers listed above are authorized under a signed CLA.

✅ login: danielye11 (822dae3)

k8s-ci-robot · 2022-09-27T21:28:00Z

Hi @danielye11. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

haircommander · 2022-09-28T17:06:38Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+|                              |N/A                   |container_memory_max_usage_bytes                |N/A                             |cAdvisor              |CRI or N/A                 | memory.max_usage_in_bytes  |  memory.max
+|                              |N/A                   |container_memory_swap                           |N/A                             |cAdvisor              |CRI or N/A                 | (memory.stat) swap  |  memory.swap.current - memory.current
+|ProcessStats                  |ProcessCount          |container_processes                             |Pod                             |cAdvisor              |CRI                        |  Process
+|AcceleratorStats              |Make                  |N/A (too lazy to find the mapping)              |Container                       |cAdvisor              |cAdvisor or N/A            |  accelerators/nvidia.go    | accelerators/nvidia.go


AFAIK we still plan on dropping these. do we need to include them in this change?

haircommander · 2022-09-28T17:07:08Z

table update LGTM, one question about a new addition. You also need to sign the CLA, and I would prefer if all of the commits were squashed together

haircommander · 2022-10-03T13:25:53Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+message ListPodSandboxMetricsResponse {
+    repeated PodSandboxMetrics pod_metrics= 1;
+    repeated ContainerMetrics container_metrics = 2;
+}
+
+message PodSandboxMetrics {
+    string pod_sandbox_id = 1;
+    repeated Metric metrics = 2;
+}
+
+message ContainerMetrics {
+    string container_id = 1;
+    repeated Metric metrics = 2;
+}


based on the structure of other CRI calls, I think I'd expect this more to be

message ListPodSandboxMetricsResponse {
    repeated PodSandboxMetrics pod_metrics = 1;
}

message PodSandboxMetrics {
    string pod_sandbox_id = 1;
    repeated Metric metrics = 2;
    repeated ContainerMetrics container_metrics = 3;
}

Restructured
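A rough Go sketch of why the nested layout suggested above can be convenient for a consumer: container metrics arrive already grouped under their pod, so no join on IDs is needed. The structs are hand-written stand-ins for the proto messages, not generated CRI code. The `Name` and `Value` fields on `Metric` and the helper `containersByPod` are assumptions made only for illustration.

```go
package main

import "fmt"

// Metric is a simplified stand-in for the KEP's Metric message.
// Name and Value are assumed fields, not taken from the PR.
type Metric struct {
	Name  string
	Value float64
}

// ContainerMetrics mirrors the proto message of the same name.
type ContainerMetrics struct {
	ContainerID string
	Metrics     []Metric
}

// PodSandboxMetrics follows the nested layout: each pod carries
// its own metrics plus the metrics of its containers.
type PodSandboxMetrics struct {
	PodSandboxID     string
	Metrics          []Metric
	ContainerMetrics []ContainerMetrics
}

// ListPodSandboxMetricsResponse holds only pods in the nested layout.
type ListPodSandboxMetricsResponse struct {
	PodMetrics []PodSandboxMetrics
}

// containersByPod (hypothetical helper) maps each pod sandbox ID to the
// IDs of the containers reported under it.
func containersByPod(resp ListPodSandboxMetricsResponse) map[string][]string {
	out := make(map[string][]string, len(resp.PodMetrics))
	for _, pod := range resp.PodMetrics {
		ids := make([]string, 0, len(pod.ContainerMetrics))
		for _, c := range pod.ContainerMetrics {
			ids = append(ids, c.ContainerID)
		}
		out[pod.PodSandboxID] = ids
	}
	return out
}

func main() {
	resp := ListPodSandboxMetricsResponse{
		PodMetrics: []PodSandboxMetrics{{
			PodSandboxID: "pod-a",
			ContainerMetrics: []ContainerMetrics{
				{ContainerID: "ctr-1"},
				{ContainerID: "ctr-2"},
			},
		}},
	}
	for pod, ids := range containersByPod(resp) {
		fmt.Println(pod, ids)
	}
}
```

With the original flat layout, the same grouping would need some other way to associate each `ContainerMetrics` entry with its pod, which the flat messages shown in the diff do not provide.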

haircommander · 2022-10-03T13:26:11Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+message ListPodSandboxMetricsRequest {} 
+
+message ListPodSandboxMetricsResponse {
+    repeated PodSandboxMetrics pod_metrics= 1;


nit: missing space between `s` and `=`

haircommander · 2022-10-03T13:28:02Z

keps/sig-node/2371-cri-pod-container-stats/README.md

@@ -286,8 +286,7 @@ as cAdvisor is fine tuned to perform in an adequate manner.
 ### Stats Summary API

 #### CRI Implementation
-The CRI implementation will need to be extended to support reporting the full set of container-level from the [Summary API](#summary-container-stats-object).
-
+The CRI implementation will need to be extended to support reporting the full set of container-level from the [Summary API](#summary-container-stats-object). A new GRPC call will also be added to the CRI that allows reporting for metrics currently exported by cAdvisor, but are outside the scope of the Summary API. This new GRPC call will return a Prometheus metric based response which Kubelet can export. Additionally, a feature gate will be added to only report Prometheus based metrics from the CRI when calling /stats endpoint. The additional metrics we support will need to be added to the individual container runtimes.


Additionally, a feature gate will be added to only report Prometheus based metrics from the CRI when calling /stats endpoint.

1. I thought we'd reuse `PodAndContainerStatsFromCRI` for this?
2. Isn't it `/metrics/cadvisor`, not `/stats`?
3. nit: I believe the capitalization is gRPC.

Good point, updated in commit

haircommander · 2022-10-03T14:38:51Z

/ok-to-test

haircommander · 2022-10-03T18:55:05Z

linter is saying to run hack/update-toc.sh

haircommander · 2022-10-03T19:18:03Z

/lgtm

@bobbypage @mrunalp @derekwaynecarr PTAL

bobbypage · 2022-10-03T21:36:58Z

Few small comments.

Thanks for updating the KEP, LGTM on the changes proposed.

/lgtm

bobbypage · 2022-10-03T22:03:13Z

One more thing, can you please also update https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2371-cri-pod-container-stats/kep.yaml#L20 to update latest-milestone from v1.25 to v1.26

This will address comment in #2371 (comment)

Thanks!

danielye11 · 2022-10-03T22:23:19Z

Updated latest-milestone to 1.26

bobbypage · 2022-10-03T22:36:51Z

Thanks for updating

/lgtm

bobbypage · 2022-10-03T22:40:24Z

/cc @dashpole

per SIG instrumentation

jpbetz · 2022-10-04T15:18:52Z

keps/sig-node/2371-cri-pod-container-stats/README.md

@@ -286,8 +287,7 @@ as cAdvisor is fine tuned to perform in an adequate manner.
 ### Stats Summary API

 #### CRI Implementation
-The CRI implementation will need to be extended to support reporting the full set of container-level from the [Summary API](#summary-container-stats-object).
-
+The CRI implementation will need to be extended to support reporting the full set of container-level from the [Summary API](#summary-container-stats-object). A new gRPC call will also be added to the CRI that allows reporting for metrics currently exported by cAdvisor, but are outside the scope of the Summary API. This new gRPC call will return a Prometheus metric based response which Kubelet can export. Additionally, `PodAndContainerStatsFromCRI` feature gate support will be added to only report Prometheus based metrics from the CRI when calling `/metrics/cadvisor` endpoint when the feature gate is enabled. The additional metrics we support will need to be added to the individual container runtimes.


From the previous release, the PRR Questionnaire section for "Does enabling the feature change any default behavior?" said "Enabling this behavior means some stats endpoints will not be filled: some entries in /metrics/cadvisor".

Is this still accurate with this change? Is there anything about the structure of the metrics or what metrics are available that should be included in that PRR section?

yeah this is still true. accelerator metrics, for instance, are being dropped. The table at the top has a column dedicated to saying whether we're aiming to support them
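To make the behavior change concrete, here is a small Go sketch of a gate-controlled source for `/metrics/cadvisor`. It is not kubelet code: the `collector` type and `pickCollector` function are hypothetical, and the sample metric lines only illustrate that a CRI-backed source may expose fewer metrics (for example, no accelerator metrics) than cAdvisor does.

```go
package main

import "fmt"

// collector returns metric samples as Prometheus text lines.
// This type is an assumption made for the sketch.
type collector func() []string

// pickCollector (hypothetical) chooses what backs /metrics/cadvisor:
// the CRI when the PodAndContainerStatsFromCRI gate is on, cAdvisor otherwise.
func pickCollector(podAndContainerStatsFromCRI bool, cri, cadvisor collector) collector {
	if podAndContainerStatsFromCRI {
		return cri
	}
	return cadvisor
}

func main() {
	cri := func() []string {
		return []string{"container_processes 3"}
	}
	cadvisor := func() []string {
		return []string{
			"container_processes 3",
			"container_accelerator_memory_total_bytes 0",
		}
	}

	// With the gate enabled, the accelerator sample is absent.
	for _, line := range pickCollector(true, cri, cadvisor)() {
		fmt.Println(line)
	}
}
```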

dashpole · 2022-10-06T14:56:54Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+}
+
+message Metric {
+    int64 timestamp = 1;


Are container runtimes expected to return cached (with timestamp in the past) metrics?

What if a container runtime wants to return "fresh" metrics each time this is called? Is there a way to omit the timestamp from the metric? I believe it is recommended to omit the timestamp from the Prometheus exposition when that is the case.

I believe container runtimes will return "fresh" metrics whenever the gRPC call ListPodSandboxMetrics is called (at least on the containerd side), so I think it makes sense to have the timestamp. Depending on the container runtime implementation, metrics may also be cached. We could keep Metric and add a similar type that does not include a timestamp.

we also could interpret a timestamp of 0 to mean omitted and leave it up to the CRI to say when they were collected (or say it was instantaneous)

If metrics are fresh, there is no need to attach timestamps. The timestamp will not be meaningfully different from the scrape time, and should not be attached to Prometheus metrics.

Disk usage metrics in particular can be very expensive to collect, and are likely to be cached.

A timestamp of 0 meaning no timestamp works for me, but should be documented in the API.

I don't think we want to enforce that they must be fresh. A CRI impl may want to cache them (cri-o may...). I think timestamp 0 as fresh is a good way to express it.
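The "timestamp 0 means no timestamp" convention discussed above could look roughly like this on the exporting side. This is a sketch under stated assumptions, not the kubelet's implementation: the `Name` and `Value` fields and the `expositionLine` helper are hypothetical, labels are left out for brevity, and the timestamp is assumed to be in nanoseconds since the epoch (the thread does not state its unit). The Prometheus text format expects an optional trailing timestamp in milliseconds.

```go
package main

import (
	"fmt"
	"strconv"
)

// Metric is a simplified stand-in for the KEP's Metric message.
// Only the timestamp field appears in the PR diff; Name and Value are assumed.
type Metric struct {
	Name      string
	Value     float64
	Timestamp int64 // assumed nanoseconds since epoch; 0 means "no timestamp"
}

// expositionLine (hypothetical) renders one sample in the Prometheus text
// format, leaving out the timestamp when the runtime reported 0.
func expositionLine(m Metric) string {
	line := m.Name + " " + strconv.FormatFloat(m.Value, 'g', -1, 64)
	if m.Timestamp != 0 {
		// Convert the assumed nanoseconds to the milliseconds Prometheus expects.
		line += " " + strconv.FormatInt(m.Timestamp/1_000_000, 10)
	}
	return line
}

func main() {
	fresh := Metric{Name: "container_processes", Value: 3}
	cached := Metric{
		Name:      "container_memory_swap",
		Value:     4096,
		Timestamp: 1665067014000000000,
	}
	fmt.Println(expositionLine(fresh))
	fmt.Println(expositionLine(cached))
}
```

A runtime that caches expensive values such as disk usage would set the collection time, while one that samples on every call would leave the field at 0 and let the scrape time stand in.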

dashpole

This is a reasonable approach. A few drawbacks that are worth including in the Drawbacks section:

- This doesn't enforce that container runtimes continue to support the full /metrics/cadvisor endpoint. CRI implementations appear to be free to deviate and produce different metrics as they see fit.
- Compared with the container runtime exposing these metrics directly, there is significant complexity (I have to implement a new CRI function using Prometheus metrics as input), and some overhead from converting between Prometheus, the CRI format, and back.

Update README.md Update README.md Add cpu stat linux cgroups v1 Add additional cgroup stats Add some more v1 stats Add v2 cpu metrics Add spacing Add additional v2 stats Add spacing Add network stats Fix spacing remove column one network stat commit add Add spacing Add spacing Add network stats Add stats Add v2 process usage stats Add cri implementation plan Update KEP with CRI API Refactor CRI implementation Resolve reviewer comments Fix capitalization Add backticks Fix linting Update latest-milestone Clarify fresh metrics

derekwaynecarr · 2022-10-06T19:09:42Z

for sig-node

/lgtm
/approve

k8s-ci-robot · 2022-10-06T19:09:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielye11, derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/sig-node/OWNERS~~ [derekwaynecarr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Sep 27, 2022

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 27, 2022

k8s-ci-robot requested review from dchen1107 and derekwaynecarr September 27, 2022 21:28

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 27, 2022

haircommander reviewed Sep 28, 2022


haircommander mentioned this pull request Sep 29, 2022

Enabling feature PodAndContainerStatsFromCRI breaks metric collection kubernetes/kubernetes#111276

Open

danielye11 force-pushed the cri-api branch 3 times, most recently from 7a39434 to 822dae3 Compare October 1, 2022 04:16

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 1, 2022

haircommander reviewed Oct 3, 2022


This was referenced Oct 3, 2022

cAdvisor-less, CRI-full Container and Pod Stats #2371

Open

kubelet: add support for reverse-proxying endpoints from config kubernetes/kubernetes#111357

Closed

kubelet: add support for reverse-proxying metrics from CRI kubernetes/kubernetes#111355

Closed

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 3, 2022

danielye11 force-pushed the cri-api branch from 4c9ac6d to b72232c Compare October 3, 2022 16:25

danielye11 force-pushed the cri-api branch from 1fe9acb to ff98495 Compare October 3, 2022 19:02

k8s-ci-robot assigned haircommander Oct 3, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2022

k8s-ci-robot assigned bobbypage Oct 3, 2022

danielye11 force-pushed the cri-api branch from ff98495 to ce83059 Compare October 3, 2022 22:17

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2022

k8s-ci-robot requested a review from dashpole October 3, 2022 22:40

jpbetz reviewed Oct 4, 2022


dashpole reviewed Oct 6, 2022


danielye11 force-pushed the cri-api branch from ce83059 to ac24867 Compare October 6, 2022 18:42

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2022

k8s-ci-robot assigned derekwaynecarr Oct 6, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2022

k8s-ci-robot merged commit f0a1067 into kubernetes:master Oct 6, 2022

k8s-ci-robot added this to the v1.26 milestone Oct 6, 2022
