[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905

TylerHelmuth · 2023-08-04T15:50:06Z

Component(s)

No response

Is your feature request related to a problem? Please describe.

The receiver reports the current memory and cpu usage for pods and containers, but unless you know the limits set for the containers in the pod it is hard to determine if that value is getting too high.

Describe the solution you'd like

A new set of pod and container metrics that represents the utilization of memory and cpu that the pod/container is consuming based on the usage value and the configured limits.

k8s.pod.memory.usagePercent
k8s.pod.cpu.usagePercent
k8s.container.memory.usagePercent
k8s.container.cpu.usagePercent

github-actions · 2023-08-04T15:50:23Z

Pinging code owners:

receiver/kubeletstats: @dmitryax

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2023-08-04T15:50:31Z

Pinging code owners for receiver/kubeletstats: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

jmichalek132 · 2023-08-14T11:11:49Z

I mean there is already another metric for the limits, isn't comparing the two in your storage/query pipeline enough?

TylerHelmuth · 2023-08-14T20:13:38Z

@jmichalek132 I am not aware of a metric collected by kubletstats for individual container resource limits or representing the sum of the container limits for the whole pod.

But I don't feel we should limit what types of metrics the collector can produce. As a storage/query-agnostic solution the Collector should be able to produce any metric that the user needs. We already perform other computations while collecting metrics. I don't think we should restrict ourselves from producing meaningful metrics based on other metrics (in this case a the percentage of UsageNanoCores use compared to a limit).

jinja2 · 2023-08-18T16:00:19Z

If we add utilization compared to limits, we should also introduce utilization compared to requests. Or at least make the metric reflect whether is relative to the limit or request.

dmitryax · 2023-08-18T17:34:59Z

If we add utilization compared to limits, we should also introduce utilization compared to requests. Or at least make the metric reflect whether is relative to the limit or request.

This is a good point. I'm curious what metrics other k8s monitoring agents provide regarding resource utilization.

TylerHelmuth · 2023-08-18T18:19:41Z

I will do some research. A metric with respect to resources.requests is doable, but would be less value that resources.limits in my opinion since Requests are a scheduling concern, not a runtime concern.

TylerHelmuth · 2023-08-18T19:11:08Z

Seeing when your usage is above your requests could be interesting as this would indicate the you've under-provisioned your containers/pod and that Kubernetes might evict it.

Going back to the OTel naming specification, utilization is currently reserved to mean the fraction of usage out of its limit, where limit is defined as the "constant, known total amount of something". Since k8s's will kill pods above its defined resources.limits, I think .utilization should be reserved for representing the usage compared to resources.limits.

I think measurement comparing usage to resources.requests is a good idea, but I don't think it should be named utilization. I also think such a metric should be allowed to go above 1 (100%), since tracking how often the metric is above 1 is what is interesting.

As for research, I can share that we (Honeycomb) have an agent that emits a utilization metric based on limits. I also found some DataDog documentation recommending utilization compared to limits and compared to requests. Kube-state-metrics reports the values individually and expects you to compare them yourself, but I am still staunchly in favor of calculating the result within the Collector.

jinja2 · 2023-08-22T20:03:58Z

I should've given more context on why I think usage wrt request is also useful, in addition to the limits. Requests (minimum amount of guaranteed resources) are relevant even after pods have been scheduled, and being able to track usage w.r.t requests is important for users to ensure that optimal resource requests are set to keep their applications running with reduced disruptions. One runtime use is when a node is under resource pressure, one of the factors taken into consideration when deciding which pods to evict is the usage w.r.t requests, so setting appropriate requests is important to reduce chances of being evicted.

Another use case for the metric could be when making vertical scaling decisions. Users who don't use Guaranteed QoS strategy (request = limit), usually take into account if an application has been using more/less than requested resources to right-size their workloads. It can also be useful to detect changes in app behavior/benchmarks, for e.g., a container consistently using more than requested memory after a new version rollout, could indicate that recent changes might've introduced some additional unaccounted for memory overhead, and the minimum memory requirement for the application to run has changed.

CPU request decides the weighted distribution of CPU time for containers, so tracking usage wrt requests to set appropriate values for req is important to avoid cpu starvation. There are some resource allocation guidelines where users are recommended to only set appropriate CPU requests and no limits (cause limits can prevent processes from utilizing idle CPU cycles as they become available on the node).

jinja2 · 2023-08-22T20:07:31Z

Re: the proposed metric k8s.pod.memory.usagePercent and k8s.pod.cpu.usagePercent, since limits/requests are set on containers, I assume these are calculated as sum of container limits. It would make sense to not calculate for the pod if any of its containers does not have the limit set. Is this the proposed plan?

Going back to the OTel naming specification, utilization is currently reserved to mean the fraction of usage out of its limit, where limit is defined as the "constant, known total amount of something". Since k8s's will kill pods above its defined resources.limits, I think .utilization should be reserved for representing the usage compared to resources.limits.

I think a request can be seen as a limit as defined by the convention since it is a known quantity we can meaningfully track usage against. But I can't think of a good way to indicate if the utilization is wrt to request or limit w/o having it part of the metric name as container.cpu.request.utilization which feels verbose and weird.

TylerHelmuth · 2023-08-22T20:41:22Z

It would make sense to not calculate for the pod if any of its containers does not have the limit set. Is this the proposed plan?

Yes this is how I have implemented the metrics for pods.

I think a request can be seen as a limit as defined by the convention since it is a known quantity we can meaningfully track usage against.

I agree that it is a known value we can track against, but I wouldn't consider it a limit in the sense that OTel has defined the term limit since Kubernetes allows the memory or cpu usage to go above the configured requests. Otel's current naming conventions expect that utilization be fixed between [0,1], which makes sense when limit is an actual constraint. But we definitely want to allow a metric related to requests to go above 1.

Although it wasn't my original goal with this issue, I think that a metric relating the usage to the requests is a good idea. Reading through the specification, the definitions use SHOULD instead of MUST, which does give us an opening to treat requests as a limit, use the name utilization, and go above 100. I see no other defined name in the specification that fits better.

I will bring this topic up in the Collector SIG meeting tomorrow. @jinja2 are you able to attend? If we use utilization for both forms of the metric I think the suggestion to add the type in the name is the best solution. The value could go before or after. If we choose to use the term utilization with respect to request I think we should move forward with the following metrics:

k8s.pod.cpu.limits.utilization
k8s.pod.cpu.requests.utilization
k8s.pod.memory.limits.utilization
k8s.pod.memory.requests.utilization
container.cpu.limits.utilization
container.cpu.requests.utilization
container.memory.limits.utilization
container.memory.requests.utilization

If utilization cannot be used with respect to requests due to requests not being a hard limit, then we will have to come up with another name for those metrics, and the metrics measuring utilization with respect to the limits can be called *.utilization.

jinja2 · 2023-08-22T21:11:59Z

@TylerHelmuth Sgtm, and, I can attend the SIG meeting tomorrow.

TylerHelmuth · 2023-08-23T17:39:29Z

Update from the SIG meeting today. The community liked these metrics in the form discussed above. The next step is to submit a PR to the semantic conventions introducing the concept of a soft limit and removing the restriction that utilization must be between 0 and 1.

Separately we need to join the semantic discussion around plurality.

jmichalek132 · 2023-08-25T12:09:54Z

@TylerHelmuth the k8sclusterreceiver already produces metrics for cpu and memory limits.
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/k8sclusterreceiver/metadata.yaml#L187

TylerHelmuth · 2023-08-25T12:51:03Z

@jmichalek132 those metrics are the actual configured values. They are limit metrics. I'm proposing we produce some new utilization metrics based on those limits.

dmitryax · 2023-08-28T00:50:29Z

@jmichalek132, thanks for bringing this up. That made me think more about the naming for the new utilization metrics.

Since plural requests and limits apply to all the resources in the k8s spec, it may be more natural to keep singular limit and request when applied to one resource. I think we should keep it consistent with the existing k8scluster metrics and introduce:

k8s.container.cpu_limit_utilization
k8s.container.memory_limit_utilization

Note that we cannot have k8s.container.cpu_limit.utilization because k8s.container.cpu_limit should not be introduced as a namespace if there is a metric with the same name already. But this rule might change as a result of open-telemetry/semantic-conventions#50

TylerHelmuth · 2023-08-28T15:12:52Z

I am also ok with those names. If we did that convention the list of new metrics would be:

k8s.pod.cpu_limit_utilization
k8s.pod.cpu_request_utilization
k8s.pod.memory_limit_utilization
k8s.pod.memory_request_utilization
k8s.container.cpu_limit_utilization
k8s.container.cpu_request_utilization
k8s.container.memory_limit_utilization
k8s.container.memory_request_utilization

jmichalek132 · 2023-08-29T08:46:29Z

The main reason I brought it up since there is a way to collect the usage and limits, it's quite easy to calculate the desired data in the backend storing metrics, so would it be worth pre-calculating this (since these metrics will be produced for all pods / containers, there will be quite a few of them).

TylerHelmuth · 2023-08-29T15:59:10Z

@jmichalek132 you're right that backends could calculate these metrics (assuming they can do some arithmetic, which most can). But in order for the backend to be the source of these metrics you would need the k8sclusterreceiver, which is a separate receiver and has specific deployment requirements. I don't like the idea of setting a precedence that a metric that can be calculated in receiver A should not be included, even as optional, because a combination of metrics from receiver A and some other receiver B could produce the metrics.

Users should be able to look to the kubeletstats receiver for their pod and container metrics needs. It is reasonable for users to start off with only the kubeletstats receiver and not the k8scluster receiver, especially since introducing the k8sclusterreciever means introducing a second deployment.

**Description:** Add metadata map for pod and container requests and limits. Will be used to calculate new metrics in a future PR. **Link to tracking Issue:** <Issue number if applicable> #24905 **Testing:** <Describe what testing was performed and which tests were added.> added unit tests

**Description:** Adds new `k8s.pod.memory.utilization` and `container.memory.utilization` metrics that represent the ratio of memory used vs limits set. The metrics are only emitted for pods/containers that have defined resource limits. It takes advantage of the pod metadata to acquire the container limits. A pod limit is computed as the sum of all the container limits and if any container limit is zero or undefined the pod limit is also considered undefined. **Link to tracking Issue:** Related to #24905 **Testing:** Unit tests and local testing

**Description:** Adds new CPU utilization metrics with respect to pod/container CPU limits and requests **Link to tracking Issue:** <Issue number if applicable> Closes #24905 **Testing:** <Describe what testing was performed and which tests were added.> Added new unit tests and tested locally

**Description:** Add metadata map for pod and container requests and limits. Will be used to calculate new metrics in a future PR. **Link to tracking Issue:** <Issue number if applicable> open-telemetry#24905 **Testing:** <Describe what testing was performed and which tests were added.> added unit tests

…metry#25894) **Description:** Adds new `k8s.pod.memory.utilization` and `container.memory.utilization` metrics that represent the ratio of memory used vs limits set. The metrics are only emitted for pods/containers that have defined resource limits. It takes advantage of the pod metadata to acquire the container limits. A pod limit is computed as the sum of all the container limits and if any container limit is zero or undefined the pod limit is also considered undefined. **Link to tracking Issue:** Related to open-telemetry#24905 **Testing:** Unit tests and local testing

…ry#27276) **Description:** Adds new CPU utilization metrics with respect to pod/container CPU limits and requests **Link to tracking Issue:** <Issue number if applicable> Closes open-telemetry#24905 **Testing:** <Describe what testing was performed and which tests were added.> Added new unit tests and tested locally

**Description:** Starts the name change processor for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to #24905 Related to #27885

…elemetry#25901) **Description:** Starts the name change processor for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to open-telemetry#24905 Related to open-telemetry#27885

TylerHelmuth added enhancement New feature or request needs triage New item requiring triage priority:p2 Medium receiver/kubeletstats and removed needs triage New item requiring triage labels Aug 4, 2023

TylerHelmuth self-assigned this Aug 15, 2023

TylerHelmuth mentioned this issue Aug 15, 2023

[receiver/kubeletstats] Add new percent-based cpu and memory metrics #25835

Closed

TylerHelmuth mentioned this issue Aug 18, 2023

[receiver/kubeletstats] Add new memory utilization metrics #25894

Merged

TylerHelmuth mentioned this issue Aug 18, 2023

[receiver/kubeletstats] Start name change for cpu.utilization #25901

Merged

This was referenced Aug 23, 2023

Reduce restrictions on which metrics may be named utilization open-telemetry/semantic-conventions#280

Merged

Proposal: don't pluralize metric namespaces open-telemetry/semantic-conventions#212

Closed

TylerHelmuth mentioned this issue Sep 5, 2023

Question about the meaning of container.cpu.utilization from kubeletstats receiver #26463

Closed

TylerHelmuth mentioned this issue Sep 14, 2023

[receiver/kubeletstats] Add resource getters #26690

Merged

TylerHelmuth mentioned this issue Sep 29, 2023

[receiver/kubeletstats] Add new CPU utilization metrics #27276

Merged

TylerHelmuth closed this as completed in #27276 Oct 6, 2023

TylerHelmuth mentioned this issue Oct 20, 2023

[receiver/kubeletstat] Review cpu.utilization naming #27885

Open

jinja2 mentioned this issue Oct 20, 2023

REQUEST: New membership for jinja2 open-telemetry/community#1749

Closed

6 tasks

ChrsMark mentioned this issue Dec 12, 2023

Add container metric fields (from ECS) open-telemetry/semantic-conventions#282

Merged

dmitryax pushed a commit that referenced this issue Jan 12, 2024

[receiver/kubeletstats] Start name change for cpu.utilization (#25901)

74294cb

**Description:** Starts the name change processor for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to #24905 Related to #27885

a-thaler mentioned this issue Feb 29, 2024

Fix and review cpu utilization metrics kyma-project/telemetry-manager#838

Closed

ChrsMark mentioned this issue May 29, 2024

[receiver/kubeletstats] Add k8s.container.cpu.node.utilization metric #32295

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905

[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905

TylerHelmuth commented Aug 4, 2023 •

edited

Loading

github-actions bot commented Aug 4, 2023

github-actions bot commented Aug 4, 2023

jmichalek132 commented Aug 14, 2023

TylerHelmuth commented Aug 14, 2023

jinja2 commented Aug 18, 2023 •

edited

Loading

dmitryax commented Aug 18, 2023

TylerHelmuth commented Aug 18, 2023

TylerHelmuth commented Aug 18, 2023

jinja2 commented Aug 22, 2023

jinja2 commented Aug 22, 2023

TylerHelmuth commented Aug 22, 2023

jinja2 commented Aug 22, 2023

TylerHelmuth commented Aug 23, 2023

jmichalek132 commented Aug 25, 2023

TylerHelmuth commented Aug 25, 2023

dmitryax commented Aug 28, 2023

TylerHelmuth commented Aug 28, 2023

jmichalek132 commented Aug 29, 2023

TylerHelmuth commented Aug 29, 2023

[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905

[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905

Comments

TylerHelmuth commented Aug 4, 2023 • edited Loading

Component(s)

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

github-actions bot commented Aug 4, 2023

github-actions bot commented Aug 4, 2023

jmichalek132 commented Aug 14, 2023

TylerHelmuth commented Aug 14, 2023

jinja2 commented Aug 18, 2023 • edited Loading

dmitryax commented Aug 18, 2023

TylerHelmuth commented Aug 18, 2023

TylerHelmuth commented Aug 18, 2023

jinja2 commented Aug 22, 2023

jinja2 commented Aug 22, 2023

TylerHelmuth commented Aug 22, 2023

jinja2 commented Aug 22, 2023

TylerHelmuth commented Aug 23, 2023

jmichalek132 commented Aug 25, 2023

TylerHelmuth commented Aug 25, 2023

dmitryax commented Aug 28, 2023

TylerHelmuth commented Aug 28, 2023

jmichalek132 commented Aug 29, 2023

TylerHelmuth commented Aug 29, 2023

TylerHelmuth commented Aug 4, 2023 •

edited

Loading

jinja2 commented Aug 18, 2023 •

edited

Loading