[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905
Pinging code owners for receiver/kubeletstats: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.
I mean there is already another metric for the limits, isn't comparing the two in your storage/query pipeline enough?
@jmichalek132 I am not aware of a metric collected by kubeletstats for individual container resource limits or one representing the sum of the container limits for the whole pod. But I don't feel we should limit what types of metrics the collector can produce. As a storage/query-agnostic solution, the Collector should be able to produce any metric that the user needs. We already perform other computations while collecting metrics, and I don't think we should restrict ourselves from producing meaningful metrics based on other metrics (in this case, the percentage of usage compared to the configured limit).
If we add utilization compared to limits, we should also introduce utilization compared to requests.
This is a good point. I'm curious what metrics other k8s monitoring agents provide regarding resource utilization. |
I will do some research. A metric with respect to requests is worth considering as well.
Seeing when your usage is above your requests could be interesting, as this would indicate that you've under-provisioned your containers/pod and that Kubernetes might evict it. Going back to the OTel naming specification, I think a measurement comparing usage to requests also makes sense.

As for research, I can share that we (Honeycomb) have an agent that emits a utilization metric based on limits. I also found some DataDog documentation recommending utilization compared to limits and compared to requests. Kube-state-metrics reports the values individually and expects you to compare them yourself, but I am still staunchly in favor of calculating the result within the Collector.
I should've given more context on why I think usage wrt requests is useful. Another use case for the metric could be when making vertical scaling decisions. Users who don't use the Guaranteed QoS strategy (request = limit) usually take into account whether an application has been using more or less than its requested resources to right-size their workloads. It can also be useful to detect changes in app behavior/benchmarks; for example, a container consistently using more than its requested memory after a new version rollout could indicate that recent changes introduced some additional unaccounted-for memory overhead and that the minimum memory requirement for the application has changed.

CPU request decides the weighted distribution of CPU time for containers, so tracking usage wrt requests to set appropriate request values is important to avoid CPU starvation. There are some resource allocation guidelines where users are recommended to only set appropriate CPU requests and no limits (because limits can prevent processes from utilizing idle CPU cycles as they become available on the node).
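For concreteness, here is a minimal sketch of the resource strategy described above, written with the Kubernetes `corev1` Go types rather than anything from this receiver: memory pinned with request == limit, CPU given only a request so it can burst into idle cycles. The quantities are made up for illustration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	res := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			// CPU request controls the container's weighted share under contention.
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
		Limits: corev1.ResourceList{
			// Memory request == limit gives a hard, predictable ceiling.
			corev1.ResourceMemory: resource.MustParse("256Mi"),
			// No CPU limit on purpose, so the container can use idle CPU cycles on the node.
		},
	}
	fmt.Printf("cpu request: %s, memory limit: %s\n",
		res.Requests.Cpu().String(), res.Limits.Memory().String())
}
```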
Re: the proposed metric
I think a …
Yes this is how I have implemented the metrics for pods.
I agree that it is a known value we can track against, but I wouldn't consider it a limit in the sense that OTel has defined the term, since Kubernetes allows the memory or CPU usage to go above the configured requests. OTel's current naming conventions expect that utilization be fixed between [0, 1], which makes sense when the limit is an actual constraint. But we definitely want to allow a metric related to requests to go above 1.

Although it wasn't my original goal with this issue, I think that a metric relating the usage to the requests is a good idea. Reading through the specification, the definitions use `limit` to mean a constant, enforced upper bound.

I will bring this topic up in the Collector SIG meeting tomorrow. @jinja2 are you able to attend? If we use …
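To make the [0, 1] point concrete, here is a small self-contained Go sketch (the `utilization` helper and the sample values are hypothetical, not the receiver's code): the limit-based ratio stays at or below 1 while the limit is enforced, but the request-based ratio can legitimately exceed 1.

```go
package main

import "fmt"

// utilization returns usage divided by the given reference value (a limit or a request).
// It returns ok=false when the reference is unset, mirroring the idea that the metric
// should not be emitted when no limit/request is configured.
func utilization(usageBytes, referenceBytes int64) (ratio float64, ok bool) {
	if referenceBytes <= 0 {
		return 0, false
	}
	return float64(usageBytes) / float64(referenceBytes), true
}

func main() {
	const (
		usage   = int64(300 * 1024 * 1024) // 300Mi currently in use
		limit   = int64(512 * 1024 * 1024) // hard limit: usage/limit stays within [0, 1]
		request = int64(256 * 1024 * 1024) // "known value": usage/request may exceed 1
	)
	if r, ok := utilization(usage, limit); ok {
		fmt.Printf("limit utilization:   %.2f\n", r) // 0.59
	}
	if r, ok := utilization(usage, request); ok {
		fmt.Printf("request utilization: %.2f\n", r) // 1.17, above 1 is expected
	}
}
```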
If …
@TylerHelmuth Sgtm, and I can attend the SIG meeting tomorrow.
Update from the SIG meeting today: the community liked these metrics in the form discussed above. The next step is to submit a PR to the semantic conventions introducing the concept of a soft limit and removing the restriction that utilization values stay within [0, 1]. Separately, we need to join the semantic-conventions discussion around plurality.
@TylerHelmuth the k8sclusterreceiver already produces metrics for cpu and memory limits. |
@jmichalek132 those metrics are the actual configured values. They are not utilization metrics that compare current usage against those values.
@jmichalek132, thanks for bringing this up. That made me think more about the naming for the new utilization metrics. Since plural …
Note that we cannot have …
I am also OK with those names. If we adopted that convention, the list of new metrics would be: …
The main reason I brought it up is that there is already a way to collect the usage and the limits, so it's quite easy to calculate the desired data in the backend storing the metrics. Would it be worth pre-calculating this, given that these metrics will be produced for all pods/containers and there will be quite a few of them?
@jmichalek132 you're right that backends could calculate these metrics (assuming they can do some arithmetic, which most can). But in order for the backend to be the source of these metrics you would need the k8sclusterreceiver, which is a separate receiver with specific deployment requirements. I don't like the idea of setting a precedent that a metric that can be calculated in receiver A should not be included, even as optional, because a combination of metrics from receiver A and some other receiver B could produce it.

Users should be able to look to the kubeletstats receiver for their pod and container metric needs. It is reasonable for users to start off with only the kubeletstats receiver and not the k8scluster receiver, especially since introducing the k8sclusterreceiver means introducing a second deployment.
**Description:** Add metadata map for pod and container requests and limits. Will be used to calculate new metrics in a future PR.
**Link to tracking Issue:** #24905
**Testing:** Added unit tests.
**Description:** Adds new `k8s.pod.memory.utilization` and `container.memory.utilization` metrics that represent the ratio of memory used vs. the limits set. The metrics are only emitted for pods/containers that have defined resource limits. It takes advantage of the pod metadata to acquire the container limits. A pod limit is computed as the sum of all the container limits, and if any container limit is zero or undefined the pod limit is also considered undefined.
**Link to tracking Issue:** Related to #24905
**Testing:** Unit tests and local testing.
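A minimal sketch of the pod-limit rule described in that PR (the `podMemoryLimit` helper and its inputs are hypothetical, not the receiver's actual implementation): the pod limit is the sum of its container limits, and it is treated as undefined as soon as any container has a zero or missing limit.

```go
package main

import "fmt"

// podMemoryLimit sums per-container memory limits. If any container has an undefined
// or zero limit, the pod-level limit is considered undefined (ok=false) and no
// pod-level utilization metric would be emitted.
func podMemoryLimit(containerLimits []int64) (total int64, ok bool) {
	if len(containerLimits) == 0 {
		return 0, false
	}
	for _, l := range containerLimits {
		if l <= 0 {
			return 0, false
		}
		total += l
	}
	return total, true
}

func main() {
	// Two containers with limits: the pod limit is their sum.
	if total, ok := podMemoryLimit([]int64{256 << 20, 512 << 20}); ok {
		fmt.Printf("pod memory limit: %d bytes\n", total)
	}
	// One container has no limit: the pod limit is undefined.
	if _, ok := podMemoryLimit([]int64{256 << 20, 0}); !ok {
		fmt.Println("pod memory limit undefined: a container has no limit set")
	}
}
```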
**Description:** Adds new CPU utilization metrics with respect to pod/container CPU limits and requests.
**Link to tracking Issue:** Closes #24905
**Testing:** Added new unit tests and tested locally.
… (open-telemetry#25901)
**Description:** Starts the name change process for `*.cpu.utilization` metrics.
**Link to tracking Issue:** Related to open-telemetry#24905, related to open-telemetry#27885
Component(s)
No response
Is your feature request related to a problem? Please describe.
The receiver reports the current memory and CPU usage for pods and containers, but unless you know the limits set for the containers in the pod, it is hard to determine whether that value is getting too high.
Describe the solution you'd like
A new set of pod and container metrics that represent the utilization of memory and CPU that the pod/container is consuming, based on the usage value and the configured limits.
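As a rough illustration of the shape of such a metric only (this is not how the kubeletstats receiver emits metrics internally, and the instrument name is simply the one discussed in this thread; the attribute value and usage/limit numbers are placeholders), a ratio gauge could be recorded with the OpenTelemetry Go metrics API like this:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	meter := otel.Meter("kubeletstats-utilization-example")

	// Gauge name taken from the discussion above; treated here as an example only.
	gauge, err := meter.Float64ObservableGauge(
		"k8s.pod.memory.utilization",
		metric.WithDescription("Pod memory usage as a ratio of the configured memory limit"),
	)
	if err != nil {
		log.Fatal(err)
	}

	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		usage, limit := 300.0*1024*1024, 512.0*1024*1024 // placeholder values
		if limit > 0 {
			// Only report the ratio when a limit is actually configured.
			o.ObserveFloat64(gauge, usage/limit,
				metric.WithAttributes(attribute.String("k8s.pod.name", "example-pod")))
		}
		return nil
	}, gauge)
	if err != nil {
		log.Fatal(err)
	}
}
```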