General discussion: Limit scaling and performance throttling #3091
Thanks for the detailed description. #3028 should address turning off limit scaling for pods, which would solve your immediate problem.
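For anyone finding this thread later: the per-pod opt-out discussed in #3028 corresponds, as far as I understand, to the `controlledValues` container resource policy available in newer VPA API versions. A minimal sketch, assuming a VPA release that supports that field (names are illustrative):

```yaml
# Sketch only: requires a VPA release that supports controlledValues
# (not available in 0.4.0); the object and target names are made up.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Only requests are autoscaled; limits in the pod spec are left untouched.
        controlledValues: RequestsOnly
```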
Hi @bskiba, glad to hear that limit scaling will be a configurable option. I'd have to check with our metrics and engineers, but I believe some of our apps might spend only 200-300 ms at a time CPU-pinned; because they're part of a latency-sensitive flow, those apps do in fact need as much CPU as possible during that very small time window. On the other hand, there are other, less latency-sensitive apps that would be fine being limited to 99.9% of their peak utilization. Perhaps a range of recommendations and a configurable peak would be a flexible solution?
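For what it's worth, a cluster-wide version of a "configurable peak" already exists as a recommender tuning knob; a per-VPA setting would be new API. A sketch of the recommender args, assuming the flag name used in recent vpa-recommender releases (image tag illustrative):

```yaml
# Sketch only: cluster-wide recommender tuning, not per-app.
# Flag name assumed from recent vpa-recommender releases.
containers:
  - name: recommender
    image: registry.k8s.io/autoscaling/vpa-recommender:0.x
    args:
      # Target a higher percentile of observed CPU usage so bursty,
      # latency-sensitive apps get recommendations closer to their peak.
      - --target-cpu-percentile=0.99
```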
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'd like to share some experiences my team and organization had with the upgrade from VPA 0.4.0 to newer versions which implement proportional limit scaling.
We were on VPA 0.4.0 for about a year, and had a lengthy transition period to allow teams to switch to the new API version. When testing a newer version of VPA, we saw severe performance degradation for some applications compared to VPA 0.4.0. The investigation found that the simple proportional limit scaling in VPA was scaling Pod limits too low, causing CPU throttling.
Essentially, our org had a lot of deployed Pods where the Pod owners had configured very simple requests and limits, e.g. a request of 1 CPU and a limit of 2 CPU. The Pod owners then configured VPA 0.4.0 to autoscale the request to something more reasonable like 0.3 CPU while keeping the limit at 2 CPU. In practice, their app was actually bursting to 1.5+ CPU for short periods, but held at the 0.3 CPU steady state most of the time. This worked well with VPA 0.4.0, but the newer version of VPA set a proportional limit of 0.6 CPU, causing throttling and performance problems.
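To make the arithmetic concrete: proportional limit scaling preserves the original limit-to-request ratio, so a 0.3 CPU recommendation against an original 1 CPU request / 2 CPU limit yields a new limit of 0.3 × (2 / 1) = 0.6 CPU. A sketch of the before/after container resources, using the numbers from the example above:

```yaml
# As configured by the pod owner (limit/request ratio = 2):
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"

# After a newer VPA applies a 0.3 CPU recommendation with proportional
# limit scaling: new limit = 0.3 * (2 / 1) = 0.6 CPU, well below the
# ~1.5 CPU bursts the app actually needs.
resources:
  requests:
    cpu: 300m
  limits:
    cpu: 600m
```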
If this were a smaller Kubernetes cluster, we could have increased the CPU limit on the deployed services to match the correct ratio. However, manually setting the CPU and memory limits on thousands of Deployments and StatefulSets across dozens of clusters supporting an organization of tens of thousands of people is not a scalable workaround for us.
We worked around the issue by creating a (quick and dirty) patch for VPA to disable limit scaling. However, we'd like to discuss better solutions: