
General discussion: Limit scaling and performance throttling #3091

Closed
dharmab opened this issue Apr 24, 2020 · 9 comments
Labels
  • area/vertical-pod-autoscaler
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dharmab
Contributor

dharmab commented Apr 24, 2020

I'd like to share some experiences my team and organization had with the upgrade from VPA 0.4.0 to newer versions which implement proportional limit scaling.

We were on VPA 0.4.0 for about a year, and had a lengthy transition period to allow teams to switch to the new API version. When testing a newer version of VPA, we saw severe performance degradation for some applications compared to VPA 0.4.0. The investigation found that the simple proportional limit scaling in VPA was scaling Pod limits too low, causing CPU throttling.

Essentially, our org had a lot of deployed Pods whose owners had configured very simple resource requests and limits, e.g. a request of 1 CPU and a limit of 2 CPU. The Pod owners then configured VPA 0.4.0 to autoscale the request to something more reasonable like 0.3 CPU while keeping the limit at 2 CPU. In practice, their app actually burst to 1.5+ CPU for short periods but held at the 0.3 CPU steady state most of the time. This worked well with VPA 0.4.0, but the newer version of VPA proportionally scaled the limit down to 0.6 CPU, causing throttling and performance problems.
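To make the numbers concrete, here's roughly what we saw, written as three snippets of a container's resources section (the exact values are illustrative, but the ratios match our case):

```yaml
# What the Pod owner originally wrote: the limit is 2x the request.
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"
---
# Under VPA 0.4.0: only the request is rewritten and the limit is left alone,
# so short bursts to 1.5+ CPU still fit under the 2 CPU limit.
resources:
  requests:
    cpu: 300m   # VPA recommendation
  limits:
    cpu: "2"
---
# Under the newer VPA with proportional limit scaling: the 2x ratio is kept,
# so the limit drops to 600m and the 1.5 CPU bursts are throttled.
resources:
  requests:
    cpu: 300m
  limits:
    cpu: 600m
```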

If this were a smaller Kubernetes cluster, we could have increased the CPU limit on the deployed services to match the correct ratio. However, manually setting the CPU and memory limits on thousands of Deployments and StatefulSets across dozens of clusters supporting an organization of tens of thousands of people is not a scalable workaround for us.

We worked around the issue by creating a (quick and dirty) patch for VPA to disable limit scaling. However, we'd like to discuss better solutions:

  • Could VPA's limit scaling be improved to set limits more accurately? Instead of a simple proportional scale, could the limit be set high enough to still allow the Pod to burst?
  • Could the VPA object allow limit scaling to be a configurable option for specific Pods?
@dharmab dharmab changed the title from "General discussion: Limit scaling" to "General discussion: Limit scaling and performance throttling" on Apr 24, 2020
@bskiba
Member

bskiba commented May 7, 2020

Thanks for the detailed description.

#3028 should cover turning off limit scaling for pods, which would solve your immediate problem.
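The opt-out would look roughly like this on the VPA object (the field names here are tentative until that PR merges, so treat this as a sketch rather than the final API):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa              # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Only requests would be actuated; owner-specified limits stay untouched.
        controlledValues: RequestsOnly
```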
For improved limit scaling: this is certainly an interesting problem and a great potential feature. Do you have any suggestions on what kind of algorithm would work for your use case? For example, the CPU request recommendation currently targets the 85th percentile of usage; if the limit targeted the 99th percentile instead, would that work for you?

@bskiba bskiba added the kind/feature and area/vertical-pod-autoscaler labels on May 7, 2020
@dharmab
Contributor Author

dharmab commented May 7, 2020

Hi @bskiba,

Glad to hear that limit scaling will be a configurable option.

I'd have to check with our metrics and engineers, but I believe some of our apps spend only 200-300 ms at a time CPU-pinned. Because they're part of a latency-sensitive flow, those apps do in fact need as much CPU as possible during that very small window. On the other hand, other apps are less latency sensitive and would be fine being limited to 99.9% of their peak utilization.

Perhaps a range of recommendations and a configurable peak would be a flexible solution?
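Purely as an illustration of the shape I have in mind (none of these fields exist in the VPA API; they're made up for this sketch):

```yaml
# Hypothetical per-container policy -- these fields do not exist today.
resourcePolicy:
  containerPolicies:
    - containerName: "*"
      # Requests keep targeting the steady-state percentile...
      requestTargetPercentile: 85
      # ...while the limit targets a configurable peak, so latency-sensitive
      # apps could push this to 100 (or disable limit scaling entirely).
      limitTargetPercentile: 99.9
```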

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on Aug 5, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 4, 2020
@yashbhutwala
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label on Sep 4, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on Dec 3, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 2, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
