
General discussion: Limit scaling and performance throttling #3091

Closed
dharmab opened this issue Apr 24, 2020 · 9 comments
Labels
  • area/vertical-pod-autoscaler
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dharmab
Contributor

dharmab commented Apr 24, 2020

I'd like to share some experiences my team and organization had with the upgrade from VPA 0.4.0 to newer versions which implement proportional limit scaling.

We were on VPA 0.4.0 for about a year, and had a lengthy transition period to allow teams to switch to the new API version. When testing a newer version of VPA, we saw severe performance degradation for some applications compared to VPA 0.4.0. The investigation found that the simple proportional limit scaling in VPA was scaling Pod limits too low, causing CPU throttling.

Essentially, our org had a lot of deployed Pods whose owners had configured very simple resource requests and limits, e.g. a request of 1 CPU and a limit of 2 CPU. The Pod owners then configured VPA 0.4.0 to autoscale the request to something more reasonable like 0.3 CPU while keeping the limit at 2 CPU. In practice, their app actually burst to 1.5+ CPU for short periods but held at the 0.3 CPU steady state most of the time. This worked well with VPA 0.4.0, but the newer version of VPA proportionally scaled the limit down to 0.6 CPU, causing throttling and performance problems.
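To make the numbers concrete, here's roughly what we saw, written as three snippets of a container's resources section (the exact values are illustrative, but the ratios match our case):

```yaml
# What the Pod owner originally wrote: the limit is 2x the request.
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"
---
# Under VPA 0.4.0: only the request is rewritten and the limit is left alone,
# so short bursts to 1.5+ CPU still fit under the 2 CPU limit.
resources:
  requests:
    cpu: 300m   # VPA recommendation
  limits:
    cpu: "2"
---
# Under the newer VPA with proportional limit scaling: the 2x ratio is kept,
# so the limit drops to 600m and the 1.5 CPU bursts are throttled.
resources:
  requests:
    cpu: 300m
  limits:
    cpu: 600m
```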

If this were a smaller Kubernetes cluster, we could have increased the CPU limit on the deployed services to match the correct ratio. However, manually setting the CPU and memory limits on thousands of Deployments and StatefulSets across dozens of clusters supporting an organization of tens of thousands of people is not a scalable workaround for us.

We worked around the issue by creating a (quick and dirty) patch for VPA to disable limit scaling. However, we'd like to discuss better solutions:

  • Could VPA's limit scaling be improved to set limits more accurately? Instead of a simple proportional scale, could the limit be set high enough to still allow the Pod to burst?
  • Could the VPA object allow limit scaling to be a configurable option for specific Pods?
@dharmab dharmab changed the title from "General discussion: Limit scaling" to "General discussion: Limit scaling and performance throttling" on Apr 24, 2020
@bskiba
Member

bskiba commented May 7, 2020

Thanks for the detailed description.

#3028 should cover turning off limit scaling for pods, which would solve your immediate problem.
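The opt-out would look roughly like this on the VPA object (the field names here are tentative until that PR merges, so treat this as a sketch rather than the final API):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa              # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Only requests would be actuated; owner-specified limits stay untouched.
        controlledValues: RequestsOnly
```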
For improved limit scaling: this is certainly an interesting problem and a great potential feature. Do you have any suggestions on what kind of algorithm would work for your use case? For example, the CPU request recommendation currently targets the 85th percentile of usage; if the limit targeted the 99th percentile instead, would that work for you?

@bskiba bskiba added the kind/feature and area/vertical-pod-autoscaler labels on May 7, 2020
@dharmab
Contributor Author

dharmab commented May 7, 2020

Hi @bskiba,

Glad to hear that limit scaling will be a configurable option.

I'd have to check with our metrics and engineers, but I believe some of our apps spend only 200-300 ms at a time CPU-pinned. Because they're part of a latency-sensitive flow, those apps do in fact need as much CPU as possible during that very small window. On the other hand, other apps are less latency sensitive and would be fine being limited to 99.9% of their peak utilization.

Perhaps a range of recommendations and a configurable peak would be a flexible solution?
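Purely as an illustration of the shape I have in mind (none of these fields exist in the VPA API; they're made up for this sketch):

```yaml
# Hypothetical per-container policy -- these fields do not exist today.
resourcePolicy:
  containerPolicies:
    - containerName: "*"
      # Requests keep targeting the steady-state percentile...
      requestTargetPercentile: 85
      # ...while the limit targets a configurable peak, so latency-sensitive
      # apps could push this to 100 (or disable limit scaling entirely).
      limitTargetPercentile: 99.9
```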

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on Aug 5, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 4, 2020
@yashbhutwala
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label on Sep 4, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on Dec 3, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 2, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
