Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rate limiter to the Vertical Pod autoscaler updater component #2326

Merged
merged 1 commit into from
Oct 25, 2019

Conversation

guillaumebreton
Copy link
Contributor

This patch adds a rate limiter to the vertical pod autoscaler updater.
It adds two flags

  • eviction-rate-limit to control the number of pods that can be evicted
    every seconds.
  • eviction-rate-limit-burst to control the burst of that can be evicted
    immediately.

fixes #2178

@k8s-ci-robot
Copy link
Contributor

Welcome @guillaumebreton!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 11, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Sep 11, 2019
@guillaumebreton guillaumebreton marked this pull request as ready for review September 11, 2019 14:01
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 11, 2019
@bskiba
Copy link
Member

bskiba commented Sep 16, 2019

@guillaumebreton thanks! This looks ok, I have some minor comments but first I wanted to clarify that this is what we want.
@milesbxf does global setting like this solve your issues is there need for more fine grained control (for example ability to set this per-deployment)?

@@ -144,6 +149,11 @@ func (u *updater) RunOnce() {
if !evictionLimiter.CanEvict(pod) {
continue
}
err := u.evictionRateLimiter.Wait(ctx)
Copy link

@bandesz bandesz Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait always asks for one token, so it doesn't make much sense to set the burst rate to anything else than 1. If I'm correct then we should remove the eviction-rate-limit-burst parameter.

Sorry, the golang.org/x/time/rate documentation is not really clear. Looking at the code burst is also the maximum number of tokens you can accrue over time. This is important as VPA couldn't suddenly evict lots of pods after not evicting anything for a long time.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without knowing the autoscaler well I'm a little bit worried if Wait will block here for long time (e.g. minutes). The state of the cluster might change significantly during that time.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An idea for solving the Wait()/blocking problem:

  1. Let's collect all pods to be evicted in a list
  2. Randomize the list (to give all pods a fair chance to be evicted in a RunOnce run)
  3. Go through the list and evict a pod if u.evictionRateLimiter.Allow() returns true.
  4. If u.evictionRateLimiter.Allow() returns false then return from the RunOnce function. Any remaining pods to be evicted will be retried in the next RunOnce run

Copy link
Contributor Author

@guillaumebreton guillaumebreton Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your input 🙂

I'm not really sure why the Wait is an issue here :

  • A context with timeout it passed to the Wait function so the Wait will be cancelled if the delay is larger than the tick interval.
  • As far as I understand it. The state of the cluster is updated by the recommender, and the admission controller applies it. The updater is only responsible for eviction here, and when the pod restarts, the last value of the VPA are applied by the admission controller.

In your proposed solution, we would hit the rate limiter quickly and exit. So we would "waste" n seconds of the ticker loop doing nothing. Running on fresher data is an excellent idea, but the only option I see would be to refresh the pod list between every Wait, which doesn't seem ideal to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default tick for the updater is 1 minute, so as @guillaumebreton writes, the Wait will be canceled after that time at most (unless someone uses a custom tick). So we can basically look at info that is stale up to 1 minute, which is also true if updater has a lot of work to do. I'm inclined to agree this solution is acceptable.

As to randomizing the list of pods to evict - the updater has a priority mechanism to select which pods to evict first based on multiple factors (including how far away they are from their recommended resources). Shuffling the list wouldn't play well with that.

@milesbxf
Copy link

@guillaumebreton thanks! This looks ok, I have some minor comments but first I wanted to clarify that this is what we want.
@milesbxf does global setting like this solve your issues is there need for more fine grained control (for example ability to set this per-deployment)?

Yes, it does! 🙌 FYI @guillaumebreton and I are on the same team, so we've spoken about this offline 🙂

@bskiba
Copy link
Member

bskiba commented Sep 20, 2019

@milesbxf Thanks! In that case I'll get back to this early next week, appreciate your patience I am a bit short on time atm :)

@guillaumebreton
Copy link
Contributor Author

@bskiba No worries 👍 Thank you for reviewing it.

func getRateLimiter(evictionRateLimit float64, evictionRateLimitBurst int) *rate.Limiter {
var evictionRateLimiter *rate.Limiter
if evictionRateLimit == -1 || evictionRateLimit == 0 {
evictionRateLimiter = rate.NewLimiter(rate.Inf, evictionRateLimitBurst)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't dived into the rate package docs, is this equivalent to having no rateLimiter at all? i.e. will the Wait() return immedately?

Copy link
Member

@bskiba bskiba Oct 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is the Burst completely ignored?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, if we set the rate limite to rate.Inf the burst rate is completely ignored, and Wait will return immediately.
-> Documentation about burst being ignored if rate equals to Rate.inf https://github.com/golang/time/blob/master/rate/rate.go#L37
-> Wait() returning immediately : https://github.com/golang/time/blob/master/rate/rate.go#L307

}{
{0.0, 1, rate.NewLimiter(rate.Inf, 1)},
{-1.0, 2, rate.NewLimiter(rate.Inf, 2)},
{10.0, 3, rate.NewLimiter(rate.Every(time.Duration(1.0/10*float64(time.Second))), 3)},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From @bandesz's comment above, looks like this can be simplified.

@bskiba
Copy link
Member

bskiba commented Oct 3, 2019

Thanks for your patience and apologies for the delay.

I've added some comments. My main concern is to make sure that the current behavior stays the same (i.e. if we are not using the rate limiter). Other comments are mostly nit's

@guillaumebreton guillaumebreton force-pushed the eviction-rate-limiter branch 2 times, most recently from 3e88a62 to e380765 Compare October 9, 2019 14:48
@guillaumebreton guillaumebreton requested a review from bskiba October 9, 2019 15:17
Copy link
Member

@bskiba bskiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, only minor comments left. Can you address them and squash your commits into one.

Thanks for answering my questions and getting this done!

vertical-pod-autoscaler/pkg/updater/main.go Outdated Show resolved Hide resolved
vertical-pod-autoscaler/pkg/updater/main.go Outdated Show resolved Hide resolved
vertical-pod-autoscaler/pkg/updater/main.go Outdated Show resolved Hide resolved
vertical-pod-autoscaler/pkg/updater/logic/updater.go Outdated Show resolved Hide resolved
vertical-pod-autoscaler/pkg/updater/logic/updater.go Outdated Show resolved Hide resolved
@guillaumebreton
Copy link
Contributor Author

@bskiba What would be the next step with this PR ? Should I ask @jbartosik and @mwielgus for a review ?

@bskiba
Copy link
Member

bskiba commented Oct 22, 2019

I'll review tomorrow, sorry for the wait!

@guillaumebreton
Copy link
Contributor Author

@bskiba No worries :) Github was showing you already approved it so I wasn't sure

`Number of pods that can be evicted per seconds. A rate limit set to 0 or -1 will disable
the rate limiter.`)

evictionRateBurst = flag.Int("eviction-rate-limit-burst", 1, `Burst of pods that can be evicted.`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also rename the flag to eviction-rate-burst?

@bskiba
Copy link
Member

bskiba commented Oct 24, 2019

One last small comment (sorry, I missed this in the previous review :( )

This patch adds a rate limiter to the vertical pod autoscaler updater.
It adds two flags
- eviction-rate-limit to control the number of pods that can be evicted
every seconds.
- eviction-rate-limit-burst to control the burst of that can be evicted
immediately.
@guillaumebreton
Copy link
Contributor Author

@bskiba Thank you 👍 I missed it and I fixed it.

@bskiba
Copy link
Member

bskiba commented Oct 25, 2019

Thanks!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bskiba

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2019
@k8s-ci-robot k8s-ci-robot merged commit 0e425f3 into kubernetes:master Oct 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vertical-pod-autoscaler: rate limit pod evictions
6 participants