aws: Don't pile up successive full refreshes during AWS scaledowns #3797

bpineau · 2021-01-06T19:31:02Z

Force-refreshing everything on every DeleteNodes call causes slowdowns
and throttling on large clusters with many ASGs and a lot of activity.

That function might be called many times in a row during scale-down.
Each time, the forced refresh re-discovers all ASGs and all LaunchConfigurations,
then re-lists all instances from the discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances
concrete implementation will decrement the nodegroup size, and we can
schedule a grouped refresh for the next loop iteration.
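
The deferred refresh described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual patch: `asgCacheSketch`, `refreshNeeded`, `RefreshIfNeeded` and the `fullRefreshes` counter are hypothetical names, and the expensive re-discovery of ASGs, LaunchConfigurations and instances is replaced by a counter increment.

```go
package main

import (
	"fmt"
	"sync"
)

// asgCacheSketch is a hypothetical, heavily simplified stand-in for the
// AWS provider's ASG cache. It is not the real asgCache type.
type asgCacheSketch struct {
	mu            sync.Mutex
	sizes         map[string]int // ASG name -> cached node group size
	refreshNeeded bool           // set when a full refresh should happen later
	fullRefreshes int            // counts simulated full regenerations
}

func newASGCacheSketch(sizes map[string]int) *asgCacheSketch {
	return &asgCacheSketch{sizes: sizes}
}

// DeleteInstances decrements the cached size locally and only marks the
// cache as stale, instead of forcing a full refresh right away.
func (c *asgCacheSketch) DeleteInstances(asg string, n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.sizes[asg] -= n
	c.refreshNeeded = true
}

// RefreshIfNeeded is meant to run once per main loop iteration. However
// many deletions happened since the last call, it refreshes at most once.
func (c *asgCacheSketch) RefreshIfNeeded() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.refreshNeeded {
		return
	}
	c.fullRefreshes++ // placeholder for the expensive full re-listing
	c.refreshNeeded = false
}

func main() {
	c := newASGCacheSketch(map[string]int{"asg-a": 5, "asg-b": 3})

	// One scale-down pass touching several ASGs in a row.
	c.DeleteInstances("asg-a", 2)
	c.DeleteInstances("asg-b", 1)
	c.DeleteInstances("asg-a", 1)

	// Next loop iteration: a single grouped refresh.
	c.RefreshIfNeeded()
	fmt.Println("sizes:", c.sizes, "full refreshes:", c.fullRefreshes)
}
```

With this shape, three successive deletions cost one refresh instead of three, while the cached sizes stay accurate in between.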

As a later step, we can consider splitting the asgCache.generate() function
to support per-ASG refreshes (and maybe per-ASG cache TTLs + jitter, to
spread API calls). But this change should address the current issue for now.

gjtempleton · 2021-03-15T21:37:58Z

/assign @gjtempleton

gjtempleton · 2021-03-15T21:39:07Z

Thanks for the PR.

My only concern is whether we could somehow make the log messages clearer, in particular: https://github.com/kubernetes/autoscaler/pull/3797/files#diff-22984a3a02b16ff49b2a94a43b49f3aa61c856483c3612b06b69dd51e347746fL269 Though it's already a bit misleading.

bpineau · 2021-04-08T17:32:25Z

This might be clearer, though perhaps still not great (or do you have a suggestion, @gjtempleton?):

DeleteInstances was called: scheduling an ASG list refresh for next accesses

gjtempleton · 2021-04-16T16:58:29Z

@bpineau that reads far better to me. One minor nit, maybe?:

DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation

Force-refreshing everything on every DeleteNodes call causes slowdowns and throttling on large clusters with many ASGs (and a lot of activity). That function might be called several times in a row during scale-down (once for each ASG having a node to be removed). Each time, the forced refresh re-discovers all ASGs and all LaunchConfigurations, then re-lists all instances from the discovered ASGs. That immediate refresh isn't required anyway, as the cache's DeleteInstances concrete implementation will decrement the nodegroup size, and we can schedule a grouped refresh for the next loop iteration.

bpineau · 2021-04-19T13:12:22Z

Thanks, that's better indeed! Updated accordingly.

gjtempleton · 2021-05-03T21:51:44Z

/lgtm
/approve

k8s-ci-robot · 2021-05-03T21:52:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, gjtempleton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/aws/OWNERS~~ [gjtempleton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes (the PR's author has signed the CNCF CLA) and size/XS (a PR that changes 0-9 lines, ignoring generated files) labels Jan 6, 2021

k8s-ci-robot requested review from feiskyer and towca January 6, 2021 19:31

bpineau force-pushed the aws-not-refreshes-dogpiles branch from 0f745a5 to 09ab1ea Compare January 7, 2021 11:49

k8s-ci-robot assigned gjtempleton Mar 15, 2021

bpineau changed the title ~~Don't pile up successive full refreshes during AWS scaledowns~~ aws: Don't pile up successive full refreshes during AWS scaledowns Apr 9, 2021

bpineau force-pushed the aws-not-refreshes-dogpiles branch from 09ab1ea to 037dc73 Compare April 19, 2021 13:12

jbartosik added the area/cluster-autoscaler label Apr 23, 2021

k8s-ci-robot added the lgtm label ("Looks good to me": the PR is ready to be merged) May 3, 2021

k8s-ci-robot added the approved label (the PR has been approved by an approver from all required OWNERS files) May 3, 2021

k8s-ci-robot merged commit 6c4101b into kubernetes:master May 3, 2021

evansheng pushed a commit to airbnb/autoscaler that referenced this pull request Mar 24, 2022

Merge pull request kubernetes#3797 from DataDog/aws-not-refreshes-dog…

5727632

…piles aws: Don't pile up successive full refreshes during AWS scaledowns

jiancheung pushed a commit to airbnb/autoscaler that referenced this pull request Jul 29, 2022

Merge pull request kubernetes#3797 from DataDog/aws-not-refreshes-dog…

464ecd7

…piles aws: Don't pile up successive full refreshes during AWS scaledowns

akirillov pushed a commit to airbnb/autoscaler that referenced this pull request Oct 27, 2022

Merge pull request kubernetes#3797 from DataDog/aws-not-refreshes-dog…

046f92c

…piles aws: Don't pile up successive full refreshes during AWS scaledowns

akirillov mentioned this pull request Oct 27, 2022

cluster autoscaler patch cluster autoscaler 1.21.3 airbnb0 airbnb/autoscaler#29

Merged
