Early abort if AWS node group has no capacity #4489
Conversation
Force-pushed from c9cfaf8 to ffdcbeb
/area provider/aws
@feiskyer @aleksandra-malinowska hi, any chance we could get some eyes on this?
/assign @gjtempleton
One question for you, but other than that this looks good to merge to me.
}

if len(response.Activities) > 0 {
    activity := response.Activities[0]
Do we only want to check the latest activity? Doesn't this potentially risk missing activities that have happened since the last time the ASG cache was updated if (for instance) an instance has been successfully taken out of service due to an instance rebalance recommendation in the same ASG that is unable to fulfill new spot requests?
Ahh good point. Let me see what happens here/if there's an easy way to resolve it.
I just turned this into a loop that iterates over all the scaling activities since the ASG was updated, and early-aborts if any of them failed.
I'm a little bit worried about this approach, for example if something that wasn't a scale-up attempt failed -- but the only way I can think of to work around this would be text-matching against the `StatusMessage` response from `DescribeScalingActivities`, which feels like it could be super-brittle if, for example, AWS decides to change that text slightly.
@gjtempleton WDYT?
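For concreteness, here is a minimal sketch of the loop described above, assuming the AWS SDK for Go v1 autoscaling client; the helper name and exact wiring are illustrative, not the PR's actual code:

```go
package example

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// hasRecentFailedActivity is a hypothetical helper (not the PR's actual code):
// it scans the ASG's scaling activities and reports whether any activity newer
// than `since` (e.g. the last ASG cache refresh) ended with StatusCode "Failed",
// returning the status message so the caller can log why it failed.
func hasRecentFailedActivity(svc *autoscaling.AutoScaling, asgName string, since time.Time) (bool, string, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return false, "", err
	}
	for _, activity := range out.Activities {
		// Ignore activities that started before the last cache update.
		if activity.StartTime == nil || activity.StartTime.Before(since) {
			continue
		}
		if aws.StringValue(activity.StatusCode) == "Failed" {
			return true, aws.StringValue(activity.StatusMessage), nil
		}
	}
	return false, "", nil
}
```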
I'm just an eager potential user of this feature, but from my perspective it doesn't really matter why an ASG had a failure -- the ASG is still unhealthy. As long as you're checking, as you said, since the ASG cache was updated, I think it's going to be a win. If the error was `InsufficientInstanceCapacity` then it's the target situation, but if it's something like `InstanceLimitExceeded` or a bad image / launch template misconfiguration, then honestly the ASG is unhealthy and it should be marked as such. Ideally the reason would be logged, but that's just a bonus for me. The main problem is that AWS doesn't publish a list of potential `Failure` scenarios, but I think a failure is just a failure?
Thank you for making this!
Yes, I think I agree with you. The only case where I would be worried is if there was a Failure on scaling down, but the ASG could still somehow scale up? I don't even know if that's a thing that can happen.
Anyway, logging the failed scaling activity is a great suggestion; I added that.
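Purely as an illustration of that suggestion, a call site built on the hypothetical helper sketched earlier might log the failure with klog (assuming `k8s.io/klog/v2`, which cluster-autoscaler uses); this is a sketch, not the PR's actual code:

```go
// Hypothetical call site: svc, asgName and lastRefresh come from the caller.
if failed, msg, err := hasRecentFailedActivity(svc, asgName, lastRefresh); err == nil && failed {
	// Surface the reason in the logs before marking the node group as errored,
	// so operators can see why the scale-up was aborted early.
	klog.Warningf("scaling activity for ASG %s failed: %s", asgName, msg)
}
```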
Force-pushed from d09a4b9 to aebd984
Force-pushed from 6ecee15 to ad93c8b
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: drmorr0, gjtempleton
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…-aws-node-group-no-capacity Early abort if AWS node group has no capacity
hi @drmorr0 👋 we observe this activity more often (GPUs in …):
$ aws autoscaling describe-scaling-activities
{
"ActivityId": "5856059a-a4b2-f9f8-cdc4-df643e82e519",
"AutoScalingGroupName": "spot-amd-16GiB-4vCPU-1GPU-eu-central-1c",
"Description": "Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. InsufficientInstanceCapacity - We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (eu-central-1c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1b. Launching EC2 instance failed.",
"Cause": "At 2022-06-15T09:19:30Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 2.",
"StartTime": "2022-06-15T09:19:32.030Z",
"EndTime": "2022-06-15T09:19:32Z",
"StatusCode": "Failed",
"StatusMessage": "Could not launch Spot Instances. InsufficientInstanceCapacity - We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (eu-central-1c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1b. Launching EC2 instance failed.",
"Progress": 100,
"Details": "{\"Subnet ID\":\"subnet-7f488433\",\"Availability Zone\":\"eu-central-1c\"}",
"AutoScalingGroupARN": "arn:aws:autoscaling:eu-central-1:570803259138:autoScalingGroup:69047ed0-0ec6-4261-ac8c-d4a6d366394f:autoScalingGroupName/spot-amd-16GiB-4vCPU-1GPU-eu-central-1c"
},
@ddelange What version of the CA image are you running? Are you overriding the image tag in the helm chart?
not overriding the image tag, so it's currently running the chart's default.
edit: ah thanks, I guess I need to override with 1.24? Any particular reason it's not the default yet that I need to account for?
Which …? As you've seen, that …
thanks for the replies, understood!
this one was a slight surprise but no problem of course
Summary
This change makes it so that Cluster Autoscaler will early-abort if AWS does not have any capacity available (e.g., it is a spot node group and there are no spot instances available). It works by polling the `DescribeScalingActivities` AWS Auto Scaling API endpoint and setting the node group status to an error if the response reports a failed scaling activity. This short-circuits the 15-minute waiting period for nodes to join the cluster before the node group is added to the backoff list, which allows Cluster Autoscaler to more rapidly try a different node group that is more likely to succeed.
Notes for reviewer:
Testing details
I created 3 `c5ad.16xlarge` ASGs in `test-mc-a`, and tried to scale up a test deployment in the cluster that had a "required during scheduling" node affinity for `c5ad.16xlarge`. The node group quickly ran out of capacity and then deleted all of the "upcoming" nodes from the node group, as you can see in the following series of graphs.
Specifically, note that at multiple points in the time period, it tries to scale up, creates a bunch of un-registered nodes (yellow bars in the bottom graph), and then deletes them shortly thereafter. We can see from the following graphs that the reason for the failed scale-up is `placeholder-cannot-be-fulfilled`.
We can also look at the logs for this time period:
Here we can see the node group errors out and is added to the backoff list, and the placeholder nodes are deleted, as desired.
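For context on the `placeholder-cannot-be-fulfilled` reason above, here is a hedged sketch of how a placeholder instance could be marked as unfulfillable using cluster-autoscaler's `cloudprovider` types; the error message text is illustrative and this is not necessarily the PR's exact wiring:

```go
package main

import (
	"fmt"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

func main() {
	// Sketch: a placeholder instance whose creation cannot be fulfilled gets an
	// out-of-resources error status, which lets the core autoscaler put the node
	// group into backoff without waiting out the full provision timeout.
	status := cloudprovider.InstanceStatus{
		State: cloudprovider.InstanceCreating,
		ErrorInfo: &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
			ErrorCode:    "placeholder-cannot-be-fulfilled",
			ErrorMessage: "AWS cannot provision any more instances for this node group", // illustrative text
		},
	}
	fmt.Printf("%+v\n", status)
}
```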