
Early abort if AWS node group has no capacity #4489

Conversation

@drmorr0 (Contributor) commented Nov 30, 2021

Summary

This change makes Cluster Autoscaler abort early when AWS has no capacity available for a node group (for example, a spot node group for which no spot instances are available). It works by polling the DescribeScalingActivities endpoint of the AWS EC2 Auto Scaling API and setting the node group status to an error if a scaling activity is reported as Failed. This short-circuits the 15-minute waiting period for nodes to join the cluster before the node group is added to the backoff list, which lets Cluster Autoscaler move on more quickly to a different node group that is more likely to succeed.
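
For illustration, a minimal sketch of that check against the aws-sdk-go v1 autoscaling package is shown below; it is not the PR's exact code, and the function name and error wording are placeholders:

// Minimal sketch of the capacity check described above; not the PR's exact code.
package sketch

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// latestActivityFailed returns an error if the most recent scaling activity
// for the given ASG reports StatusCode "Failed" (for example, because no
// spot capacity was available).
func latestActivityFailed(svc *autoscaling.AutoScaling, asgName string) error {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int64(1), // only the most recent activity
	})
	if err != nil {
		return err
	}
	if len(out.Activities) > 0 && aws.StringValue(out.Activities[0].StatusCode) == "Failed" {
		return fmt.Errorf("scaling activity for %s failed: %s",
			asgName, aws.StringValue(out.Activities[0].StatusMessage))
	}
	return nil
}

When such an error is reported, the node group can be put into backoff immediately rather than waiting for the 15-minute unregistered-node timeout to expire.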

Notes for reviewer:

  • Recommend reviewing the commits separately
  • All existing tests pass
  • New unit tests for the new behaviour
  • Tested in a test cluster at Airbnb (see below for details)

Testing details

I created 3 c5ad.16xlarge ASGs in test-mc-a and tried to scale up a test deployment in the cluster that had a "required during scheduling" node affinity for c5ad.16xlarge. The node group quickly ran out of capacity, and Cluster Autoscaler then deleted all of the "upcoming" nodes from the node group, as you can see in the following series of graphs:

[Screenshot, 2021-11-22 3:08 PM: node group scale-up and node count graphs]

Specifically, note that at multiple points in the time period, it tries to scale up, creates a bunch of un-registered nodes (yellow bars in the bottom graph), and then deletes them shortly thereafter. We can see from the following graphs that the reason for the failed scale-up is placeholder-cannot-be-fulfilled:

[Screenshot, 2021-11-22 3:09 PM: failed scale-up reason graphs]

We can also look at the logs for this time period:

I1122 17:02:51.876677       1 auto_scaling_groups.go:432] Instance group kubernetes-minion-test-mc-a-amd-asg-us-east-1a has only 0 instances created while requested count is 14. Creating placeholder instances.
...
W1122 17:02:51.935464       1 auto_scaling_groups.go:440] Instance group kubernetes-minion-test-mc-a-amd-asg-us-east-1a cannot provision any more nodes!
...
W1122 17:02:52.094981       1 clusterstate.go:281] Disabling scale-up for node group kubernetes-minion-test-mc-a-amd-asg-us-east-1b until 2021-11-22 17:07:50.399430339 +0000 UTC m=+48821.870874859; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fullfilled
...
I1122 17:02:52.095011       1 clusterstate.go:1043] Failed adding 14 nodes (14 unseen previously) to group kubernetes-minion-test-mc-a-amd-asg-us-east-1a due to OutOfResource.placeholder-cannot-be-fullfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
...
I1122 17:02:52.095226       1 auto_scaling_groups.go:287] instance i-placeholder-kubernetes-minion-test-mc-a-amd-asg-us-east-1b-0 is detected as a placeholder, decreasing ASG requested size instead of deleting instance
...
W1122 17:03:09.808770       1 scale_up.go:383] Node group kubernetes-minion-test-mc-a-amd-asg-us-east-1b is not ready for scaleup - backoff

Here we can see the node group errors out and is added to the backoff list, and the placeholder nodes are deleted, as desired.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 30, 2021
@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from c9cfaf8 to ffdcbeb on November 30, 2021 18:45
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 30, 2021
@gjtempleton (Member):

/area provider/aws

@drmorr0 (Contributor, Author) commented Feb 7, 2022

@feiskyer @aleksandra-malinowska hi, any chance we could get some eyes on this?

@gjtempleton (Member):

/assign @gjtempleton

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 22, 2022
@gjtempleton (Member) left a review:

One question for you, but other than that this looks good to merge to me.

}

if len(response.Activities) > 0 {
	activity := response.Activities[0]
@gjtempleton (Member) commented Feb 28, 2022

Do we only want to check the latest activity? Doesn't this risk missing failures that have happened since the last time the ASG cache was updated? For instance, if an instance has been successfully taken out of service due to an instance rebalance recommendation in the same ASG that is unable to fulfill new spot requests, that successful activity would be the most recent one.

@drmorr0 (Contributor, Author) replied:

Ahh good point. Let me see what happens here/if there's an easy way to resolve it.

@drmorr0 (Contributor, Author) replied:

I just turned this into a loop that iterates over all the scaling activities since the ASG was updated, and early-aborts if any of them failed.
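
Roughly, the check now looks something like the sketch below (illustrative only, not the PR's exact code; the helper name is made up and the cutoff is assumed to be the time the ASG cache was last refreshed):

// Illustrative sketch of the loop-based check; the helper name is made up
// and the cutoff is assumed to be the last ASG cache refresh time.
package sketch

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// anyFailedActivitySince reports whether any scaling activity on the ASG has
// failed since the given cutoff, returning its status message if so.
func anyFailedActivitySince(svc *autoscaling.AutoScaling, asgName string, since time.Time) (bool, string, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return false, "", err
	}
	for _, activity := range out.Activities {
		// Activities are returned most recent first; stop once we reach
		// activities that started before the cutoff.
		if activity.StartTime != nil && activity.StartTime.Before(since) {
			break
		}
		if aws.StringValue(activity.StatusCode) == "Failed" {
			return true, aws.StringValue(activity.StatusMessage), nil
		}
	}
	return false, "", nil
}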

I'm a little bit worried about this approach, for example if something that wasn't a scale-up attempt failed -- but the only way I can think of to work around this would be text-matching against the StatusMessage response from DescribeScalingActivities, which feels like it could be super-brittle if, for example, AWS decides to change that text slightly.

@gjtempleton WDYT?

Another commenter:

I'm just an eager potential user of this feature, but from my perspective it doesn't really matter why an ASG had a failure: the ASG is still unhealthy. As long as you're checking, as you said, since the ASG cache was updated, I think it's going to be a win. If the error was InsufficientInstanceCapacity then it's the target situation, but if it's something like InstanceLimitExceeded or a bad image / launch template misconfiguration then the ASG is unhealthy anyway and should be marked as such. Ideally the reason would be logged, but that's just a bonus for me. The main caveat is that AWS doesn't publish a list of potential failure scenarios, but I think a failure is just a failure.

Thank you for making this!

@drmorr0 (Contributor, Author) replied:

Yes, I think I agree with you. The only case where I would be worried is if there was a Failure on scaling down, but it could still somehow scale up? I don't even know if that's a thing that can happen.

Anyways, logging the failed scaling activity is a great suggestion, I added that.

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from d09a4b9 to aebd984 on March 2, 2022 19:43
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2022
@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from 6ecee15 to ad93c8b on March 3, 2022 18:42
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2022
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: drmorr0, gjtempleton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit dece0b2 into kubernetes:master Mar 3, 2022
evansheng pushed a commit to airbnb/autoscaler that referenced this pull request Mar 24, 2022
…-aws-node-group-no-capacity

Early abort if AWS node group has no capacity
@ddelange

hi @drmorr0 👋

We observe this activity fairly often (GPUs in eu-central-1c are scarce), but the cluster-autoscaler (using the 9.19.0 helm chart, running k8s 1.23.7) still patiently waits where I expected the early abort from this PR. Is it related to #4900?

$ aws autoscaling describe-scaling-activities
        {
            "ActivityId": "5856059a-a4b2-f9f8-cdc4-df643e82e519",
            "AutoScalingGroupName": "spot-amd-16GiB-4vCPU-1GPU-eu-central-1c",
            "Description": "Launching a new EC2 instance.  Status Reason: Could not launch Spot Instances. InsufficientInstanceCapacity - We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (eu-central-1c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1b. Launching EC2 instance failed.",
            "Cause": "At 2022-06-15T09:19:30Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 2.",
            "StartTime": "2022-06-15T09:19:32.030Z",
            "EndTime": "2022-06-15T09:19:32Z",
            "StatusCode": "Failed",
            "StatusMessage": "Could not launch Spot Instances. InsufficientInstanceCapacity - We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (eu-central-1c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1b. Launching EC2 instance failed.",
            "Progress": 100,
            "Details": "{\"Subnet ID\":\"subnet-7f488433\",\"Availability Zone\":\"eu-central-1c\"}",
            "AutoScalingGroupARN": "arn:aws:autoscaling:eu-central-1:570803259138:autoScalingGroup:69047ed0-0ec6-4261-ac8c-d4a6d366394f:autoScalingGroupName/spot-amd-16GiB-4vCPU-1GPU-eu-central-1c"
        },

@gjtempleton (Member):

@ddelange What version of the CA image are you running? Are you overriding the image tag in the helm chart?

@ddelange commented Jun 15, 2022

not overriding the image tag, so it's currently running k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

edit: ah thanks, I guess I need to override with 1.24? Any particular reason it's not the default yet that I need to account for?

@gjtempleton (Member):

Which default are you referring to? That this isn't the behaviour in all versions of the CA, or that 1.24.0 isn't the default image tag in the helm chart yet?

As you've seen, that v1.23.0 tag won't include this functionality; it's only included from v1.24.0 onwards. As it's new functionality rather than a bug fix, it hasn't been cherry-picked back to previous minor release branches.

@ddelange

thanks for the replies, understood!

Re: "that 1.24.0 isn't the default image tag in the helm chart yet"

this one was a slight surprise but no problem of course

jiancheung pushed a commit to airbnb/autoscaler that referenced this pull request Jul 29, 2022
…-aws-node-group-no-capacity

Early abort if AWS node group has no capacity
akirillov pushed a commit to airbnb/autoscaler that referenced this pull request Oct 27, 2022
…-aws-node-group-no-capacity

Early abort if AWS node group has no capacity
akirillov pushed a commit to airbnb/autoscaler that referenced this pull request Nov 2, 2022
…-aws-node-group-no-capacity

Early abort if AWS node group has no capacity
bcostabatista pushed a commit to airbnb/autoscaler that referenced this pull request Nov 7, 2022
…-aws-node-group-no-capacity

Early abort if AWS node group has no capacity
Labels
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files)
  • area/cluster-autoscaler
  • area/provider/aws (Issues or PRs related to aws provider)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)
6 participants