Early abort if AWS node group has no capacity #4489
Conversation
Force-pushed from c9cfaf8 to ffdcbeb
/area provider/aws
@feiskyer @aleksandra-malinowska hi, any chance we could get some eyes on this?
/assign @gjtempleton
One question for you, but other than that this looks good to merge to me.
}

if len(response.Activities) > 0 {
    activity := response.Activities[0]
Do we only want to check the latest activity? Doesn't this potentially risk missing activities that have happened since the last time the ASG cache was updated if (for instance) an instance has been successfully taken out of service due to an instance rebalance recommendation in the same ASG that is unable to fulfill new spot requests?
Ahh good point. Let me see what happens here/if there's an easy way to resolve it.
I just turned this into a loop that iterates over all the scaling activities since the ASG was updated, and early-aborts if any of them failed.
I'm a little bit worried about this approach, for example if something that wasn't a scale-up attempt failed -- but the only way I can think of to work around this would be text-matching against the `StatusMessage` response from `DescribeScalingActivities`, which feels like it could be super-brittle if, for example, AWS decides to change that text slightly.
@gjtempleton WDYT?
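For concreteness, here is a minimal sketch of the loop described above, assuming the AWS SDK for Go v1 autoscaling client; the helper name and exact wiring are illustrative, not the PR's actual code:

```go
package example

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// hasRecentFailedActivity is a hypothetical helper (not the PR's actual code):
// it scans the ASG's scaling activities and reports whether any activity newer
// than `since` (e.g. the last ASG cache refresh) ended with StatusCode "Failed",
// returning the status message so the caller can log why it failed.
func hasRecentFailedActivity(svc *autoscaling.AutoScaling, asgName string, since time.Time) (bool, string, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return false, "", err
	}
	for _, activity := range out.Activities {
		// Ignore activities that started before the last cache update.
		if activity.StartTime == nil || activity.StartTime.Before(since) {
			continue
		}
		if aws.StringValue(activity.StatusCode) == "Failed" {
			return true, aws.StringValue(activity.StatusMessage), nil
		}
	}
	return false, "", nil
}
```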
I'm just an eager potential user of this feature, but from my perspective it doesn't really matter why an ASG had a failure -- the ASG is still unhealthy. As long as you're checking, as you said, since the ASG cache was updated, I think it's going to be a win. If the error was `InsufficientInstanceCapacity` then it's the target situation, but if it's something like `InstanceLimitExceeded` or a bad image / launch template misconfiguration, then honestly the ASG is unhealthy and it should be marked as such. Ideally the reason would be logged, but that's just a bonus for me. The main problem is that AWS doesn't publish a list of potential `Failure` scenarios, but I think a failure is just a failure?
Thank you for making this!
Yes, I think I agree with you. The only case where I would be worried is if there was a Failure on scaling down, but the ASG could still somehow scale up? I don't even know if that's a thing that can happen.
Anyway, logging the failed scaling activity is a great suggestion; I added that.
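Purely as an illustration of that suggestion, a call site built on the hypothetical helper sketched earlier might log the failure with klog (assuming `k8s.io/klog/v2`, which cluster-autoscaler uses); this is a sketch, not the PR's actual code:

```go
// Hypothetical call site: svc, asgName and lastRefresh come from the caller.
if failed, msg, err := hasRecentFailedActivity(svc, asgName, lastRefresh); err == nil && failed {
	// Surface the reason in the logs before marking the node group as errored,
	// so operators can see why the scale-up was aborted early.
	klog.Warningf("scaling activity for ASG %s failed: %s", asgName, msg)
}
```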
Force-pushed from d09a4b9 to aebd984
Force-pushed from 6ecee15 to ad93c8b
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: drmorr0, gjtempleton
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…-aws-node-group-no-capacity Early abort if AWS node group has no capacity
hi @drmorr0 👋 we observe this activity more often (GPUs in …):
$ aws autoscaling describe-scaling-activities
{
"ActivityId": "5856059a-a4b2-f9f8-cdc4-df643e82e519",
"AutoScalingGroupName": "spot-amd-16GiB-4vCPU-1GPU-eu-central-1c",
"Description": "Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. InsufficientInstanceCapacity - We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (eu-central-1c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1b. Launching EC2 instance failed.",
"Cause": "At 2022-06-15T09:19:30Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 2.",
"StartTime": "2022-06-15T09:19:32.030Z",
"EndTime": "2022-06-15T09:19:32Z",
"StatusCode": "Failed",
"StatusMessage": "Could not launch Spot Instances. InsufficientInstanceCapacity - We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (eu-central-1c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1a, eu-central-1b. Launching EC2 instance failed.",
"Progress": 100,
"Details": "{\"Subnet ID\":\"subnet-7f488433\",\"Availability Zone\":\"eu-central-1c\"}",
"AutoScalingGroupARN": "arn:aws:autoscaling:eu-central-1:570803259138:autoScalingGroup:69047ed0-0ec6-4261-ac8c-d4a6d366394f:autoScalingGroupName/spot-amd-16GiB-4vCPU-1GPU-eu-central-1c"
},
@ddelange What version of the CA image are you running? Are you overriding the image tag in the helm chart?
not overriding the image tag, so it's currently running the chart's default.
edit: ah thanks, I guess I need to override with 1.24? Any particular reason it's not the default yet that I need to account for?
Which …? As you've seen, that …
thanks for the replies, understood!
this one was a slight surprise but no problem of course
Summary
This change makes it so that Cluster Autoscaler will early-abort if AWS does not have any capacity available (e.g., it is a spot node group and there are no spot instances available). It works by polling the `DescribeScalingActivities` AWS Auto Scaling API endpoint and setting the node group status to an error if the response reports a failed scaling activity. This short-circuits the 15-minute waiting period for nodes to join the cluster before the node group is added to the backoff list, which allows Cluster Autoscaler to more rapidly try a different node group that is more likely to succeed.
Notes for reviewer:
Testing details
I created 3 `c5ad.16xlarge` ASGs in `test-mc-a`, and tried to scale up a test deployment in the cluster that had a "required during scheduling" node affinity for `c5ad.16xlarge`. The node group quickly ran out of capacity and then deleted all of the "upcoming" nodes from the node group, as you can see in the following series of graphs.
Specifically, note that at multiple points in the time period, it tries to scale up, creates a bunch of un-registered nodes (yellow bars in the bottom graph), and then deletes them shortly thereafter. We can see from the following graphs that the reason for the failed scale-up is `placeholder-cannot-be-fulfilled`.
We can also look at the logs for this time period:
Here we can see the node group errors out and is added to the backoff list, and the placeholder nodes are deleted, as desired.
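For context on the `placeholder-cannot-be-fulfilled` reason above, here is a hedged sketch of how a placeholder instance could be marked as unfulfillable using cluster-autoscaler's `cloudprovider` types; the error message text is illustrative and this is not necessarily the PR's exact wiring:

```go
package main

import (
	"fmt"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

func main() {
	// Sketch: a placeholder instance whose creation cannot be fulfilled gets an
	// out-of-resources error status, which lets the core autoscaler put the node
	// group into backoff without waiting out the full provision timeout.
	status := cloudprovider.InstanceStatus{
		State: cloudprovider.InstanceCreating,
		ErrorInfo: &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
			ErrorCode:    "placeholder-cannot-be-fulfilled",
			ErrorMessage: "AWS cannot provision any more instances for this node group", // illustrative text
		},
	}
	fmt.Printf("%+v\n", status)
}
```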