
[WIP] Early abort if AWS has no capacity #7

Closed

Conversation

@drmorr0 drmorr0 commented Nov 12, 2021

Summary

  • All existing tests pass
  • Recommend reviewing the two commits separately
  • New unit tests for the new behaviour
  • Tested in test-mc-a (see below for details)

Testing details

I created 3 c5ad.16xlarge ASGs in test-mc-a and tried to scale up a test deployment in the cluster that had a "required during scheduling" node affinity for c5ad.16xlarge. The node group quickly ran out of capacity, and the autoscaler then deleted all of the "upcoming" nodes from the node group, as you can see in the following series of graphs:
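For context, the scheduling constraint used in the test looked roughly like the following manifest (illustrative only; the names and replica count are assumptions, and the actual deployment used in test-mc-a may differ):

```yaml
# Hypothetical sketch of the test deployment's "required during scheduling"
# node affinity, pinning pods to c5ad.16xlarge nodes via the well-known
# instance-type label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-test
spec:
  replicas: 14
  selector:
    matchLabels:
      app: capacity-test
  template:
    metadata:
      labels:
        app: capacity-test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - c5ad.16xlarge
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
```

Because the affinity is "required", the scheduler can only place these pods on c5ad.16xlarge nodes, so the autoscaler keeps trying to scale up the matching ASGs until they run out of capacity.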

Screen Shot 2021-11-22 at 3 08 31 PM

Specifically, note that at multiple points in the time period, it tries to scale up, creates a bunch of unregistered nodes (yellow bars in the bottom graph), and then deletes them shortly thereafter. We can see from the following graphs that the reason for the failed scale-up is placeholder-cannot-be-fulfilled:

Screen Shot 2021-11-22 at 3 09 51 PM

We can also look at the logs for this time period:

I1122 17:02:51.876677       1 auto_scaling_groups.go:432] Instance group kubernetes-minion-test-mc-a-amd-asg-us-east-1a has only 0 instances created while requested count is 14. Creating placeholder instances.
...
W1122 17:02:51.935464       1 auto_scaling_groups.go:440] Instance group kubernetes-minion-test-mc-a-amd-asg-us-east-1a cannot provision any more nodes!
...
W1122 17:02:52.094981       1 clusterstate.go:281] Disabling scale-up for node group kubernetes-minion-test-mc-a-amd-asg-us-east-1b until 2021-11-22 17:07:50.399430339 +0000 UTC m=+48821.870874859; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fullfilled
...
I1122 17:02:52.095011       1 clusterstate.go:1043] Failed adding 14 nodes (14 unseen previously) to group kubernetes-minion-test-mc-a-amd-asg-us-east-1a due to OutOfResource.placeholder-cannot-be-fullfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
...
I1122 17:02:52.095226       1 auto_scaling_groups.go:287] instance i-placeholder-kubernetes-minion-test-mc-a-amd-asg-us-east-1b-0 is detected as a placeholder, decreasing ASG requested size instead of deleting instance
...
W1122 17:03:09.808770       1 scale_up.go:383] Node group kubernetes-minion-test-mc-a-amd-asg-us-east-1b is not ready for scaleup - backoff

Here we can see that the node group errors out and is added to the backoff list, and the placeholder nodes are deleted, as desired.

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch 3 times, most recently from 4173854 to 4e4a9d5 on November 12, 2021 at 23:10
Collaborator

@evansheng evansheng left a comment


The code mostly makes sense to me. One question, though: most of this code updates the status of the instance (or placeholder instance) to unavailable if we're out of capacity.

Where does this status actually get used for the loop to short-circuit? Is this already handled outside the AWS-specific code?


drmorr0 commented Nov 15, 2021

Where does this status actually get used for the loop to short-circuit? Does this get handled already outside the aws specific code?

This is handled by the core logic. RunOnce calls the cloud provider's Refresh function (link), which eventually trickles down to the regenerate function in the AWS cloud provider. Next, RunOnce calls updateClusterState (link), which in turn calls handleInstanceCreationErrors (link), which eventually calls backoffNodeGroup (link).
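The call chain above can be sketched in miniature. The function and type names below mirror the real cluster-autoscaler code paths (RunOnce, Refresh, updateClusterState, backoffNodeGroup), but every type and body here is an invented stub for illustration, not the actual implementation:

```go
package main

import "fmt"

// nodeGroup is a stand-in for an AWS ASG tracked by the autoscaler.
type nodeGroup struct {
	name       string
	outOfStock bool // set by the cloud provider when AWS reports no capacity
	backedOff  bool // set by cluster state when creation errors are handled
}

// Refresh stands in for the cloud provider's Refresh -> regenerate path,
// which is where the AWS-specific code marks placeholder instances as
// unfulfillable.
func (ng *nodeGroup) Refresh() {
	ng.outOfStock = true // in the real code, derived from scaling activities
}

// updateClusterState stands in for the core logic that inspects instance
// creation errors and hands them to handleInstanceCreationErrors.
func updateClusterState(ng *nodeGroup) {
	if ng.outOfStock {
		backoffNodeGroup(ng)
	}
}

// backoffNodeGroup disables scale-up for the group, short-circuiting
// further attempts until the backoff expires.
func backoffNodeGroup(ng *nodeGroup) {
	ng.backedOff = true
	fmt.Printf("Disabling scale-up for node group %s; errorClass=OutOfResource\n", ng.name)
}

// RunOnce sketches the ordering: the provider Refresh runs first, then the
// cluster state update picks up the recorded status.
func RunOnce(ng *nodeGroup) {
	ng.Refresh()
	updateClusterState(ng)
}

func main() {
	ng := &nodeGroup{name: "kubernetes-minion-test-mc-a-amd-asg-us-east-1a"}
	RunOnce(ng)
	fmt.Println("backed off:", ng.backedOff)
}
```

The key point the sketch captures is that the AWS-specific code only records status; the generic core logic is what actually reacts to it.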

Anyways, the proof that this all works will be when I actually test it :)

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from 4e4a9d5 to 2eaf42e on November 15, 2021 at 17:31
@evansheng
Collaborator

TODO: add comment about AWS describeScalingActivities permission in AWS cloudprovider readme somewhere

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from 401e20c to 9a30dec on November 22, 2021 at 22:57

drmorr0 commented Nov 22, 2021

TODO: add comment about AWS describeScalingActivities permission in AWS cloudprovider readme somewhere

Done.
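For reference, autoscaling:DescribeScalingActivities is a real AWS API action, and the readme note presumably asks operators to grant it. A policy statement granting it might look like the following sketch (the exact wording added to the readme may differ):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["autoscaling:DescribeScalingActivities"],
      "Resource": "*"
    }
  ]
}
```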


drmorr0 commented Nov 22, 2021

@evansheng @jtai This is ready for internal review before I open this upstream.


drmorr0 commented Nov 30, 2021

Opening upstream.

@drmorr0 drmorr0 closed this Nov 30, 2021

drmorr0 commented Nov 30, 2021

kubernetes#4489
