
[WIP] Early abort if AWS has no capacity #7

Closed

Conversation

@drmorr0 drmorr0 commented Nov 12, 2021

Summary

  • All existing tests pass
  • Recommend reviewing the two commits separately
  • New unit tests for the new behaviour
  • Tested in test-mc-a (see below for details)

Testing details

I created 3 c5ad.16xlarge ASGs in test-mc-a and tried to scale up a test deployment in the cluster that had a "required during scheduling" node affinity for c5ad.16xlarge. The node group quickly ran out of capacity, and the autoscaler then deleted all of the "upcoming" nodes from the node group, as you can see in the following series of graphs:
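For context, the scheduling constraint used in the test looked roughly like the following manifest (illustrative only; the names and replica count are assumptions, and the actual deployment used in test-mc-a may differ):

```yaml
# Hypothetical sketch of the test deployment's "required during scheduling"
# node affinity, pinning pods to c5ad.16xlarge nodes via the well-known
# instance-type label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-test
spec:
  replicas: 14
  selector:
    matchLabels:
      app: capacity-test
  template:
    metadata:
      labels:
        app: capacity-test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - c5ad.16xlarge
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
```

Because the affinity is "required", the scheduler can only place these pods on c5ad.16xlarge nodes, so the autoscaler keeps trying to scale up the matching ASGs until they run out of capacity.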

Screen Shot 2021-11-22 at 3 08 31 PM

Specifically, note that at multiple points in the time period, it tries to scale up, creates a bunch of unregistered nodes (yellow bars in the bottom graph), and then deletes them shortly thereafter. We can see from the following graphs that the reason for the failed scale-up is placeholder-cannot-be-fulfilled:

Screen Shot 2021-11-22 at 3 09 51 PM

We can also look at the logs for this time period:

I1122 17:02:51.876677       1 auto_scaling_groups.go:432] Instance group kubernetes-minion-test-mc-a-amd-asg-us-east-1a has only 0 instances created while requested count is 14. Creating placeholder instances.
...
W1122 17:02:51.935464       1 auto_scaling_groups.go:440] Instance group kubernetes-minion-test-mc-a-amd-asg-us-east-1a cannot provision any more nodes!
...
W1122 17:02:52.094981       1 clusterstate.go:281] Disabling scale-up for node group kubernetes-minion-test-mc-a-amd-asg-us-east-1b until 2021-11-22 17:07:50.399430339 +0000 UTC m=+48821.870874859; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fullfilled
...
I1122 17:02:52.095011       1 clusterstate.go:1043] Failed adding 14 nodes (14 unseen previously) to group kubernetes-minion-test-mc-a-amd-asg-us-east-1a due to OutOfResource.placeholder-cannot-be-fullfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
...
I1122 17:02:52.095226       1 auto_scaling_groups.go:287] instance i-placeholder-kubernetes-minion-test-mc-a-amd-asg-us-east-1b-0 is detected as a placeholder, decreasing ASG requested size instead of deleting instance
...
W1122 17:03:09.808770       1 scale_up.go:383] Node group kubernetes-minion-test-mc-a-amd-asg-us-east-1b is not ready for scaleup - backoff

Here we can see that the node group errors out and is added to the backoff list, and the placeholder nodes are deleted, as desired.

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch 3 times, most recently from 4173854 to 4e4a9d5 on November 12, 2021 at 23:10
Collaborator

@evansheng evansheng left a comment


The code mostly makes sense to me. One question, though: most of this code updates the status of the instance (or placeholder instance) to unavailable if we're out of capacity.

Where does this status actually get used for the loop to short-circuit? Is this already handled outside the AWS-specific code?


drmorr0 commented Nov 15, 2021

Where does this status actually get used for the loop to short-circuit? Does this get handled already outside the aws specific code?

This is handled by the core logic. RunOnce calls the cloud provider's Refresh function (link), which eventually trickles down to the regenerate function in the AWS cloud provider. Next, RunOnce calls updateClusterState (link), which in turn calls handleInstanceCreationErrors (link), which eventually calls backoffNodeGroup (link).
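The call chain above can be sketched in miniature. The function and type names below mirror the real cluster-autoscaler code paths (RunOnce, Refresh, updateClusterState, backoffNodeGroup), but every type and body here is an invented stub for illustration, not the actual implementation:

```go
package main

import "fmt"

// nodeGroup is a stand-in for an AWS ASG tracked by the autoscaler.
type nodeGroup struct {
	name       string
	outOfStock bool // set by the cloud provider when AWS reports no capacity
	backedOff  bool // set by cluster state when creation errors are handled
}

// Refresh stands in for the cloud provider's Refresh -> regenerate path,
// which is where the AWS-specific code marks placeholder instances as
// unfulfillable.
func (ng *nodeGroup) Refresh() {
	ng.outOfStock = true // in the real code, derived from scaling activities
}

// updateClusterState stands in for the core logic that inspects instance
// creation errors and hands them to handleInstanceCreationErrors.
func updateClusterState(ng *nodeGroup) {
	if ng.outOfStock {
		backoffNodeGroup(ng)
	}
}

// backoffNodeGroup disables scale-up for the group, short-circuiting
// further attempts until the backoff expires.
func backoffNodeGroup(ng *nodeGroup) {
	ng.backedOff = true
	fmt.Printf("Disabling scale-up for node group %s; errorClass=OutOfResource\n", ng.name)
}

// RunOnce sketches the ordering: the provider Refresh runs first, then the
// cluster state update picks up the recorded status.
func RunOnce(ng *nodeGroup) {
	ng.Refresh()
	updateClusterState(ng)
}

func main() {
	ng := &nodeGroup{name: "kubernetes-minion-test-mc-a-amd-asg-us-east-1a"}
	RunOnce(ng)
	fmt.Println("backed off:", ng.backedOff)
}
```

The key point the sketch captures is that the AWS-specific code only records status; the generic core logic is what actually reacts to it.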

Anyways, the proof that this all works will be when I actually test it :)

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from 4e4a9d5 to 2eaf42e on November 15, 2021 at 17:31
@evansheng
Collaborator

TODO: add comment about AWS describeScalingActivities permission in AWS cloudprovider readme somewhere

@drmorr0 drmorr0 force-pushed the drmorr--early-abort-if-aws-node-group-no-capacity branch from 401e20c to 9a30dec on November 22, 2021 at 22:57

drmorr0 commented Nov 22, 2021

TODO: add comment about AWS describeScalingActivities permission in AWS cloudprovider readme somewhere

Done.
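For reference, autoscaling:DescribeScalingActivities is a real AWS API action, and the readme note presumably asks operators to grant it. A policy statement granting it might look like the following sketch (the exact wording added to the readme may differ):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["autoscaling:DescribeScalingActivities"],
      "Resource": "*"
    }
  ]
}
```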


drmorr0 commented Nov 22, 2021

@evansheng @jtai This is ready for internal review before I open this upstream.


drmorr0 commented Nov 30, 2021

Opening upstream.

@drmorr0 drmorr0 closed this Nov 30, 2021

drmorr0 commented Nov 30, 2021

kubernetes#4489
