[WIP] Early abort if AWS has no capacity #7
Conversation
Force-pushed from 4173854 to 4e4a9d5.
The code mostly makes sense to me. One question, though: most of this code updates the status of the instance (or placeholder instance) to unavailable if we're out of capacity.
Where does this status actually get used for the loop to short-circuit? Is that already handled outside the AWS-specific code?
This is handled by the core logic. Anyway, the proof that this all works will be when I actually test it :)
Force-pushed from 4e4a9d5 to 2eaf42e.
TODO: add a note about the AWS DescribeScalingActivities permission somewhere in the AWS cloudprovider README.
Force-pushed from 401e20c to 9a30dec.
Done.
@evansheng @jtai This is ready for internal review before I open this upstream.
Opening upstream.
Summary
Testing details
I created 3 `c5ad.16xlarge` ASGs in `test-mc-a`, and tried to scale up a test deployment in the cluster that had a "required during scheduling" node affinity for `c5ad.16xlarge`. The node group quickly ran out of capacity and then deleted all of the "upcoming" nodes from the node group, as you can see in the following series of graphs.

Specifically, note that at multiple points in the time period, it tries to scale up, creates a bunch of unregistered nodes (yellow bars in the bottom graph), and then deletes them shortly thereafter. We can see from the following graphs that the reason for the failed scale-up is `placeholder-cannot-be-fulfilled`.

We can also look at the logs for this time period:
Here we can see the node group errors out and is added to the backoff list, and the placeholder nodes are deleted, as desired.