Don't scale down ASGs when placeholder nodes are discovered #5846
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: theintz
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.

Welcome @theintz!

/assign @gjtempleton
We deployed this patch on our clusters a week ago (patched into v1.26.3) and it appears to have stopped the problems outlined in #5829. Looking forward to your feedback.
I'm not familiar with AWS's ASGs, but I think one use case should be considered (correct me if I'm wrong): if EC2 is out of stock and the ASG cannot create new instances, CA will delete these placeholder nodes after MaxNodeProvisionTime (default …). A sketch of what these placeholders are is shown below.
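For context, here is a minimal sketch of how a placeholder can be told apart from a real instance in the provider's cache. The type, the helper name, and the exact prefix value are illustrative stand-ins under the assumption that placeholders are tracked by a reserved instance-ID prefix, not the autoscaler's actual identifiers:

```go
package main

import (
	"fmt"
	"strings"
)

// AwsInstanceRef is a simplified stand-in for the reference type the AWS
// provider keeps in its ASG cache; only the fields needed here are included.
type AwsInstanceRef struct {
	ProviderID string
	Name       string
}

// Placeholders are assumed to be tracked under a reserved instance-ID prefix;
// the exact prefix value is illustrative.
const placeholderInstanceNamePrefix = "i-placeholder"

// isPlaceholderInstance reports whether a cached entry was synthesized for
// requested-but-never-created capacity rather than backed by a real EC2 instance.
func isPlaceholderInstance(instance *AwsInstanceRef) bool {
	return strings.HasPrefix(instance.Name, placeholderInstanceNamePrefix)
}

func main() {
	realInstance := &AwsInstanceRef{ProviderID: "aws:///us-east-1a/i-0abc123", Name: "i-0abc123"}
	placeholder := &AwsInstanceRef{Name: "i-placeholder-my-asg-3"}
	fmt.Println(isPlaceholderInstance(realInstance), isPlaceholderInstance(placeholder)) // false true
}
```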
@jaypipes @gjtempleton any feedback?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
```diff
@@ -308,11 +304,9 @@ func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {
 	for _, instance := range instances {
 		// check if the instance is a placeholder - a requested instance that was never created by the node group
-		// if it is, just decrease the size of the node group, as there's no specific instance we can remove
+		// if it is, simply remove it from the cache
```
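To make the change in the hunk above concrete, here is a hedged before/after sketch of the placeholder branch; the types and helper names (asgCache, setAsgSize, removeInstanceFromCache) are simplified stand-ins, not the exact upstream code:

```go
package main

import "fmt"

// Illustrative stand-ins for the provider's cache types; names are simplified
// and not necessarily identical to the upstream autoscaler code.
type AwsInstanceRef struct{ Name string }

type asg struct {
	Name    string
	curSize int
}

type asgCache struct {
	instances map[string]*AwsInstanceRef
}

// setAsgSize stands in for the call that asks AWS to lower desired capacity.
func (m *asgCache) setAsgSize(a *asg, size int) error {
	fmt.Printf("SetDesiredCapacity(%s, %d)\n", a.Name, size)
	a.curSize = size
	return nil
}

// removeInstanceFromCache only mutates local state; no AWS API call is made.
func (m *asgCache) removeInstanceFromCache(i *AwsInstanceRef) {
	delete(m.instances, i.Name)
}

// deletePlaceholder contrasts the two behaviors discussed in this PR.
func (m *asgCache) deletePlaceholder(a *asg, i *AwsInstanceRef) error {
	// Before this PR (roughly): shrink the ASG itself. If real instances came
	// up in the meantime, AWS may terminate one of them to satisfy the lower
	// desired capacity - the unsafe decommissioning described in #5829.
	//   return m.setAsgSize(a, a.curSize-1)

	// After this PR: only forget the placeholder locally; the ASG's desired
	// capacity is left untouched.
	m.removeInstanceFromCache(i)
	return nil
}

func main() {
	cache := &asgCache{instances: map[string]*AwsInstanceRef{
		"i-placeholder-asg1-0": {Name: "i-placeholder-asg1-0"},
	}}
	group := &asg{Name: "asg1", curSize: 3}
	_ = cache.deletePlaceholder(group, cache.instances["i-placeholder-asg1-0"])
	fmt.Println(len(cache.instances), group.curSize) // 0 3
}
```

The design choice under discussion is exactly this: whether deleting a placeholder should touch only local bookkeeping or also issue an API call that shrinks the group.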
By removing the node(s) in the local cache but not actually scaling down the corresponding ASG, the ongoing/pending request for additional capacity on the provider may impact what capacity is accessible to subsequent requests. Basically, the ongoing request is a capacity spoof at best and a capacity deadlock at worst.

Put another way, if you request 100 instances in ASG_1 and get 25 before running into InsufficientInstanceCapacity, it's probably not the right answer to leave that request ongoing. It's not unreasonable to think that the ongoing request for an additional 75 nodes in ASG_1 could affect subsequent requests to provision instances of that type in the same or other ASGs.
Thanks a lot for taking a look at this. I'm not very familiar with what the placeholder instances are intended to do. There is probably a good reason why they are being created, but when I wrote this patch, they were only ever checked in this one location. Maybe this has changed since then?

If you take a look at the linked bug report, I've mapped out how, under certain circumstances, the flow of the code leads to unsafe decommissioning of instances. I see your point as well, but in our experience, once a request runs into the insufficient capacity issue, it's considered completed (or rather, failed) and no longer ongoing. So by sending requests to scale the ASG back down, we're not canceling the failed request but rather removing already existing capacity.
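For readers less familiar with the AWS side, this is roughly the kind of call the old code path issued when it shrank the group for a placeholder, sketched here directly against aws-sdk-go rather than the autoscaler's own wrappers; the group name and numbers are hypothetical:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Lowering desired capacity does not "cancel" a previous scale-up request;
	// it tells the ASG to converge on the new, lower target. If instances did
	// launch after the placeholder was created, the group may terminate one of
	// them to reach that target - which is why this PR avoids the call for
	// placeholders. ("my-asg" and the numbers are hypothetical.)
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("my-asg"),
		DesiredCapacity:      aws.Int64(24), // e.g. current desired 25 minus one placeholder
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("SetDesiredCapacity failed: %v", err)
	}
}
```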
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@Shubham82: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #5829
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: