remove nodes stuck creating even if this takes us below nodeGroup minSize #4357

drmorr0 · 2021-09-28T05:14:45Z

This PR is an attempt to resolve #4351. The AWS cloudprovider doesn't currently set the InstanceStatus field on cloudprovider.Instance, so this change starts populating it: the status is InstanceRunning unless the node is a placeholder node, in which case it is InstanceCreating. Then, in the removeOldUnregisteredNodes function, if removing the node would take us below the node group's minSize but the status is "creating", we go ahead and remove the node anyway. The reasoning is that if the node has been stuck in creating for 15 minutes (or whatever the max node provision time is), the node group is already below its minimum size; CA just doesn't know it.
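To make the decision concrete, here is a minimal Go sketch of the rule described above. It is not the code in this PR: `unregisteredNode`, `shouldRemove`, and the local `InstanceState` constants are hypothetical stand-ins for the real cluster-autoscaler types, and the 15-minute value is only the example figure mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// InstanceState is an illustrative stand-in for the cloud provider's
// instance state; it is not the real cluster-autoscaler type.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
)

// maxNodeProvisionTime is an assumed value for the max node provision time.
const maxNodeProvisionTime = 15 * time.Minute

// unregisteredNode is a hypothetical, simplified view of an instance that the
// cloud provider reports but that has not registered with the cluster.
type unregisteredNode struct {
	Name            string
	State           InstanceState
	UnregisteredFor time.Duration
}

// shouldRemove reports whether a long-unregistered node should be deleted
// from a node group that currently has groupSize instances.
func shouldRemove(n unregisteredNode, groupSize, minSize int) bool {
	// Nodes still inside the provisioning window are left alone.
	if n.UnregisteredFor <= maxNodeProvisionTime {
		return false
	}
	// Removing the node keeps the group at or above minSize: always allowed.
	if groupSize-1 >= minSize {
		return true
	}
	// Otherwise removal would go below minSize. Allow it only for nodes that
	// never left "creating", since they were never real capacity.
	return n.State == InstanceCreating
}

func main() {
	stuck := unregisteredNode{
		Name:            "i-placeholder-0",
		State:           InstanceCreating,
		UnregisteredFor: 20 * time.Minute,
	}
	// The group is already at minSize (3), but the node is stuck creating.
	fmt.Println(shouldRemove(stuck, 3, 3))
}
```

With a node in the "running" state and the same sizes, `shouldRemove` returns false, which matches the existing behaviour of never shrinking a group below its minSize.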

I've added a case to the TestStaticAutoscalerRunOnceWithLongUnregisteredNodes test case to confirm that this works as expected.

I think this is a relatively simple/safe change. My only slight concern is that I'm not sure how other cloud providers that use the InstanceStatus field will respond to this change. It looks like the other cloud providers that use the InstanceCreating value are:

Azure
Bizfly
DigitalOcean
Exoscale
GCE
Hetzner
Huawei
Ionos
Magnum
OVH

I don't know a good way to test the behaviour of these other cloud providers. :\

The other thing I was unsure about was what to do in the ClusterStateRegistry when an instance changes state. My intuition was that if a node transitions from (say) "creating" to "running", it would make sense to reset the node's "unregistered-since" timer. But since we don't currently track state transitions at all, this would be a change from the existing behaviour that might be unexpected.
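For illustration only, the timer-reset idea could look roughly like the Go sketch below. This is not implemented in this PR and is not how ClusterStateRegistry works today; `unregisteredTracker`, `trackedNode`, `observe`, and `expired` are hypothetical names, and the 15-minute constant is an assumed max node provision time.

```go
package main

import (
	"fmt"
	"time"
)

// InstanceState is an illustrative stand-in, not the real cluster-autoscaler type.
type InstanceState string

const (
	InstanceCreating InstanceState = "creating"
	InstanceRunning  InstanceState = "running"
)

// maxNodeProvisionTime is an assumed value for the max node provision time.
const maxNodeProvisionTime = 15 * time.Minute

// trackedNode remembers the last state seen and when the timer last started.
type trackedNode struct {
	State             InstanceState
	UnregisteredSince time.Time
}

// unregisteredTracker is a hypothetical helper that restarts a node's
// "unregistered-since" timer whenever its instance state changes.
type unregisteredTracker struct {
	nodes map[string]trackedNode
}

func newUnregisteredTracker() *unregisteredTracker {
	return &unregisteredTracker{nodes: make(map[string]trackedNode)}
}

// observe records the state seen at time now and returns the time the node is
// counted as unregistered from. A new node or a state change resets the timer.
func (t *unregisteredTracker) observe(name string, state InstanceState, now time.Time) time.Time {
	prev, seen := t.nodes[name]
	if !seen || prev.State != state {
		prev = trackedNode{State: state, UnregisteredSince: now}
		t.nodes[name] = prev
	}
	return prev.UnregisteredSince
}

// expired reports whether a tracked node has exceeded the provisioning window.
func (t *unregisteredTracker) expired(name string, now time.Time) bool {
	n, seen := t.nodes[name]
	return seen && now.Sub(n.UnregisteredSince) > maxNodeProvisionTime
}

func main() {
	start := time.Date(2021, 9, 28, 5, 0, 0, 0, time.UTC)
	tr := newUnregisteredTracker()

	tr.observe("i-placeholder-0", InstanceCreating, start)
	// Same state ten minutes later: the timer keeps running from start.
	since := tr.observe("i-placeholder-0", InstanceCreating, start.Add(10*time.Minute))
	fmt.Println(since.Equal(start)) // true

	// Transition to running at +12m: the timer restarts at that moment.
	tr.observe("i-placeholder-0", InstanceRunning, start.Add(12*time.Minute))
	// At +20m only 8 minutes have elapsed since the reset.
	fmt.Println(tr.expired("i-placeholder-0", start.Add(20*time.Minute))) // false
}
```

Without the reset, the same node would count as expired at the 20-minute mark, which is the behavioural difference I was hesitant about.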

k8s-ci-robot · 2021-09-28T05:15:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: drmorr0
To complete the pull request process, please assign aleksandra-malinowska after the PR has been reviewed.
You can assign the PR to them by writing /assign @aleksandra-malinowska in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2021-10-27T22:42:19Z

@drmorr0: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2022-01-25T23:29:56Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

drmorr0 · 2022-02-07T21:18:57Z

This is superseded by #4489

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 28, 2021

k8s-ci-robot requested review from feiskyer and Jeffwan September 28, 2021 05:15

jbartosik added the area/cluster-autoscaler label Sep 28, 2021

remove nodes stuck creating even if this takes us below nodeGroup minSize · 6f00826

drmorr0 force-pushed the drmorr--airbnb--add-instance-state-aws-placeholders branch from 6b3483e to 6f00826 Compare October 4, 2021 16:16

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2021

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2022

drmorr0 closed this Feb 7, 2022
