remove nodes stuck creating even if this takes us below nodeGroup minSize #4357

@drmorr0 drmorr0 commented Sep 28, 2021

This PR is an attempt to resolve #4351. The AWS cloudprovider doesn't currently set the InstanceStatus field on cloudprovider.Instance, so this change starts populating it: we set it to InstanceRunning, except when the node is a placeholder node, in which case we set it to InstanceCreating. Then, in the removeOldUnregisteredNodes function, if removing the node would take us below the node group's minSize but the status is "creating", we go ahead and remove the node anyway. The logic here is that if the node has been stuck in creating for 15 minutes (or whatever the max node provision time is), the node group is already below its minimum size in practice; CA just doesn't think it is.

I've added a case to the TestStaticAutoscalerRunOnceWithLongUnregisteredNodes test case to confirm that this works as expected.

I think this is a relatively simple and safe change. My only slight concern is that I'm not sure how other cloud providers that use the InstanceStatus field will respond to it. It looks like the other cloud providers that use the InstanceCreating value are:

  • Azure
  • Bizfly
  • DigitalOcean
  • Exoscale
  • GCE
  • Hetzner
  • Huawei
  • Ionos
  • Magnum
  • OVH

I don't know a good way to test the behaviour of these other cloud providers. :\

The other thing I was unsure about was what to do in the ClusterStateRegistry when an instance changes state. My intuition was that if a node transitions from (say) "creating" to "running", it would make sense to reset the node's "unregistered-since" timer. But since we currently don't track state transitions at all, this would be a change from the existing behaviour that might be unexpected.

@k8s-ci-robot added the labels cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) on Sep 28, 2021
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: drmorr0
To complete the pull request process, please assign aleksandra-malinowska after the PR has been reviewed.
You can assign the PR to them by writing /assign @aleksandra-malinowska in a comment when ready.


@drmorr0 drmorr0 force-pushed the drmorr--airbnb--add-instance-state-aws-placeholders branch from 6b3483e to 6f00826 Compare October 4, 2021 16:16
@k8s-ci-robot

@drmorr0: PR needs rebase.


@k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Oct 27, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed


/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Jan 25, 2022
@drmorr0

drmorr0 commented Feb 7, 2022

This is superseded by #4489

@drmorr0 drmorr0 closed this Feb 7, 2022
Labels
  • area/cluster-autoscaler
  • cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
  • lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)
  • needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD)
  • size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
Development

Successfully merging this pull request may close these issues.

Cluster Autoscaler should remove oldUnregistered placeholder nodes, even if this would go below minSize
4 participants