
Cluster autoscaler using ASG even after marking it unhealthy #4900

Closed
tarun-asthana opened this issue May 19, 2022 · 5 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tarun-asthana

tarun-asthana commented May 19, 2022

Which component are you using?:
Cluster Autoscaler

What version of the component are you using?:

Component version: v1.18.2

What k8s version are you using (kubectl version)?:

$ kubectl version

v1.18.9

What environment is this in?:

AWS/kops

What did you expect to happen?:

Cluster Autoscaler should not use an ASG that it has already marked unhealthy.

What happened instead?:

Cluster Autoscaler kept maintaining the desired count of the ASG even after marking it unhealthy.
(Screenshot attached: 2022-05-19 at 6:34 PM)

How to reproduce it (as minimally and precisely as possible):

The issue happened due to a spot interruption, so it cannot be reliably reproduced.

Anything else we need to know?:
Priority Expander:
priorities:
  10:
    - ondemand-for-spot-a.k8s2
    - ondemand-for-spot-b.k8s2
    - ondemand-for-spot-c.k8s2
    - spot-16-128-r5-4x-a.k8s2
    - spot-16-128-r5-4x-b.k8s2
    - spot-16-128-r5-4x-c.k8s2
    - spot-8-16-c5-2x-a.k8s2
    - spot-8-16-c5-2x-b.k8s2
    - spot-8-16-c5-2x-c.k8s2
    - ondemand-16-32-c5-4x-a.k8s2
    - ondemand-16-32-c5-4x-b.k8s2
    - ondemand-16-32-c5-4x-c.k8s2
  15:
    - spot-32-128-m5-8x-c-new.k8s2
    - spot-32-128-m5-8x-b-new.k8s2
    - ondemand-32-128-m5-8x-b-new.k8s2
    - ondemand-32-128-m5-8x-c-new.k8s2
  50:
    - spot-32-128-m5-8x-a-new.k8s2
    - ondemand-32-128-m5-8x-a-new.k8s2
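
For context, the priority expander reads this mapping from a ConfigMap named cluster-autoscaler-priority-expander in the namespace Cluster Autoscaler runs in, under a data key called priorities; the entries are treated as regular expressions matched against node group names. A minimal sketch of how the block above would be wrapped, assuming the kube-system namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system   # assumed: wherever Cluster Autoscaler is deployed
data:
  priorities: |-
    10:
      - ondemand-for-spot-a.k8s2
      # ... remaining priority-10 node groups from the list above
    15:
      - spot-32-128-m5-8x-c-new.k8s2
      # ... remaining priority-15 node groups from the list above
    50:
      - spot-32-128-m5-8x-a-new.k8s2
      - ondemand-32-128-m5-8x-a-new.k8s2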

Logs:

cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart I0519 11:36:33.928882       1 auto_scaling_groups.go:409] Instance group spot-32-128-m5-8x-c-new.k8s2 has only 7 instances created while requested count is 42. Creating placeholder instance with ID i-placeholder-spot-32-128-m5-8x-c-new.k8s2-39.
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart I0519 11:36:33.928887       1 auto_scaling_groups.go:409] Instance group spot-32-128-m5-8x-c-new.k8s2 has only 7 instances created while requested count is 42. Creating placeholder instance with ID i-placeholder-spot-32-128-m5-8x-c-new.k8s2-40.
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart I0519 11:36:33.928893       1 auto_scaling_groups.go:409] Instance group spot-32-128-m5-8x-c-new.k8s2 has only 7 instances created while requested count is 42. Creating placeholder instance with ID i-placeholder-spot-32-128-m5-8x-c-new.k8s2-41.
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart W0519 11:36:39.232584       1 scale_up.go:380] Node group spot-32-128-m5-8x-c-new.k8s2 is not ready for scaleup - unhealthy
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart W0519 11:36:49.795781       1 scale_up.go:380] Node group spot-32-128-m5-8x-c-new.k8s2 is not ready for scaleup - unhealthy
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart W0519 11:37:00.593376       1 scale_up.go:380] Node group spot-32-128-m5-8x-c-new.k8s2 is not ready for scaleup - unhealthy
@drmorr0
Contributor

drmorr0 commented Jun 9, 2022

There is a feature in more recent Cluster Autoscaler versions that will "early-abort" scale-ups for spot ASGs that are unavailable; see #4489. I think this should solve (or at least significantly reduce) the issue you're seeing here.

This isn't available in 1.18, but there is a custom fork of Cluster Autoscaler that includes this feature: https://github.com/airbnb/autoscaler/tree/cluster-autoscaler-release-1.18-airbnb (it is not a supported fork, though, so you're on your own if you decide to use it).
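
If you do try the fork, one way to run it is to build and push your own image from that branch and point the Helm chart at it. A minimal values.yaml sketch for the chart's image settings; the registry path and tag below are placeholders, not a published image:

# values.yaml (sketch): run a self-built image of the 1.18 fork
image:
  repository: <your-registry>/cluster-autoscaler   # hypothetical: an image you build from the fork branch
  tag: v1.18.2-airbnb-custom                       # hypothetical tag
  pullPolicy: IfNotPresent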

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Sep 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
