
Cluster autoscaler using ASG even after marking it unhealthy #4900

Closed
tarun-asthana opened this issue May 19, 2022 · 5 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tarun-asthana

tarun-asthana commented May 19, 2022

Which component are you using?:
Cluster Autoscaler

What version of the component are you using?:

Component version: v1.18.2

What k8s version are you using (kubectl version)?:

$ kubectl version

v1.18.9

What environment is this in?:

AWS/kops

What did you expect to happen?:

Cluster Autoscaler should not use an ASG that it has already marked unhealthy.

What happened instead?:

Cluster Autoscaler kept maintaining the desired count of the ASG even after marking it unhealthy.
(Screenshot attached: 2022-05-19 at 6:34 PM)

How to reproduce it (as minimally and precisely as possible):

The issue happened due to a spot interruption, so it cannot be reliably reproduced.

Anything else we need to know?:
Priority Expander:
priorities:
  10:
    - ondemand-for-spot-a.k8s2
    - ondemand-for-spot-b.k8s2
    - ondemand-for-spot-c.k8s2
    - spot-16-128-r5-4x-a.k8s2
    - spot-16-128-r5-4x-b.k8s2
    - spot-16-128-r5-4x-c.k8s2
    - spot-8-16-c5-2x-a.k8s2
    - spot-8-16-c5-2x-b.k8s2
    - spot-8-16-c5-2x-c.k8s2
    - ondemand-16-32-c5-4x-a.k8s2
    - ondemand-16-32-c5-4x-b.k8s2
    - ondemand-16-32-c5-4x-c.k8s2
  15:
    - spot-32-128-m5-8x-c-new.k8s2
    - spot-32-128-m5-8x-b-new.k8s2
    - ondemand-32-128-m5-8x-b-new.k8s2
    - ondemand-32-128-m5-8x-c-new.k8s2
  50:
    - spot-32-128-m5-8x-a-new.k8s2
    - ondemand-32-128-m5-8x-a-new.k8s2
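
For context, the priority expander reads this mapping from a ConfigMap named cluster-autoscaler-priority-expander in the namespace Cluster Autoscaler runs in, under a data key called priorities; the entries are treated as regular expressions matched against node group names. A minimal sketch of how the block above would be wrapped, assuming the kube-system namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system   # assumed: wherever Cluster Autoscaler is deployed
data:
  priorities: |-
    10:
      - ondemand-for-spot-a.k8s2
      # ... remaining priority-10 node groups from the list above
    15:
      - spot-32-128-m5-8x-c-new.k8s2
      # ... remaining priority-15 node groups from the list above
    50:
      - spot-32-128-m5-8x-a-new.k8s2
      - ondemand-32-128-m5-8x-a-new.k8s2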

Logs:

cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart I0519 11:36:33.928882       1 auto_scaling_groups.go:409] Instance group spot-32-128-m5-8x-c-new.k8s2 has only 7 instances created while requested count is 42. Creating placeholder instance with ID i-placeholder-spot-32-128-m5-8x-c-new.k8s2-39.
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart I0519 11:36:33.928887       1 auto_scaling_groups.go:409] Instance group spot-32-128-m5-8x-c-new.k8s2 has only 7 instances created while requested count is 42. Creating placeholder instance with ID i-placeholder-spot-32-128-m5-8x-c-new.k8s2-40.
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart I0519 11:36:33.928893       1 auto_scaling_groups.go:409] Instance group spot-32-128-m5-8x-c-new.k8s2 has only 7 instances created while requested count is 42. Creating placeholder instance with ID i-placeholder-spot-32-128-m5-8x-c-new.k8s2-41.
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart W0519 11:36:39.232584       1 scale_up.go:380] Node group spot-32-128-m5-8x-c-new.k8s2 is not ready for scaleup - unhealthy
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart W0519 11:36:49.795781       1 scale_up.go:380] Node group spot-32-128-m5-8x-c-new.k8s2 is not ready for scaleup - unhealthy
cluster-autoscaler-aws-cluster-autoscaler-chart-665597d969fwbkd aws-cluster-autoscaler-chart W0519 11:37:00.593376       1 scale_up.go:380] Node group spot-32-128-m5-8x-c-new.k8s2 is not ready for scaleup - unhealthy
@drmorr0
Contributor

drmorr0 commented Jun 9, 2022

There is a feature in more recent Cluster Autoscaler versions that will "early-abort" scale-ups for spot ASGs that are unavailable; see #4489. I think this should solve (or at least significantly reduce) the issue you're seeing here.

This isn't available in 1.18, but there is a custom fork of Cluster Autoscaler that includes this feature: https://github.com/airbnb/autoscaler/tree/cluster-autoscaler-release-1.18-airbnb (it is not a supported fork, though, so you're on your own if you decide to use it).
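
If you do try the fork, one way to run it is to build and push your own image from that branch and point the Helm chart at it. A minimal values.yaml sketch for the chart's image settings; the registry path and tag below are placeholders, not a published image:

# values.yaml (sketch): run a self-built image of the 1.18 fork
image:
  repository: <your-registry>/cluster-autoscaler   # hypothetical: an image you build from the fork branch
  tag: v1.18.2-airbnb-custom                       # hypothetical tag
  pullPolicy: IfNotPresent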

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Sep 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
