
[AWS] Unsafe decomissioning of nodes when ASGs are out of instances #5829

Closed
theintz opened this issue Jun 2, 2023 · 18 comments · Fixed by #6911
Labels
area/cluster-autoscaler · area/provider/aws (Issues or PRs related to aws provider) · kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@theintz

theintz commented Jun 2, 2023

Which component are you using?: cluster-autoscaler

What version of the component are you using?: cluster-autoscaler 1.25.1

Component version: chart cluster-autoscaler-9.28.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.9-eks-0a21954", GitCommit:"eb82cd845d007ae98d215744675dcf7ff024a5a3", GitTreeState:"clean", BuildDate:"2023-04-15T00:37:59Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS EKS

What did you expect to happen?: We are running EKS on managed nodes, using cluster-autoscaler to dynamically scale the node groups up and down. Unfortunately, AWS is sometimes unable to fulfil scale-up requests (Could not launch On-Demand Instances. InsufficientInstanceCapacity). When that happens, cluster-autoscaler gets confused about the size of the ASG and sends unnecessary SetDesiredCapacity requests, which result in unsafe scaling down of the ASG (details below).
The expectation is that CA handles these situations transparently and does not scale down the ASG.

What happened instead?: We observed the following behavior by looking at both the CA logs and the CloudTrail logs of the requests sent. I don't know enough about the internal workings of the CA, so some of this is inferred.

  • Pods are pending, waiting for capacity in a given node group.
  • CA wants to scale up this node group and sends a SetDesiredCapacity request to the ASG to increase its size (say from 3 -> 5). This happens asynchronously, so the CA fires off the request and moves on to other work; however, it updates its internal representation of the ASG to be at capacity 5.
  • ASG can’t scale up and returns that info to the CA.
  • CA comes back around to the request and realizes the ASG is blocked. It updates its internal representation (back to 3), but crucially it also updates the ASG itself, sending a number of SetDesiredCapacity requests to revert the capacity back down to 3.
  • In the meantime, a new instance might have successfully started, so the actual capacity of the ASG is now 4.
  • As a result of receiving the SetDesiredCapacity requests from the CA, the ASG terminates one (or more) of its running instances, resulting in the unsafe scale-down.

[image: sequence of events]

See the attached image that shows the sequence of events as well.

How to reproduce it (as minimally and precisely as possible): Difficult to reproduce, as it only happens when the AWS capacity issues occur. We are regularly observing this behavior, though, and I am happy to assist with more information.

Anything else we need to know?:

theintz added the kind/bug label Jun 2, 2023
@theintz
Author

theintz commented Jun 8, 2023

I investigated this some more. Here is what I believe is happening:

  • CA recognizes that there are unschedulable pods and initiates a scale-up of the relevant ASG. This sends a SetDesiredCapacity request to the ASG, which then attempts to start new instances accordingly. The method also updates the internal representation of the ASG (the asgCache) to reflect the desired size (func (m *asgCache) setAsgSizeNoLock(asg *asg, size int) error).
  • During the next iteration of the main loop, the asgCache is regenerated; placeholder instances are added to make up for the difference between the desired size and the actual instances in the group (func (m *asgCache) createPlaceholdersForDesiredNonStartedInstances(groups []*autoscaling.Group) []*autoscaling.Group). In this method the CA also discovers that the ASG was unable to scale up and assigns an error code to the health status of the created placeholder instances.
  • Further down in the same main loop, the CA tries to clean up dangling nodes that have already been deleted. To do this it looks at all the nodes it knows about and checks their health status (func (csr *ClusterStateRegistry) GetCreatedNodesWithErrors() []*apiv1.Node). This lookup returns exactly those placeholder instances created just before.
  • The CA proceeds to delete these nodes, calling the DeleteInstances method and passing the placeholder instances as the argument (func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error).
  • That method recognizes that the instances to be deleted are just placeholders, yet it still calls a method that sends a SetDesiredCapacity request with asgSize-1 to the ASG, thereby telling the ASG to scale down an arbitrary instance and producing the observed behavior (see the simplified sketch below).
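
For illustration, here is a minimal, self-contained sketch of that placeholder branch. This is not the upstream implementation; the types and helper names are simplified stand-ins.

```go
package main

import "fmt"

// Simplified stand-ins for the real cluster-autoscaler types; names and
// fields here are illustrative only.
type asg struct {
	name    string
	curSize int
}

type awsInstanceRef struct {
	name        string
	placeholder bool
}

type asgCache struct {
	group *asg
}

// setAsgSize stands in for the code path that sends SetDesiredCapacity to AWS
// and updates the cached size.
func (m *asgCache) setAsgSize(a *asg, size int) {
	fmt.Printf("SetDesiredCapacity(%s, %d)\n", a.name, size)
	a.curSize = size
}

// deleteInstances mirrors the behavior described above: a placeholder is not
// terminated directly; the desired capacity is simply decremented. If a real
// instance launched in the meantime, the ASG then terminates one of its
// running instances to meet the lowered target.
func (m *asgCache) deleteInstances(instances []*awsInstanceRef) {
	for _, inst := range instances {
		if inst.placeholder {
			m.setAsgSize(m.group, m.group.curSize-1)
			continue
		}
		// Real instances would be terminated explicitly via
		// TerminateInstanceInAutoScalingGroup (not shown here).
	}
}

func main() {
	// Desired size was raised from 3 to 5; one instance launched late, so the
	// group is actually at 4 when the placeholder cleanup runs.
	cache := &asgCache{group: &asg{name: "example-asg", curSize: 4}}
	cache.deleteInstances([]*awsInstanceRef{{name: "i-placeholder-0", placeholder: true}})
	// Prints: SetDesiredCapacity(example-asg, 3) -- a running instance gets culled.
}
```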

I can imagine a number of fixes for this, however I don't know the code well enough to comment on side effects.

  1. To my knowledge the placeholder instances are only ever checked in the DeleteInstances method of the asgCache; maybe that functionality could be removed altogether?
  2. A straightforward fix is to simply remove the placeholder instance from the asgCache when it is encountered, without sending the scale-down request (sketched below).

Will create a PR to implement 2).
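
Building on the stand-in types from the sketch above, option 2 would roughly amount to changing the placeholder branch so the cached entry is dropped without touching the ASG. Again, this is a hedged illustration of the proposal, not the actual PR:

```go
// Rough sketch of option 2, reusing the stand-in types from the earlier
// sketch: a placeholder is simply forgotten instead of lowering the ASG's
// desired capacity, so no running instance can be terminated as a side
// effect of the cleanup.
func (m *asgCache) deleteInstancesWithoutShrinking(instances []*awsInstanceRef) {
	for _, inst := range instances {
		if inst.placeholder {
			// Skip the placeholder (it would be removed from the cache here);
			// do not send SetDesiredCapacity. The ASG converges on its own
			// once capacity becomes available again.
			continue
		}
		// Real instances are still terminated explicitly via
		// TerminateInstanceInAutoScalingGroup (not shown here).
	}
}
```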

@steelecliftonberry

Would be great to get some insight/review here. It frequently (roughly daily) kills long-running batch jobs related to our model training and optimisation, because it causes accidental node scale-downs at times when AWS is out of instances (which is a constant occurrence).

This happens even though we set the do-not-evict annotation on these Pods, because, as described above, the AWS out-of-instances situation is not properly factored in.

@gicuuu3

gicuuu3 commented Jul 19, 2023

Hello, hello!

We're also being impacted by this issue, and it would be nice to get an update on when a fix can be expected. If there is none, are there any known workarounds to avoid the problem? (We've tried things like the scale-down-after-scale-up delay, but it didn't seem to work.)

Our setup has stateful pods, so having the scaling group kill active nodes breaks their sessions.

Thank you for the time and effort!

@theintz
Author

theintz commented Jul 26, 2023

Hi @gicuuu3, there is a PR open with a fix that's working for us; however, I'm not sure whether it will get merged, since it might have implications for other use cases.
One more thing we did that has helped is to configure the node group so that it can pick from a number of EC2 instance classes rather than just one: instance_types = ["m6a.xlarge", "m5a.xlarge", "m6i.xlarge", "m5.xlarge", "m4.xlarge"]. This allows the underlying ASG to pick from any of these classes (in descending priority), so that even when m6a and m5a are out of capacity, it can still spin up a new instance.
With these two measures in place, we haven't seen any nodes lost. I believe either of these approaches on its own would also help.
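
For reference, the same priority-ordered list can be applied directly to an ASG's mixed instances policy. Below is a sketch using aws-sdk-go v1; the ASG and launch template names are placeholders, and this assumes the group already uses a launch template:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Priority-ordered overrides: the ASG tries m6a.xlarge first and falls
	// back down the list when capacity is unavailable.
	types := []string{"m6a.xlarge", "m5a.xlarge", "m6i.xlarge", "m5.xlarge", "m4.xlarge"}
	overrides := make([]*autoscaling.LaunchTemplateOverrides, 0, len(types))
	for _, t := range types {
		overrides = append(overrides, &autoscaling.LaunchTemplateOverrides{InstanceType: aws.String(t)})
	}

	_, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String("example-node-group-asg"), // placeholder
		MixedInstancesPolicy: &autoscaling.MixedInstancesPolicy{
			LaunchTemplate: &autoscaling.LaunchTemplate{
				LaunchTemplateSpecification: &autoscaling.LaunchTemplateSpecification{
					LaunchTemplateName: aws.String("example-node-group-lt"), // placeholder
					Version:            aws.String("$Latest"),
				},
				Overrides: overrides,
			},
			InstancesDistribution: &autoscaling.InstancesDistribution{
				// "prioritized" makes the order of the overrides meaningful.
				OnDemandAllocationStrategy: aws.String("prioritized"),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```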

@gicuuu3

gicuuu3 commented Jul 26, 2023

Understood.

We've started using mixed instances and it has also helped in our scenario. We'll keep track of this PR, and if we ever start hitting the problem again we can apply it to our own setup as well.

Thank you for the help!

@gicuuu3

gicuuu3 commented Oct 25, 2023

In case anyone is still struggling with this issue, another thing we did that helped us was enabling the AWS Auto Scaling group scale-in protection:

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html

Since our autoscaler always issues targeted removal requests, this toggle protects us from random removal due to capacity reduction, while still allowing the autoscaler to remove nodes it doesn't need in the cluster by "name".
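
For anyone scripting this, here is a sketch of enabling scale-in protection with aws-sdk-go v1; the ASG name and instance ID are placeholders. Note that the group-level flag only covers newly launched instances, so already-running instances need a separate call:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	asgName := "example-node-group-asg" // placeholder

	// Newly launched instances get scale-in protection by default, so a
	// lowered desired capacity cannot pick them for termination.
	if _, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName:             aws.String(asgName),
		NewInstancesProtectedFromScaleIn: aws.Bool(true),
	}); err != nil {
		log.Fatal(err)
	}

	// Instances that are already running have to be protected explicitly.
	if _, err := svc.SetInstanceProtection(&autoscaling.SetInstanceProtectionInput{
		AutoScalingGroupName: aws.String(asgName),
		InstanceIds:          aws.StringSlice([]string{"i-0123456789abcdef0"}), // placeholder
		ProtectedFromScaleIn: aws.Bool(true),
	}); err != nil {
		log.Fatal(err)
	}
}
```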

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 31, 2024
@apy-liu

apy-liu commented Jan 31, 2024

We've also observed this happening multiple times across different clusters using cluster autoscaler v1.24.3.

  • The cluster autoscaler attempts to scale up the ASG and hits an OutOfCapacity error, creating the placeholder instances as mentioned above.
  • The ASG that received the scale-up request continues trying to launch the instance and successfully does so after about 2 minutes.
  • Cluster autoscaler has already considered the scale-up failed and tries to clean up the placeholder instances, with no knowledge of the successful launch.

I suspect this started happening more frequently because of the feature that aborts early on OutOfCapacity errors: #4489.

The logs from the intended behavior of the PR match what we see on the cluster:

1 clusterstate.go:1043] Failed adding 14 nodes (14 unseen previously) to group kubernetes-minion-test-mc-a-amd-asg-us-east-1a due to OutOfResource.placeholder-cannot-be-fullfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
...
1 auto_scaling_groups.go:287] instance i-placeholder-kubernetes-minion-test-mc-a-amd-asg-us-east-1b-0 is detected as a placeholder, decreasing ASG requested size instead of deleting instance

This means cluster autoscaler doesn't recognize that there was a successfully launched instance, due to the early abort on the first FAILED status, and the decreased desiredCount removes a running instance with potential workloads.

If that's the case, it seems like the proposed PR above could be one fix.

Another one may be exposing a flag to respect the backoff duration even when hitting an OutOfCapacity error: #5756

@apy-liu

apy-liu commented Jan 31, 2024

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Jan 31, 2024
@jlamillan
Contributor

jlamillan commented Feb 5, 2024

This means cluster autoscaler doesn't recognize that there was a successfully launched instance, due to the early abort on the first FAILED status, and the decreased desiredCount removes a running instance with potential workloads.

@apy-liu if the requested instance(s) do not show up in DescribeAutoScalingGroups, they are added as "placeholder" instances - meaning they have been requested but not yet created.

For example, if you attempt to scale up an ASG from 3 -> 6, but only 1 of the 3 additional instances is provisioned before an InsufficientInstanceCapacity error is detected, placeholder instances will be created for the 2 missing instances which the ASG reportedly has no capacity for. Looking at the code, once that InsufficientInstanceCapacity occurs, any requested-but-not-allocated instances for that ASG are written off as unfulfillable. DeleteInstances will be called with the "unfulfillable" instances, which in this case is handled simply by decreasing the size of the ASG by 2 (from 6 -> 4). It does seem possible that one or more of the so-called "unfulfillable" instances could actually be provisioned on the provider before DeleteInstances is called to effectively abort the original request.

If instead DeleteInstances did not reduce the size of the ASG by 2 and just deleted the placeholder instances from its cache, as in the corresponding PR, the ongoing/pending request for additional capacity on the provider might affect what capacity is available in a different ASG, essentially creating a capacity deadlock.

@apy-liu

apy-liu commented Feb 5, 2024

Hey @jlamillan, thanks for taking a look and providing a detailed explanation. That makes sense to me; in that case, the PR which adds a flag to respect the backoff duration seems like a more stable approach.

It does seem possible that one or more of the so-called "unfulfillable" instances could actually be provisioned on the provider before DeleteInstances is called to effectively abort the original request.

Attaching some sample logs to provide some context. This was what led us to that conclusion.

CA Logs:

  • 23:42:59.970791 1 auto_scaling_groups.go:438] Instance group <> has only 3 instances created while the requested count is 4. Creating placeholder instances.
  • 23:43:00.248897 1 auto_scaling_groups.go:481] ASG <> scaling failed with {AutoScalingGroupARN: "<>", AutoScalingGroupName: "<>",
  • 23:43:00.248971 1 auto_scaling_groups.go:446] Instance group <> cannot provision any more nodes!
  • 23:43:50.698297 1 clusterstate.go:1008] Found 1 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup <>
  • 23:43:50.698330 1 clusterstate.go:1026] Failed adding 1 nodes (1 unseen previously) to group <> due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
  • 23:43:50.698422 1 static_autoscaler.go:692] Deleting 1 from <> node group because of create errors
  • 23:43:50.698440 1 auto_scaling_groups.go:297] instance i-placeholder- is detected as a placeholder, decreasing ASG requested size instead of deleting instance
  • 23:43:50.698446 1 auto_scaling_groups.go:248] Setting asg <> size to 3

AWS console:

  • 23:41:33 - 23:41:33. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:41:43 - 23:41:43. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:41:53 - 23:41:53. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:42:03 - 23:42:03. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:43:07 - 23:43:23. Successful - launch of new EC2 instance in response to difference between desired and actual
  • 23:43:53 - 23:45:35. Successful - User request (CA) set group desired capacity changing the desired capacity from 4 to 3.

Right before the successful launch at 23:43:07, the CA checks the status of the node group (which has a FAILED status at 23:43:00) and performs a cleanup through the early-abort feature. It does look like this causes a race between the ASG finishing the scale-up and the CA cleaning up placeholder instances.

@jlamillan
Contributor

jlamillan commented Feb 7, 2024

@apy-liu or @theintz - curious whether either of you has looked into using termination policies or scale-in protection, as @gicuuu3 suggested, as an alternative way to resolve this issue?

@apy-liu

apy-liu commented Feb 12, 2024

Yes, we're looking into the suggested mitigations, though we're still hoping to reach a resolution through one of the proposed fixes.

@apy-liu

apy-liu commented Mar 13, 2024

Hi @gjtempleton, this was the issue we raised during the SIG meeting. We'd be interested to hear your feedback on it and on the suggested PR to fix it, or on a different approach, as we're still looking for a long-term solution. Thanks

towca added the area/provider/aws label Mar 21, 2024
@kmsarabu

kmsarabu commented Apr 2, 2024

Hi @gjtempleton, could you please outline the next steps and recommend the best approach to address this issue? Additionally, when requesting a backport, is there any restriction on the range of older versions to which a fix can be backported? Thanks

@apy-liu

apy-liu commented Apr 11, 2024

Hi @gjtempleton, thank you for providing your feedback at this week's SIG meeting. It sounds like the proposed solution would be adding a check before cleaning up the placeholder instance and updating the internal asgCache to reflect the new instance if it does come up.

You had mentioned drafting a PR for the above, so I wanted to check where we could follow along. I also wanted to check on the question @kmsarabu posed above about the range of older versions for which backports are supported. Thanks!
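
For anyone following along, here is a rough sketch of the kind of check being discussed, i.e. looking at the most recent scaling activity for the group before shrinking it on account of a placeholder. This is an assumption about the approach expressed with aws-sdk-go v1, not the contents of the eventual PR; the ASG name is a placeholder:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// lastScalingActivitySucceeded reports whether the most recent scaling
// activity on the ASG completed successfully, i.e. an instance may have
// launched after the scale-up was written off as failed.
func lastScalingActivitySucceeded(svc *autoscaling.AutoScaling, asgName string) (bool, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int64(1),
	})
	if err != nil {
		return false, err
	}
	if len(out.Activities) == 0 {
		return false, nil
	}
	return aws.StringValue(out.Activities[0].StatusCode) == "Successful", nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	ok, err := lastScalingActivitySucceeded(svc, "example-node-group-asg") // placeholder
	if err != nil {
		log.Fatal(err)
	}
	if ok {
		fmt.Println("recent scale-up succeeded: refresh the cache instead of decrementing the desired capacity")
	} else {
		fmt.Println("no recent successful activity: proceed with the placeholder cleanup")
	}
}
```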

@drmorr0
Contributor

drmorr0 commented May 16, 2024

@gjtempleton I am uncertain about the proposed fix (checking for recent scaling activity before changing the ASG size). Unfortunately I missed the SIG meeting where this was discussed, but I think this doesn't solve the problem; it just narrows the window in which the race can occur (see my comment on #6818).

I think the ideal way to handle this (and how I've seen it done in the past) is to enable scale-in protection on the ASGs (as mentioned by @gicuuu3). This is the only way I know of to safely change the size of an ASG without accidentally terminating running nodes.

I will try to join the SIG meeting this Monday to discuss further.

@daimaxiaxie
Contributor

The problem is not limited to AWS; it also exists on other clouds, for example AliCloud. The root of the problem is that scaling activities on the cloud are asynchronous.

ravisinha0506 added a commit to ravisinha0506/autoscaler that referenced this issue Jun 7, 2024
This merge resolves an issue in the Kubernetes Cluster Autoscaler where actual instances within AWS Auto Scaling Groups (ASGs) were incorrectly decommissioned instead of placeholders. The updates ensure that placeholders are exclusively targeted for scaling down under conditions where recent scaling activities have failed. This prevents the accidental termination of active nodes and enhances the reliability of the autoscaler in AWS environments.

Fixes kubernetes#5829
ravisinha0506 added a commit to ravisinha0506/autoscaler that referenced this issue Jun 10, 2024
This change expands on pr kubernetes#6818

This merge resolves an issue in the Kubernetes Cluster Autoscaler where actual instances within AWS Auto Scaling Groups (ASGs) were incorrectly decommissioned instead of placeholders. The updates ensure that placeholders are exclusively targeted for scaling down under conditions where recent scaling activities have failed. This prevents the accidental termination of active nodes and enhances the reliability of the autoscaler in AWS environments.

Fixes kubernetes#5829
k8s-ci-robot added a commit that referenced this issue Jul 9, 2024
PR#6911 Backport for 1.28: Fix/aws asg unsafe decommission #5829
kmsarabu added a commit to kmsarabu/autoscaler that referenced this issue Jul 10, 2024
kmsarabu added a commit to kmsarabu/autoscaler that referenced this issue Jul 10, 2024
k8s-ci-robot added a commit that referenced this issue Jul 11, 2024
PR#6911 Backport for 1.29: Fix/aws asg unsafe decommission #5829
k8s-ci-robot added a commit that referenced this issue Jul 11, 2024
PR#6911 Backport for 1.30: Fix/aws asg unsafe decommission #5829
rishabh-11 added a commit to gardener/autoscaler that referenced this issue Oct 23, 2024