Cluster Autoscaler Forgets Nodes Scheduled for Deletion during Restart #5048
Comments
Issue still exists. /remove-lifecycle stale
Still exists and happening multiple times per week for us. /remove-lifecycle stale
@jabdoa2 I am just a random person on GitHub, but I was wondering what version of the Cluster Autoscaler you are using now? You mentioned 1.20.2 when you opened the bug last year. Have you updated since then? We use 1.22.3 ourselves, so I am just wondering if this is something we should keep an eye on as well. Thank you for your information.
We updated to 1.22 by now. The issue still persists. You can work around it by running the autoscaler on nodes which are not scaled by the autoscaler (i.e. a master or a dedicated node group). However, the issue still occurs when those nodes are upgraded or experience disruptions for other reasons. It is still 100% reproducible on all our clusters if you delete the autoscaler within the 10-minute grace period before a node is deleted. We strongly recommend monitoring for nodes which have been cordoned for more than a few minutes; they prevent scale-ups in that node group later on and will cost you money without any benefit. You might also want to monitor for nodes which are not part of the cluster, which have been an issue earlier. However, we have not seen this recently, as the autoscaler seems to remove those nodes after a few hours (if they still have the correct tags).
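For that kind of monitoring, one option is an alert on nodes that stay unschedulable. Below is a minimal sketch as a Prometheus rule, assuming kube-state-metrics is installed (the alert name, the 30-minute threshold, and the labels are made up):

```yaml
# Hypothetical alerting rule: fire when a node stays cordoned
# (spec.unschedulable == true) for longer than 30 minutes.
# Assumes kube-state-metrics, which exports kube_node_spec_unschedulable.
groups:
  - name: node-hygiene
    rules:
      - alert: NodeCordonedTooLong
        expr: kube_node_spec_unschedulable == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} has been cordoned for more than 30 minutes"
```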
This issue can be seen in 1.21.x as well.
As a short-term solution, removing the cordoned node manually fixes the issue.
I wasn't able to reproduce the problem with the steps in the description. I wonder if it happens only sometimes or maybe I am doing something wrong.
For us this happens 100% reliably in multiple clusters. At which step did it behave differently for you?
@jabdoa2 I used slightly different flags in an attempt to perform the test quickly:
Maybe the issue only shows up with the default values, since you are using default values. When I saw the issue on my end, default values were being used as well.
New CA pod was able to delete the node after some time. :(
I noticed this issue happens when the cluster-autoscaler pod tries to scale down the node it's running on: it drains itself from the node and leaves the node in a cordoned and tainted state. The real problem starts when a new cluster-autoscaler pod comes up. It sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node. This disables scale down and puts scale down into cooldown, effectively skipping the code that does the actual scale down until the cooldown is lifted (which will never happen, because the unschedulable pod will never get scheduled on a cordoned and tainted node it can't tolerate).
It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.
One quick fix for this is to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g., maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking draining of the node it is running on.
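As a minimal sketch of that PDB approach (the kube-system namespace and the app=cluster-autoscaler label are assumptions; match them to your actual deployment):

```yaml
# Hypothetical PDB that blocks eviction of the cluster-autoscaler pod,
# so a scale-down drain cannot remove it from the node it runs on.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: cluster-autoscaler
```

Note that a PDB only blocks voluntary evictions through the eviction API; node failures or upgrades that bypass it can still restart the pod.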
Yeah, it reports the node but simply never acts on it. It looks weird, and it can cause a lot of havoc in your cluster when important workloads can no longer be scheduled.
It helps most of the time. You can also run the autoscaler on the master nodes or set
@jabdoa2 a dedicated node group with taints, so that nothing else gets scheduled on it except cluster-autoscaler, should solve the issue (for all cases, I think) until we have a better solution.
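A rough sketch of what that pinning could look like in the cluster-autoscaler Deployment (the dedicated=cluster-autoscaler taint and the matching node label are made-up examples, as is the assumption that the dedicated group is tainted NoSchedule):

```yaml
# Hypothetical Deployment fragment: schedule cluster-autoscaler only onto
# a dedicated, tainted node group that nothing else tolerates.
spec:
  template:
    spec:
      nodeSelector:
        dedicated: cluster-autoscaler
      tolerations:
        - key: dedicated
          operator: Equal
          value: cluster-autoscaler
          effect: NoSchedule
```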
Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-). So it won't happen in the happy case, but when things go south this tends to persist breakage and prevent clusters from recovering (e.g. because you can no longer schedule to a certain AZ).
Sounds like this is the root cause and should be fixed.
Sorry, I am not sure I understand this fully. My understanding is:
Do you see any problem with this approach? (Just trying to understand what I am missing.)
Agreed.
The issue can still happen in other node groups. If a scale down is ongoing and a disruption happens to the current autoscaler, there is a chance that this will happen. You can make those disruptions less likely with either a dedicated node group or by running the autoscaler on the master nodes, but that only reduces the chance. Rolling node groups, upgrading the autoscaler, or node disruptions still trigger this. We have a few clusters which use spot instances and scale a lot, so it keeps happening.
I see the problem with the solution I proposed. Thanks for explaining.
Brought this up in the SIG meeting today. Based on discussion with @MaciekPytel, there seem to be 2 ways of going about fixing this:
We would need another PR on top of #5054, as explained in #5054 (comment), to actually fix the issue.
We have logic for removing all taints and uncordoning the nodes every time cluster-autoscaler restarts, but that is not called when the flag is set. If the flag is removed, taints should be removed from all nodes every time the cluster-autoscaler pod restarts.
Hi @vadasambar, I'm not sure if it solves the issue mentioned, but there was a separate PR #5200, which was merged last year in September. It changed the behavior so that taints should be removed from all nodes instead of only those that were Ready. I've been reviewing the code to check, and the
@fookenc thanks for replying. Looks like we've already fixed the issue in 1.26 :)
You are right. Your PR should fix the issue mentioned in #5048 (comment), i.e., the problem described in the description of this issue. There is an overarching issue around scale up preventing scale down because CA thinks it can schedule pods on an existing node (when it can't, because the node has taints or is cordoned), for which we already have your PR #5054 merged. My understanding is that implementing those interfaces for a specific cloud provider should fix the issue in that cloud provider.
Bug still exists. /remove-lifecycle stale
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.20.2
What k8s version are you using (kubectl version)?:
What environment is this in?:
AWS using Kops
What did you expect to happen?:
When cluster-autoscaler selects a node for deletion, it cordons it and then, after 10 minutes, deletes it under any circumstances.
What happened instead?:
When cluster-autoscaler is restarted (typically due to scheduling), it "forgets" about the cordoned node. We end up with nodes which are unused and no longer considered by cluster-autoscaler. We have seen this happen multiple times in different clusters. It always (and only) happens when cluster-autoscaler restarts after tainting/cordoning a node.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Config:
Log on "old" pod instance:
Logs after the node has been "forgotten" in the new pod instance:
Autoscaler clearly still "sees" the node but it does not act on it anymore.
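For reference, a sketch of what such a leftover node tends to look like, assuming the usual cluster-autoscaler deletion taint (the node name and the timestamp value are placeholders):

```yaml
# Hypothetical view of a node that was selected for scale-down and then
# "forgotten" after a cluster-autoscaler restart: it stays cordoned and
# keeps the deletion taint, so nothing new can schedule onto it.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.example.internal
spec:
  unschedulable: true          # cordoned by cluster-autoscaler
  taints:
    - key: ToBeDeletedByClusterAutoscaler
      value: "1659000000"      # unix timestamp when the taint was added
      effect: NoSchedule
```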