scaling from 0 only scales one of the ASG's detected in spite of balance-similar-node-groups #5352
Comments
We face the same problem, but @dindurthy, the balance-similar-node-groups default value is false; could you try passing true explicitly?
This sounds like a known issue (see attempts to address it in #1021, #2892, #3609, #3608, #3761, #4000, and probably more). Recent cluster-autoscaler versions have means to work around the issue in many cases:
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We encountered the same issue as #4516.
Which component are you using?:
Cluster Autoscaler
What version of the component are you using?:
Component version: v1.22.0
What k8s version are you using (kubectl version)?:
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.15-eks-fb459a0", GitCommit:"be82fa628e60d024275efaa239bfe53a9119c2d9", GitTreeState:"clean", BuildDate:"2022-10-24T20:33:23Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?:
AWS
What did you expect to happen?:
All ASGs that the cluster-autoscaler detects would be used for scale-up from 0 in a balanced way.
What happened instead?:
Even when tagging the ASGs (with propagation) with "k8s.io/cluster-autoscaler/node-template/label/" tags and using the flag "--balance-similar-node-groups=true", the cluster-autoscaler scales up using only one of the ASGs. This makes the workload single-AZ (since each ASG is per AZ), so all ASGs except one remain unused.
Here's an image before/after we set the ASG min to 1 for all ASGs (~9:30)
How to reproduce it (as minimally and precisely as possible):
Create 3 ASGs, each set to 0.
Create pods that must run on instances from those ASGs.
Observe the CA launching instances in only one of the ASGs.
We ran the command like so:
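(The exact command is not reproduced here; purely as an illustration, a cluster-autoscaler invocation with the flags relevant to this issue might look roughly like the sketch below. The cluster name, auto-discovery tags, and expander choice are placeholders, not the values we actually used.)

```
# Hypothetical cluster-autoscaler flags relevant to this issue (values are placeholders):
./cluster-autoscaler \
  --cloud-provider=aws \
  --expander=least-waste \
  --balance-similar-node-groups=true \
  --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```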
The labels on the node are as follows:
Taints:
ASG cluster-autoscaler tags
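(The actual tags are not shown above; for illustration only, scale-from-zero node-template tags on an ASG are usually set along these lines, where the ASG name, label key, and taint are placeholders:)

```
# Hypothetical scale-from-zero template tags on an ASG (names and values are placeholders):
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-asg-az-a,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/workload,Value=batch,PropagateAtLaunch=true" \
  "ResourceId=my-asg-az-a,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=batch:NoSchedule,PropagateAtLaunch=true"
```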
Pod schedules with:
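(The real pod spec is not included above; as a sketch only, a pod targeting such a group typically carries a nodeSelector and toleration matching the placeholder label/taint from the previous example:)

```
# Hypothetical pod targeting the templated label/taint (placeholders, not our real workload):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-batch-pod
spec:
  nodeSelector:
    workload: batch
  tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: app
    image: public.ecr.aws/docker/library/busybox:latest
    command: ["sleep", "3600"]
EOF
```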
Anything else we need to know?:
My first thought is that there are more tags we need to represent on the ASG, but I didn't see clear documentation on which labels are calculated automatically (e.g. kubernetes.io/hostname) and which we need to set explicitly. All the node labels and taints that we manage are set on the ASGs. For an ASG scaled above 0, I understand the CA picks a node and generates a template of labels and taints for the corresponding group. That is highly reliable, because the node is running and is a live representation. For an ASG scaled to zero, we attempt to set some labels and taints on the ASG to help the CA infer what a node would look like if it scaled up, but many labels are set via the EKS bootstrap that we don't directly manage, so maybe the CA isn't interpolating the full list of labels. This is unfortunate because even if we accurately set the ASG with every label that appears on a node today, that is fragile: the list may change unpredictably as EKS changes, and the breakage would be hard for us to detect.
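One way to check that suspicion (a sketch, assuming a live node from one of the ASGs already exists; the node and ASG names are placeholders) is to diff the labels on a running node against the node-template tags on a scaled-to-zero ASG:

```
# Labels that a real node actually carries:
kubectl get node ip-10-0-1-23.ec2.internal --show-labels

# Node-template tags that the CA can read from a scaled-to-zero ASG:
aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=my-asg-az-b" \
  --query "Tags[?starts_with(Key, 'k8s.io/cluster-autoscaler/node-template/')]"
```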
My second thought is that this is specific to the least-waste expander, since that setting is what we have in common with issue #4516.