
Ignore topology.ebs.csi.aws.com/zone label for Balance-Similar-NodeGroups #3230

Closed
TBBle opened this issue Jun 18, 2020 · 25 comments · Fixed by #4458

Comments

@TBBle

TBBle commented Jun 18, 2020

The topology.ebs.csi.aws.com/zone label is added by the AWS EBS CSI driver as a target for Persistent Volume Node Affinity.

If I understand correctly, a Pod that claims such a volume will only be considered by CA for node groups that have this label, but Pods that do not claim such a volume can be scheduled anywhere, so this should not block node-group-similarity in CreateAwsNodeInfoComparator.
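For context, the PVs the driver provisions carry a required node affinity on this label; here's a minimal sketch of such a PV (illustrative names and values, not taken from a real cluster):

```sh
# Sketch: the shape of an EBS CSI PV and its node affinity on
# topology.ebs.csi.aws.com/zone (volume ID and zone are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-ebs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-east-1a
EOF
```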

It probably should be added to buildGenericLabels like failure-domain.beta.kubernetes.io/zone, so that scale-from-zero works correctly, assuming this won't break systems that don't have the AWS EBS CSI driver installed. That would save having to add the label by-hand to the per-AZ ASGs needed when using EBS. I guess it might behave oddly if only some of the node groups run the CSI DaemonSet, but I'm not sure that's a rational/expected setup?

I just saw #2928 and #2493 (which, to be fair, is not listed in the FAQ's args list), so I'd understand if the call is to leave this to user config, and just document it on the AWS page for AWS EBS CSI users. It would still need to be manually added to the ASG tags for the scale-from-zero case.

I'm on k8s 1.15 so unless this or #2493 is backported a long way, it's not going to affect me immediately anyway.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2020
@TBBle
Author

TBBle commented Sep 16, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2020
@bcha

bcha commented Sep 23, 2020

Any progress or ETA on this? After some initial troubleshooting I think this label is the cause of the balancing issues that we've been seeing in our EKS clusters.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2020
@TBBle
Author

TBBle commented Dec 22, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2020
@westernspion

Bump.

This is rather inconvenient, but my workaround is to manually provision the first node of an ASG that's set to 0.

@westernspion

> That would save having to add the label by-hand to the per-AZ ASGs needed when using EBS

Does this mean there is a way to manually label an ASG for the cluster-autoscaler to consider? I'm not using EKS node groups (to get around the 1-node minimum).

@TBBle
Author

TBBle commented Feb 5, 2021

@westernspion Yes, that's #2493 that I mentioned in the original request.

From Autoscaler 1.19, there is a --balancing-ignore-label parameter; you can see an example of its use (for precisely this issue) at #3515 (comment), and the equivalent of this ticket is

--balancing-ignore-label=topology.ebs.csi.aws.com/zone

If you're using this for multiple labels, and deploying with Helm, I believe you'll hit #3673, but adding just a single label to ignore should work fine with Helm.
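With the Helm chart, the single-label case looks something like this (a sketch; release and cluster names are placeholders, and it assumes the chart's extraArgs map, whose key-value shape is exactly why repeating the flag hits #3673):

```sh
# Sketch: passing the flag through the cluster-autoscaler Helm chart.
# extraArgs is a map, so the flag can only appear once this way (#3673).
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --set autoDiscovery.clusterName=my-cluster \
  --set extraArgs.balance-similar-node-groups=true \
  --set "extraArgs.balancing-ignore-label=topology.ebs.csi.aws.com/zone"
```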

@westernspion

westernspion commented Feb 5, 2021

I believe --balancing-ignore-label=topology.ebs.csi.aws.com/zone solves a different problem.

I've tested the ignore label with v1.19.x and v1.20.0 of the controller and it has no effect on the scheduling (I still get volume affinity errors for my StatefulSet after its ASG reduces to 0 and I scale up again). I reviewed the source and this makes sense: the flag controls which labels are ignored when comparing the autoscaling groups under consideration, not whether affinities/selectors on resources such as PVs in the cluster are respected.

So the autoscaler is intelligent enough to consider nodeAffinity on a PV, but it is unable to resolve that information from the autoscaling group. In the case of the EBS CSI provisioner, the affinity is to a node label of topology.ebs.csi.aws.com/zone, but there's no reason this label couldn't be arbitrary.

Looking through the code (I haven't seen documentation for this), the autoscaler appears to treat any ASG tag prefixed with 'k8s.io/cluster-autoscaler/node-template/label/' as a custom node label to resolve.

I'm going to test this out, as it appears to answer my original question: "That would save having to add the label by-hand to the per-AZ ASGs needed when using EBS".
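Something like this, presumably (a sketch; the ASG name and zone value are placeholders):

```sh
# Sketch: expose the label to the autoscaler for scale-from-zero by
# tagging the ASG. CA reads the tag from the ASG itself, so it does not
# need to propagate to launched instances.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-per-az-asg,ResourceType=auto-scaling-group,PropagateAtLaunch=false,Key=k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone,Value=us-east-1a"
```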

@westernspion

k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone works like a charm for my needs.

It would be nice to have this label automatically generated from the ASG zone list.

@TBBle
Author

TBBle commented Feb 6, 2021

@westernspion Oh, I'm sorry. I totally misread your original question, and also had forgotten the buildGenericLabels part of my original request. >_<

The documentation for the scale-from-zero ASG Tags -> Node Labels mapping is in the AWS-specific documentation, and is also talked about in the scale to 0 FAQ entry.

You are correct, that part of this feature request is not user-doable using #2493, and I see now that I had noted that in the original request: "It would still need to be manually added to the ASG tags for the scale-from-zero case."

It's been a while, but as best I recall, I was not using managed node groups, but using eksctl 'unmanaged' node groups, and so had manually added that tag to the nodegroup definition, since I was defining a nodegroup per-AZ anyway.

It is a little problematic to include in the buildGenericLabels code: if the EBS CSI DaemonSet is only deployed to a subset of nodes for some reason, then a Pod that required that label (i.e. with a PVC for an existing EBS volume) could trigger a spin-up from 0 of a nodegroup that was excluded from the EBS CSI DaemonSet, and would not be able to schedule until another node was spun up.

The complexity here is because, for whatever reason, the EBS CSI DaemonSet didn't just use the standard k8s topology labels, although I guess in non-EKS setups, they might not be appropriate?

So adding it to the default-ignored list seems pretty safe, but can also be done by the user as --balancing-ignore-label=topology.ebs.csi.aws.com/zone.

And I'm not sure of a good way to ensure this EBS CSI-specific label is visible to CA on scaled-to-zero setups, without either some significant assumptions being made, or direct administrator action.

Edit: I opened kubernetes-sigs/aws-ebs-csi-driver#729 in case we can solve this completely by (optionally) making EBS CSI use a standard node label instead of its own label.

@TBBle
Author

TBBle commented Mar 29, 2021

Good news. Per kubernetes-sigs/aws-ebs-csi-driver#729 (comment), the AWS EBS CSI driver 0.10 release will include changes that fix this use-case, along with the use of --balancing-ignore-label=topology.ebs.csi.aws.com/zone.

Basically, the CSI driver will prefer the well-known zone label to the above driver-specific label, and ensure both are applied identically in its CSI API; this means that although topology.ebs.csi.aws.com/zone will still exist, and be honoured by the csi-translation-lib (pending it being updated), the rest of the system can use the well-known zone label that Cluster Autoscaler already knows about.
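Once that release is rolled out, this should be easy to confirm: both zone labels should report the same value on every node running the driver (a sketch):

```sh
# Sketch: show both zone labels side by side for all nodes.
kubectl get nodes \
  -L topology.kubernetes.io/zone \
  -L topology.ebs.csi.aws.com/zone
```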

Once topology.ebs.csi.aws.com/zone is dropped completely, the --balancing-ignore-label=topology.ebs.csi.aws.com/zone can be dropped too.

This means that this proposed change (or the local config, if CA maintainers so prefer) is sufficient to allow correct behaviour with the AWS EBS CSI driver, i.e. no overly-ambitious balanced node groups which scale up but cannot schedule a Pod with an EBS volume, and no falsely-distinguished node groups when an EBS volume is not present.

That doesn't close this ticket, because I still think it would be nice if this label was automatically ignored on AWS, as originally suggested. The old label may be around for a while, and the AWS EBS CSI driver is going to become much more widespread once AWS starts deploying it by default on EKS and k8s removes the in-tree driver (removal targeted for 1.21, as of the 1.17 release timeframe).

@dlaidlaw

dlaidlaw commented May 5, 2021

In my not-so-humble opinion, both the eks.amazonaws.com/sourceLaunchTemplateId and topology.ebs.csi.aws.com/zone labels should always be ignored by the AWS CloudProvider portion of the cluster-autoscaler code. The solution of using an additional CLI arg is not ideal, since it requires adding settings to the cluster-autoscaler's command-line args that should be defaults.

AWS customers are very likely to discover the hard way that running the AWS CSI Driver breaks the cluster-autoscaler's --balance-similar-node-groups feature. And if any of those AWS customers are running a cluster older than 1.19, they are completely out of luck. Most will probably be like me: adopt the AWS CSI Driver, then discover that the cluster-autoscaler no longer balances similar nodegroups (hopefully without serious production issues), and then start searching for why. If they are lucky, they will end up here.

It is far better to just do the right thing at the outset, and have the cluster-autoscaler ignore the labels that break it. Or maybe even use a whitelist of labels instead of blacklisting some. A whitelist would probably solve this issue and the ones that would arise in the future as other software adds labels to nodes. A good start for labels to whitelist would be:

  • kubernetes.io/arch
  • topology.kubernetes.io/region
  • kubernetes.io/os
  • node.kubernetes.io/instance-type

The eks.amazonaws.com/sourceLaunchTemplateId label is added to nodes by AWS managed node groups. It too breaks the cluster-autoscaler's --balance-similar-node-groups if the nodegroups do not all use the exact same launch template, regardless of whether the templates are similar.
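Concretely, until the defaults change, every affected cluster has to carry something like this in its cluster-autoscaler invocation (a sketch; the flags are the real CA flags, and --balancing-ignore-label can be repeated, but everything else is illustrative):

```sh
# Sketch: the per-cluster workaround that this comment argues should be
# the default behaviour on AWS.
cluster-autoscaler \
  --cloud-provider=aws \
  --balance-similar-node-groups \
  --balancing-ignore-label=topology.ebs.csi.aws.com/zone \
  --balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateId
```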

@TBBle
Author

TBBle commented May 5, 2021

Just so I understand: your suggestion is that instead of having a list of labels to ignore when considering nodegroups for balancing (as we have now, a combination of generic and AWS-specific), there should be a list of labels to consider for differentiation, and other labels would not be considered unless explicitly added, to say "the value of this label must be the same for the nodegroups to be considered the same"?

I ask because your example list there seems to mix "must differentiate" (e.g., node.kubernetes.io/instance-type) with "never differentiate" (e.g., topology.kubernetes.io/zone), and also "may differentiate, depending on the cluster admin" (e.g., kubernetes.io/arch).


I'm somewhat wary of ignoring eks.amazonaws.com/sourceLaunchTemplateId by default, because there are plenty of things you could do in a Launch Template that would make the nodegroups non-interchangeable, but I haven't actually played with custom launch templates in EKS, so I'm unaware if there's a reason you cannot use the same launch template for different nodegroups.

Adding a label to the default list is basically saying "no value we find here will affect whether this node is the same as some other node", whereas "different launch templates produce interchangeable nodes" seems more like a cluster-admin commitment than a generically true statement. But I am aware that I may be wrong about this.

@dlaidlaw

dlaidlaw commented May 6, 2021

Yes, that is it exactly. Sorry about the topology.kubernetes.io/zone in the list; you are correct that that one should not be in a whitelist of labels. I will edit my list above to keep that clear.

What this approach is designed to do is to keep the cluster-autoscaler working even as new labels are added to nodes. The new labels may be added manually, or via controllers, operators, or deployment automation for any number of reasons. I would prefer if that did not break the cluster-autoscaler when it happens.

I could live with eks.amazonaws.com/sourceLaunchTemplateId being in a whitelist; I can see how that would be useful in some circumstances. For our use cases, the ID of the launch template is not a differentiator for autoscaling. Similarly, I can see how kubernetes.io/arch could be a problem for others. So it would definitely be good to be able to tweak the list, with options to add labels and also to remove some of the default labels.

@TBBle
Author

TBBle commented May 6, 2021

It's an interesting idea. I don't expect it'll see much traction here, so might be worth opening as a feature request/change suggestion if you want to pursue it further.

@k8s-triage-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 4, 2021
@TBBle
Author

TBBle commented Aug 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 4, 2021
@ArchiFleKs

ArchiFleKs commented Sep 2, 2021

When using a custom launch template, more labels are added.

I guess these should also be ignored for this to work?

I see there is some work done here: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/aws_nodegroups.go#L30

Is this where we should add the ignored label set?

For example, here is an EKS managed node group with a custom launch template:

Labels:         arch=arm64
                beta.kubernetes.io/arch=arm64
                beta.kubernetes.io/instance-type=t4g.medium
                beta.kubernetes.io/os=linux
                eks.amazonaws.com/capacityType=ON_DEMAND
                eks.amazonaws.com/nodegroup=default-c-20210828222002582300000005
                eks.amazonaws.com/nodegroup-image=ami-037d1e5f9638c6c6f
                eks.amazonaws.com/sourceLaunchTemplateId=lt-0d60b9506f387d2d6
                eks.amazonaws.com/sourceLaunchTemplateVersion=1
                failure-domain.beta.kubernetes.io/region=eu-west-1
                failure-domain.beta.kubernetes.io/zone=eu-west-1c
                kubernetes.io/arch=arm64
                kubernetes.io/hostname=ip-10-0-84-153.eu-west-1.compute.internal
                kubernetes.io/os=linux
                network=private
                node.kubernetes.io/instance-type=t4g.medium
                size=medium
                topology.ebs.csi.aws.com/zone=eu-west-1c
                topology.kubernetes.io/region=eu-west-1
                topology.kubernetes.io/zone=eu-west-1c

I have 3 similar node pools like this, but in addition to the CSI label, these might also differ:

                eks.amazonaws.com/nodegroup-image=ami-037d1e5f9638c6c6f
                eks.amazonaws.com/sourceLaunchTemplateId=lt-0d60b9506f387d2d6
                eks.amazonaws.com/sourceLaunchTemplateVersion=1

@TBBle
Author

TBBle commented Sep 3, 2021

That would be up to the local administrator to use with --balancing-ignore-label, as they are best placed to know whether their different launch templates lead to materially different nodes, as far as balancing is concerned.

The defaults should be very conservative, and only include labels that can never be material to the equivalence of the nodes for balancing purposes.

Particularly because I don't think the command-line API for this lets you override an existing default, so if eks.amazonaws.com/nodegroup-image were added to the defaults, it'd be impossible for the local administrator to specify that this label does matter for balancing.

@elmiko
Contributor

elmiko commented Nov 3, 2021

fwiw, this is something we are running into with the Cluster API provider, as AWS is one of the possible infrastructure providers for CAPI. given the discussion that has happened in kubernetes-sigs/aws-ebs-csi-driver#729, i wonder if we couldn't make a patch to ignore this label as part of the "well-known labels" until the aws-ebs-csi-driver has completed deprecating the old value in favor of the common one (topology.kubernetes.io/zone)?

edit:
this breaks all scale-ups on CAPI using AWS with --balance-similar-node-groups, as the nodes do not properly balance between the groups.

elmiko added a commit to elmiko/cluster-autoscaler-operator that referenced this issue Nov 4, 2021
when balancing similar nodes, the autoscaler will look for differences
in labels as one of its checks for similarity. by default it will ignore
the well known kubernetes zone label (topology.kubernetes.io/zone), but
the aws-ebs-csi-driver will add a unique label for its zone,
`topology.ebs.csi.aws.com/zone`. this label will eventually be
deprecated in favor of the common well known topology label. for more
information about this see the references.

ref: kubernetes/autoscaler#3230
ref: kubernetes-sigs/aws-ebs-csi-driver#729
elmiko added a commit to elmiko/kubernetes-autoscaler that referenced this issue Nov 10, 2021
This change adds the aforementioned label to the list of ignored labels
in the AWS nodegroupset processor. This change is being made in response
to the addition of this label by the aws-ebs-csi-driver. This label will
eventually be deprecated by the driver, but its use will prevent AWS
users from properly balancing similar nodes. Also adds unit test for the
AWS processor.

ref: kubernetes#3230
ref: kubernetes-sigs/aws-ebs-csi-driver#729
@TBBle
Author

TBBle commented Nov 16, 2021

I'm glad to see #4458, and it addresses the title of the issue, but there's still the issue in the second paragraph, which can't be worked around via the CA command line:

> It probably should be added to buildGenericLabels like failure-domain.beta.kubernetes.io/zone, so that scale-from-zero works correctly, assuming this won't break systems that don't have the AWS EBS CSI driver installed. That would save having to add the label by-hand to the per-AZ ASGs needed when using EBS. I guess it might behave oddly if only some of the node groups run the CSI DaemonSet, but I'm not sure that's a rational/expected setup?

However, unlike the label-ignore change, I'm not as certain it's the right fix, and it's probably not as problematic anyway, since it only affects scale-from-zero; after that, balancing should now work correctly. (And it can be worked around by adding AWS tags to the ASG if needed anyway.)

Perhaps it needs to be a new issue with more consideration. It's also possible I just misdiagnosed it. I honestly haven't thought about it since #3230 (comment), when it was shown useful.

@elmiko I'm not sure if CAPI has the same problem or not, but a quick look suggests CAPI doesn't yet support scale-from-zero anyway.

@elmiko
Contributor

elmiko commented Nov 16, 2021

@TBBle i think you are correct that there could be an underlying issue with scale from zero; i haven't tested it thoroughly yet. that said...

> I'm not sure if CAPI has the same problem or not, but a quick look suggests CAPI doesn't yet support scale-from-zero anyway.

it does not yet, but i am currently working on a patch for it. i will try to play around with the zone labels when i am testing it, perhaps i can find the issue you are talking about.

@TBBle
Author

TBBle commented Nov 17, 2021

Sounds good. To avoid confusion, that should probably be a separate ticket to isolate just that behaviour.

@elmiko
Contributor

elmiko commented Nov 17, 2021

> To avoid confusion, that should probably be a separate ticket to isolate just that behaviour.

agreed

piotrnosek pushed a commit to piotrnosek/autoscaler that referenced this issue Nov 30, 2021
This change adds the aforementioned label to the list of ignored labels
in the AWS nodegroupset processor. This change is being made in response
to the addition of this label by the aws-ebs-csi-driver. This label will
eventually be deprecated by the driver, but its use will prevent AWS
users from properly balancing similar nodes. Also adds unit test for the
AWS processor.

ref: kubernetes#3230
ref: kubernetes-sigs/aws-ebs-csi-driver#729
MaxRink pushed a commit to MaxRink/autoscaler that referenced this issue Feb 14, 2022
This change adds the aforementioned label to the list of ignored labels
in the AWS nodegroupset processor. This change is being made in response
to the addition of this label by the aws-ebs-csi-driver. This label will
eventually be deprecated by the driver, but its use will prevent AWS
users from properly balancing similar nodes. Also adds unit test for the
AWS processor.

ref: kubernetes#3230
ref: kubernetes-sigs/aws-ebs-csi-driver#729