Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cluster-autoscaler] CriticalAddonsOnly taint ignored #4097

Closed
fullykubed opened this issue May 23, 2021 · 58 comments · Fixed by #5838
Closed

[cluster-autoscaler] CriticalAddonsOnly taint ignored #4097

fullykubed opened this issue May 23, 2021 · 58 comments · Fixed by #5838
Assignees
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@fullykubed
Copy link

fullykubed commented May 23, 2021

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

1.20.0

Component version:

What k8s version are you using (kubectl version)?:

1.20.5

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"54684493f8139456e5d2f963b23cb5003c4d8055", GitTreeState:"clean", BuildDate:"2021-03-22T23:02:59Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Azure w/ VMSS scaling configured via auto-discovery

What did you expect to happen?:

When I apply taint CriticalAddonsOnly=true:NoSchedule to a node group, it should be respected.

What happened instead?:

This particular taint is not respected during scaling calculations.

How to reproduce it (as minimally and precisely as possible):

Create a node group with this taint and one without, scale a deployment without a matching toleration, notice that the tainted node group will get scaled occasionally.

Anything else we need to know?:

This appears to be a result of this taint filtering logic;

case ReschedulerTaintKey:

I believe this is the problem in this old issue that went unresolved: #2434

After searching the repository, I could not find any references to this value in the core logic. While I may be missing something, I am not sure what the intended purpose of this filter is. The only thing I can find is a reference to a "rescheduler". I think it may be related to this KEP but I'm not sure if its still relevant: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/rescheduling.md

It is hindering my use case in the Azure/AKS b/c that is the only taint that can be added to the default system node pool. Frequently the scaling is very delayed as the autoscaler will scale this node pool incorrectly and repeatedly. My hacky workaround is to just have the autoscaler ignore this node pool, but I am afraid this unexpected behavior will continue to surprise people. Moreover, I'd like to be able to not ignore this pool in the future.

(1) Can it be removed?
(2) If not, can a warning and explanation be added to the FAQ?

@fullykubed fullykubed added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2021
@fullykubed fullykubed changed the title CriticalAddonsOnly taint ignored [cluster-autoscaler] CriticalAddonsOnly taint ignored May 23, 2021
@stevehipwell
Copy link
Contributor

stevehipwell commented Jul 20, 2021

We've just discovered this issue, are there any plans to resolve or document it?

After reading up on the use of CriticalAddonsOnly this behaviour looks to be expected and desired.

@fullykubed
Copy link
Author

fullykubed commented Aug 25, 2021

I think you may have misread. The documentation you linked references a Kubernetes component called the "rescheduler" that does not exist, at least in modern kubernetes distributions. This seems to be a result of that documentation being 5+ years out of date. To my knowledge the modern Kubernetes scheduler never ignores node taints so it does not make sense for the autoscaler to ignore this one taint in particular. It looks to be legacy code built in a time when there were proposals to make something called the "rescheduler" a main Kubernetes component. Since that did not happen, this should be removed.

@pierluigilenoci
Copy link
Contributor

The only official documentation on Critical Add-On that I could find is this:
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/

@fullykubed
Copy link
Author

Yep! There is that, but the scheduler still is not going to ignore node taints even for systen-X-critical priority classes. Ignoring this node taint still does make sense even in that context.

@stevehipwell
Copy link
Contributor

I agree with @pierluigilenoci that the only official documentation found in a google search doesn't even mention the taint by name. There are some kubeadm docs that mention the taint that suggest how it's intended to be used. It is also present in a number of deployment patterns for core cluster components.

AWS EKS uses this taint when starting new nodes, which I suspect is to allow the aws-node CNI pod to start before allowing other workloads to schedule, and then removes it after the node is ready (has a CNI configured). Azure AKS uses this taint permanently on it's system node pool. I personally think that the EKS implementation is "correct", which makes the AKS implementation "incorrect"; although it would worth getting some clarification from the core team or kubeadm team as to the intention of the CriticalAddonsOnly taint. But based on the EKS behaviour and kubeadm docs I think that the current CA behaviour is correct and that CriticalAddonsOnly should be documented as a special case.

@fullykubed
Copy link
Author

fullykubed commented Aug 26, 2021

I appreciate that perspective, but there is no such thing as a "special" taint in Kubernetes. There were in the past proposals to make CriticalAddonsOnly have certain behavior with respect to a component that does not exist called the "rescheduler." Some cloud providers do use this taint as a convention for marking nodes that should only hold system critical pods, but this is not a part of any current specification and the cloud providers are free to do whatever they want with their taints.

To me, it does not logically follow that "no current documentation for this taint exists" leads to "AKS implements this taint incorrectly" leads to "the current CA behavior is correct." While I agree that I think the AKS team has made some interesting decisions here, I am not sure its fair to call their implementation incorrect.

EKS does not get to dictate the Kubernetes spec, and the CA already provides flags for ignoring taints such as CriticalAddonsOnly if you so choose. The current CA behavior of providing special undocumented behavior for this taint with no means to turn it off is not only hostile to those of us who do not use EKS, but it makes no sense to do this when you can very easily ignore taints via the existing CA flags.

I will continue to advocate that this legacy code path should be removed.

@stevehipwell
Copy link
Contributor

@jclangst I think there is some confusion around the use of this taint and it's association with the "rescheduler", which I read to be an additive improvement to an existing "special" taint. I'm also not sure if the EKS implementation of CriticalAddonsOnly is in EKS or something that they get from kubeadm (the taint is in the kubeadm documentation). RE Azure, it's "incorrect" in as much as it means that CA has never worked on these nodes and it's is in conflict with all other documented uses of this taint.

For the specifics of the CA check on this taint, the important consideration is would changing the code change the behaviour. I suspect it would for any K8s cluster where the taints are calculated directly from the nodes and the cluster is using a temporary CriticalAddonsOnly taint as part of it's lifecycle. It still might be worth removing the current default behaviour if the core team can confirm that the CriticalAddonsOnly taint isn't "special" as long as the documentation is updated and the code can be released under a new major version.

@fullykubed
Copy link
Author

fullykubed commented Aug 26, 2021

@stevehipwell You are making the claim that there is an "existing special taint." Kubernetes has specific documentation for all of common taints which you can find here; the taint is not listed in the spec. I have also done the due diligence of searching the source code for ALL of the core Kubernetes components (including kubeadm and kube-scheduler) and there is no indication that this taint is used. I highly encourage you to do a thorough investigation yourself. I am really not sure what more I can provide that would change your mind.

That said, I do appreciate the effort you have put into searching for the deployment patterns. This is a holdover from old (5+ years) versions of Kubernetes that DID use this taint (for the rescheduler proposal), I have linked to that proposal in a previous comment, and I encourage you to review the sequence of events yourself.

I definitely agree with your assessment that its possible that the CA has never worked on these AKS nodes. But with the context of all of the other evidence, I do not feel this is a bug with AKS but rather a bug with CA.

For the specifics of the CA check on this taint, the important consideration is would changing the code change the behaviour. I suspect it would for any K8s cluster where the taints are calculated directly from the nodes and the cluster is using a temporary CriticalAddonsOnly taint as part of it's lifecycle. It still might be worth removing the current default behaviour if the core team can confirm that the CriticalAddonsOnly taint isn't "special" as long as the documentation is updated and the code can be released under a new major version.

I think this is a cogent consideration. Fortunately, no documentation would need to be changed as the behavior is completely undocumented, hence why my teams ran into unexpected behavior in the first place. Given that the current versions of Kubernetes do not use this taint (again, please feel free to review the source), it is not in the official Kubernetes spec, and its undocumented behavior of the CA, I don't see how this could be anything other than a bug. I do not see the issue of putting bugfixes into current version (and even backporting the bugfix to previous versions).

That said, if you can provide an example of how this may be disruptive, I definitely want to make sure your use cases are considered. It could be that I am simply not understanding how this would affect your setup. What I can say is that the current behavior is negatively impacting the setups of most AKS customers, myself included.

@fullykubed
Copy link
Author

From what I am understanding, you are worried about the following sequence of events:

(1) EKS adds a temporary taint with CriticalAddonsOnly
(2) The CA picks that node as the representative for the node group and assumes all nodes in the group have the taint
(3) The CA then doesn't use those nodes as potential candidates for pod scheduling during the scale up algo even though on EKS the taint will be removed eventually

Is this a fair assessment of your concern?

@stevehipwell
Copy link
Contributor

@jclangst I'm coming at this from a similar position as you as I'm building both EKS and AKS clusters as part of a single platform and the lack of consistency in this area is painful; we got hit with this on both EKS & AKS due to trying to align system node pool taints. I have reviewed the Kubernetes code and like you can't find any references within the code, but I have been able to find a number of places such as Helm charts where it's treated as a special case.

From what I am understanding, you are worried about the following sequence of events:

(1) EKS adds a temporary taint with CriticalAddonsOnly
(2) The CA picks that node as the representative for the node group and assumes all nodes in the group have the taint
(3) The CA then doesn't use those nodes as potential candidates for pod scheduling during the scale up algo even though on EKS the taint will be removed eventually

Is this a fair assessment of your concern?

That is my concern and unless this can be addressed it would be a breaking change. I'd love to see the code removed and the correct documentation of CriticalAddonsOnly, even if it was a major version.

@pierluigilenoci
Copy link
Contributor

pierluigilenoci commented Aug 26, 2021

@jclangst I saw the issue you opened on the AKS repository. 👍🏻
Good luck with that. 😜
The answers I usually get from them are "it's not a bug, it's a feature" or "it's in this way by design choice".
Or they ignore me. 🤦🏻
And sadly I'm not joking.

Ie: Azure/AKS#2306
or oauth2-proxy/oauth2-proxy#1231 (comment)
And I have an endless collection of responses from Azure internal support.

@stevehipwell
Copy link
Contributor

@jclangst what's your AKS issue link?

@fullykubed
Copy link
Author

fullykubed commented Aug 26, 2021

I definitely understand and it sucks that EKS and AKS have such different behavior on this front.

What about this solution:

(1) For future versions, remove this behavior
(2) For current and past versions, have the CA actually respect the taint with --respect-taint=CriticalAddonsOnly

This way it would be opt-in and shouldn't affect customers without --respect-taint=CriticalAddonsOnly.

Edit:

Typo

@pierluigilenoci
Copy link
Contributor

@stevehipwell here: Azure/AKS#2513

@fullykubed
Copy link
Author

@pierluigilenoci Well I had to try. Ultimately, we just want our configurations to work regardless of where our cluster is hosted!

@fullykubed
Copy link
Author

Our team's core issue is that currently I can neither tell to the CA to respect the taint nor add additional taints to system nodes in AKS.

@stevehipwell
Copy link
Contributor

@jclangst we're in the same situation as you. In Azure we have made the default system node group single zone and added duplicates for the other zones so the CA should work correctly for workloads with PVCs but until this is resolved either way we have to manually scale these node groups as CA isn't working.

@fullykubed
Copy link
Author

@stevehipwell This is indeed the solution we aligned on as well.

@pierluigilenoci
Copy link
Contributor

pierluigilenoci commented Aug 26, 2021

We have also been trying to align the taint between EKS and AKS for months but it is not possible to date. For cluster-autoscale, we used a mixed approach. We install as Add-On in AKS and we install with Helm/Flux in EKS with the toleration for CriticalAddonsOnly added to the pod. In this way I can address Azure with my favorite motto: "YOU INSTALL IT, YOU CONFIGURE IT, IF IT BREAKS YOU REPAIR IT." 😜

@stevehipwell
Copy link
Contributor

@jclangst it's yet another area where AKS is making work, they should either support a group per zone or allow the default group to be disabled.

@fullykubed
Copy link
Author

fullykubed commented Aug 26, 2021

I agree with all of the above, but I also think the CA is also something that needs to change as it appears to be making assumptions that are really only relevant for EKS. What do you think of the above proposal?

@fullykubed
Copy link
Author

I'd also be very open to just doing (2) for both past and future as long as we make sure its documented appropriately.

@stevehipwell
Copy link
Contributor

@jclangst I'd suggest first implementing option 1 for v1.23.0 as that should be the correct behaviour moving forwards. Then if possible implementing option 2 to be backported as a patch to v1.21 & v1.22 would give a solution for Azure.

@fullykubed
Copy link
Author

Sorry. Closed wrong issue!

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 28, 2022
timuthy added a commit to timuthy/gardener that referenced this issue Jan 11, 2023
This taint is not documented to be involved in any special handling according
to the official Kubernetes documentation, see:
- kubernetes/autoscaler#4097
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/

Gardener as well doesn't use this taint in any way.
gardener-prow bot pushed a commit to gardener/gardener that referenced this issue Jan 12, 2023
…7304)

* Remove system components node selector

* Remove `CriticalAddonsOnly` toleration

This taint is not documented to be involved in any special handling according
to the official Kubernetes documentation, see:
- kubernetes/autoscaler#4097
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/

Gardener as well doesn't use this taint in any way.

* Remove general tolerations from deployments

Custom or user added taints are tolerated with the help of the System-Components-Config webhook
(#7204). In addition, it's not
desired to tolerate any `NoExecute` taint as this will prevent proper [taint based evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions).
Only `DaemonSet`s should tolerate these taints.
timuthy added a commit to timuthy/gardener-extension-networking-calico that referenced this issue Jan 17, 2023
`CriticalAddonsOnly` taint is not documented to be involved in any special handling according
to the official Kubernetes documentation, see:
- kubernetes/autoscaler#4097
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
timuthy added a commit to timuthy/gardener-extension-networking-calico that referenced this issue Feb 14, 2023
`CriticalAddonsOnly` taint is not documented to be involved in any special handling according
to the official Kubernetes documentation, see:
- kubernetes/autoscaler#4097
- https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
@ib-ak
Copy link

ib-ak commented Mar 10, 2023

@pierluigilenoci @gjtempleton @stevehipwell any news on this issue?

@pierluigilenoci
Copy link
Contributor

pierluigilenoci commented Mar 10, 2023

@ib-ak I'm sorry, but I can't help you much.
Like you, I, too, am waiting for this issue to be solved.

@sftim
Copy link
Contributor

sftim commented Apr 8, 2023

BTW, CriticalAddonsOnly is not an official Kubernetes taint; if it were, you'd see it listed in [Well-Known Labels, Annotations and Taints](Well-Known Labels, Annotations and Taints) and it would be prefixed, eg with kubernetes.io/.

For example, nodes can come online tainted as node.cloudprovider.kubernetes.io/uninitialized. That's a real, registered taint and the prefix makes it public. Other organizations can register their own public taints with a prefix they control.

Using a private taint, CriticalAddonsOnly, is unhelpful and cluster operators should try not to do so. On the other hand, cluster-autoscaler is generally available.

I think that cluster-autoscaler's use of this private taint represents an architecture bug.

@vadasambar
Copy link
Member

Came to know about this issue today because of https://kubernetes.slack.com/archives/C09R1LV8S/p1685972225498489

I am not sure if I have fully grasped the details here. I have added it as an item on the agenda for the upcoming sig meeting to decide what to do about this. Once we reach a decision, I would be happy to raise a PR (I am working on 2 other PRs so some delay should be expected).

@vadasambar
Copy link
Member

/assign vadasambar

@vadasambar
Copy link
Member

vadasambar commented Jun 6, 2023

Doing the simplest thing first based on my limited understanding: #5838 (WIP)

@vadasambar
Copy link
Member

Link to agenda for upcoming sig meeting: https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit#bookmark=id.e9v2tiuubvsq

@discordianfish
Copy link
Contributor

My 2 cents: Since it's not an official taint, it shouldn't be treated different than any other taint. Therefor I'd remove it.
That being said, if @stevehipwell is right:

AWS EKS uses this taint when starting new nodes, which I suspect is to allow the aws-node CNI pod to start before allowing other workloads to schedule, and then removes it after the node is ready (has a CNI configured).

...this will cause problems as described above. But frankly this shouldn't be too much of a concern for this project. Might require a breaking/major release but EKS is in the wrong using this taint instead of the official e.g node.kubernetes.io/network-unavailable

@sftim
Copy link
Contributor

sftim commented Jun 7, 2023

If preferred, EKS could also mint a taint such as eks.aws/critical-addons-unavailable. Same for other providers, which helps if the meanings and implications aren't 100% congruent.

@pierluigilenoci
Copy link
Contributor

@discordianfish If you hope cloud providers will use "official" methods, you are naïve and wait forever. 😜
But this thing has to be solved somehow. 🤷🏻‍♂️
Maybe it could be planned to configure a list of custom taints. 💡
So that the default tool uses the correct taints, but you can bend the operation to use custom ones to follow the "creative flair" of the cloud provider developers. 🤔

@discordianfish
Copy link
Contributor

@pierluigilenoci Precisely my point, this project should assume the 'official' methods and leave it to downstream to converge to that (aka break it)

@stevehipwell
Copy link
Contributor

@discordianfish I'm not 100% sure if EKS still uses this taint but I would say they're more correct than AKS who use it as their system node taint. AFAIR the CriticalAddonsOnly was used in an official or semi-official way for the same purpose as EKS is/was using it.

@pierluigilenoci
Copy link
Contributor

@discordianfish The project must assume the "official" methods as default behavior but could give the possibility to use "custom" methods on request (for example, through some configuration).

@stevehipwell between AKS and EKS, they have an internal competition over who does things less orthodox. 🤣

@vadasambar
Copy link
Member

Brought this up in the sig meeting yesterday (12th June 2023). Let's wait and see if we get a response from someone in the community who has historical context on this. Otherwise I will bring this up again in the upcoming sig meeting (Mon, 19th June 2023)

@fullykubed
Copy link
Author

Can you be more specific about what historical context you are looking for? We have already done a deep dive in the current and historical codebases here.

@vadasambar
Copy link
Member

vadasambar commented Jun 13, 2023

@jclangst I basically want someone from CA community with better historical context (than me) comment around why we haven't removed this taint yet and if it's okay to just go ahead and remove it. Based on the community's response, we either remove the taint completely from the code OR have an option to preserve the current behavior while giving an option to change the behavior based on the use-case (maybe using a flag)

@fullykubed
Copy link
Author

@vadasambar To that end, I would also be interested in what is holding us up from choosing a path on this.

@fullykubed
Copy link
Author

Given that the CA already has flags for ignoring certain taints, it seems to me implicit + immutable ignore behavior should be removed. It might generate some very minor work for cloud providers that have relied on the implicit, undocumented behavior, but as long as this is coupled with a major version bump that seems completely reasonable (in my personal opinion).

@MaciekPytel
Copy link
Contributor

Contrary to what comments above say - it's not a private extension, CriticalAddonsOnly was an official Kubernetes taint (to the extent anything was official in those days - it was used by core k8s code which I think was how we defined "official" back in 2017 :) ).
It's not in well-known labels and it doesn't have kubernetes.io/ prefix, because it predates organizing
well-known labels or standardizing on prefix (same reason why CA's own to be deleted taint doesn't use prefix).

This taint was used by 'rescheduler' which very much existed and wasn't just a proposal. I don't remember the details after all those years, but back than pod priority didn't exist and daemonset pods were scheduled directly by DaemonsetController, not by scheduler. IIRC Rescheduler was basically a separate controller doing equivalent of today pod preemption for system pods. Since there was no pod priority back than it had to temporarily taint the node while performing the preemption, otherwise any pod could schedule using the capacity freed by preemption. The only way to guarantee the intended pod would use the capacity was to temporarily taint a node so that non-system pods couldn't schedule there. Once the preemption was done, the taint would be removed. I'm sure I'm getting some details wrong, but I think that was the general idea.
Rescheduler seems to have been removed in 1.11: kubernetes/kubernetes#67687.

So yeah - this is very much legacy. I don't think CA should be required to maintain support for a feature removed from kubernetes back in 2018 and I'd be happy to approve a PR removing it.

Now - the issue brought on sig autoscaling seems to have been triggered by the idea of using this particular taint to mark control plane nodes. I don't think that's consistent with how CriticalAddonsOnly taint was originally intended to be used and I guess that could be considered a private extension (using an "official" taint for a different purpose). I have no idea what platforms out there may be using and whether they will be impacted by the change and it will technically be backward incompatible change, so I'd ask anyone making this change to make sure to include a release note that clearly calls out the backward-incompatible change in behavior.

@sftim
Copy link
Contributor

sftim commented Jun 26, 2023

I'm guessing CriticalAddonsOnly became a private taint when we introduced the registry for well known ones (Apr 20, 2017 according to Git history). We didn't go around and look for legacy uses across the project at that point.

As it's been around so long, I strongly agree on the value of some kind of deprecation process.

@vadasambar
Copy link
Member

vadasambar commented Jul 6, 2023

PR #5838 is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.