CAPI-managed CoreDNS deployment prevents cluster-autoscaler scale-down #5606
Comments
This can happen for sure - I just replicated it on a test cluster. This issue might be a better fit for cluster-autoscaler though? I'm not sure what the approach would be from the Cluster API end for this problem.
I'm not involved with, or even particularly familiar with, either project, but given how elaborate the ruleset cluster-autoscaler operates on is, they surely a) have a reason for implementing the particular condition producing the reported behaviour, and b) removing or otherwise altering it might be a breaking change. Solving this on the KCP side, on the other hand, could be done in a number of ways and - based on my limited understanding - with relatively little effort (e.g. using the cluster-autoscaler.kubernetes.io/safe-to-evict annotation, running CoreDNS as a DaemonSet, introducing a feature flag that'd allow cluster admins to opt out of KCP provisioning it altogether, etc.). If I'm totally off-track with that assessment, pardon me - but either way I do think this is a CAPI issue, not a CA one.
I don't think this can/should be done in KCP, as it adds an odd dependency on the functioning of the autoscaler. Similarly, running CoreDNS as a DaemonSet doesn't seem to make sense, as it would end up running on every node in the cluster - way over-provisioning unless there's additional logic - and it invisibly adds behaviour that depends on autoscaler implementation logic. You can already skip installing CoreDNS and install it your preferred way by using --skip-addons through CAPI - see: kubernetes/kubeadm#2261. Could you raise this issue on the Cluster Autoscaler side and see what the response is?
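For anyone who wants to go that route, a minimal sketch of skipping the CoreDNS addon from a KubeadmControlPlane is below. This is only an illustration: it assumes kubeadm's skipPhases field is exposed through kubeadmConfigSpec.initConfiguration in the CAPI version in use (later comments in this thread suggest that may not yet be plumbed through), and all names, versions and the infrastructure provider are placeholders.

```yaml
# Hypothetical sketch: ask kubeadm to skip the CoreDNS addon phase so the
# deployment can be managed out of band. Names/versions are placeholders.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane
  namespace: default
spec:
  replicas: 3
  version: v1.21.2
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerMachineTemplate   # placeholder infrastructure provider
      name: example-control-plane
  kubeadmConfigSpec:
    initConfiguration:
      # Requires a kubeadm/CAPI version that exposes skipPhases.
      skipPhases:
        - addon/coredns   # CoreDNS is then installed and managed by the cluster admin
```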
Thanks, I wasn't aware of the ability to skip stuff - will check that out. I'd agree that using a DaemonSet would be a bit of a crowbar approach to the problem. I'm not quite sure what I'd report to the cluster-autoscaler people, as it basically works as designed and documented - do you have any suggestions in that regard?
I'm not certain - but this is an interaction between the autoscaler and kubeadm's default behaviour, so they might want it on their radar. For your specific issue - would it be possible to set that annotation with some other mechanism when you're setting up your cluster? That seems to be what the autoscaler expects here.
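For reference, one way to set it after the cluster is up would be a strategic-merge patch against the CoreDNS Deployment's pod template - a minimal sketch, assuming the standard coredns Deployment in kube-system and that nothing later strips the annotation again (the file name is hypothetical):

```yaml
# coredns-safe-to-evict-patch.yaml (hypothetical file name)
# Apply with: kubectl -n kube-system patch deployment coredns --patch-file coredns-safe-to-evict-patch.yaml
spec:
  template:
    metadata:
      annotations:
        # Tells cluster-autoscaler it may evict this kube-system pod during scale-down.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```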
That'll do - I didn't dig too deep, so I thought the way CoreDNS is set up was specific to cluster-api.
I couldn't think of anything that'd allow me to do that in an automation-friendly manner without it being an awful hack, which is why I decided to open this issue. I'll check out the kubeadm phase-skipping mechanism and how to deploy CoreDNS myself instead, and (hopefully) report back with a solution to this problem. Thanks for your help!
kubeadm deploys two replicas by default to allow them to spread across nodes if possible. Long term, kubeadm would like to stop hardcoding the coredns deployment and use an operator, currently alpha and located in the kubernetes-sigs/cluster-addons repository.
It seems that you should be setting this annotation in the pod template if you want to use the autoscaler and this case applies to you. Alternatively, as mentioned, just skip CoreDNS from CAPI and manage it yourself.
There's nothing to do from the autoscaler's PoV - it's just working as designed, and there are the mentioned annotations for users to tweak eviction behaviour. This all seems like expected behaviour to satisfy the kubeadm control plane CoreDNS deployment topology settings. Does CoreDNS need to run on a worker instance though? Can't we enforce that it's spread only across control plane machines, so we reduce the "infra" pod surface on compute targeted for running workloads? cc @randomvariable
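For illustration only, enforcing that kind of placement could look something like the patch below - a sketch, assuming a standard kubeadm-provisioned cluster where control-plane nodes carry the usual node-role.kubernetes.io/control-plane (or legacy master) label and taint:

```yaml
# Hypothetical patch pinning the CoreDNS pods to control-plane nodes.
# Label and taint names assume a standard kubeadm cluster.
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
        - key: node-role.kubernetes.io/master   # legacy taint on older clusters
          effect: NoSchedule
```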
I think it's reasonable to run CoreDNS pods on workload nodes if your cluster only has one control plane node. A compromise could be:
I don't think this issue classifies as
This appears to be an unrelated comment, copied from #5455 (comment). Related issue: #5627
@killianmuldoon Sorry for copy-pasting the text from the other issue, that was not my intent. I meant to say that autoscaler#3196 might be related to this issue, though we might need to do more work on top of that; copying the text from the other place is what created the confusion here. I am new to the k8s community and still learning how things work. I will take care of such things in future, or talk to the team on Slack for further questions/comments.
Hi @sbarikdev, welcome to the community! Don't worry about the above! :) Feel free to ask questions or join in the conversation here. When you're copying from somewhere else, it's a good idea to post the link and comment on why you think it's important. You can also put it in a quote by placing a '>' as the first character of a line of text.
I think that's correct. For AWS, for example, to avoid the 500 QPS per node rate limit, we need to scale out CoreDNS quite a lot (or even do this).
Following up on this, it seems that using |
I can't for the life of me figure out how to implement the
After digging through the CAPI CRD definitions, the corresponding kubeadm commit enabling skips, and even the generated cloud-init config, I circled back to GitHub and all I could find were these two issues: Am I missing something, or is skipping phases not actually supported yet?
/milestone v1.2
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What steps did you take and what happened:
Set up workload cluster, installed metrics-server, set up cluster-autoscaler with cluster-api provider as per docs.
Ran some tests for scaling up and down by adding a Deployment with an arbitrary number of replicas, everything went fine.
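(A Deployment along the following lines reproduces the scale-up/scale-down cycle described here - a hypothetical example, not the exact manifest used; the name, replica count and resource requests are placeholders and just need to exceed the spare capacity of the existing workers.)

```yaml
# Hypothetical load generator used to trigger a scale-up; deleting it (or
# scaling it to zero) exercises the scale-down path this issue is about.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 500m      # sized so the replicas cannot all fit on the current nodes
              memory: 128Mi
```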
Currently, the `spec.replicas` field of the `coredns` deployment is set to 2 - I have a hunch this might be related to running a cluster version upgrade from 1.20 to 1.21 on Friday. Because CoreDNS is run as a Deployment in the kube-system namespace (rather than a DaemonSet), I now have one pod on the control plane node and one on a worker node, the latter satisfying cluster-autoscaler's condition preventing node removal¹.
¹: "Kube-system pods that [...] are not run on the node by default, [...] unless the pod has the following annotation [...]: cluster-autoscaler.kubernetes.io/safe-to-evict: true"
What did you expect to happen:
When removing the workload which triggered the scale-up, the cluster should be able to scale down again. The way in which KCP deploys CoreDNS and/or handles kubernetes version upgrades in that deployment prevents that.
Anything else you would like to add:
CoreDNS deployment info:
Relevant excerpt from cluster-autoscaler's logs:
Environment:
- OS (e.g. from `/etc/os-release`): n/a

/kind bug
/area networking
/area upgrades
/area dependency