CAPI-managed CoreDNS deployment prevents cluster-autoscaler scale-down #5606

Closed
itspngu opened this issue Nov 8, 2021 · 23 comments
Labels
  • area/dependency: Issues or PRs related to dependency changes
  • area/networking: Issues or PRs related to networking
  • area/upgrades: Issues or PRs related to upgrades
  • kind/bug: Categorizes issue or PR as related to a bug
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed
Milestone: v1.2

Comments

@itspngu

itspngu commented Nov 8, 2021

What steps did you take and what happened:

Set up a workload cluster, installed metrics-server, and set up cluster-autoscaler with the cluster-api provider as per the docs.

Ran some tests for scaling up and down by adding a Deployment with an arbitrary number of replicas; everything went fine.
Currently, the spec.replicas field of the coredns deployment is set to 2 - I have a hunch this might be related to running a cluster version upgrade from 1.20 to 1.21 on Friday.

Because CoreDNS runs as a Deployment in the kube-system namespace (rather than a DaemonSet), I now have one pod on the control plane node and one on a worker node; the latter triggers cluster-autoscaler's condition that prevents node removal¹.

¹: "Kube-system pods that [...] are not run on the node by default, [...] unless the pod has the following annotation [...]: cluster-autoscaler.kubernetes.io/safe-to-evict: true"
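
For reference, the annotation in question goes on the pod itself, so for the coredns Deployment it would have to live in the pod template. A sketch of the fragment (nothing kubeadm or KCP sets by default; shown here only to illustrate the condition the autoscaler checks):

# Sketch (fragment of the Deployment dumped below): the pod-template
# annotation described in the cluster-autoscaler FAQ. Not set by default.
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"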

What did you expect to happen:

When the workload which triggered the scale-up is removed, the cluster should be able to scale down again. The way in which KCP deploys CoreDNS and/or handles Kubernetes version upgrades in that deployment prevents that.

Anything else you would like to add:

CoreDNS deployment info:

# $ kubectl -n kube-system get deployments/coredns -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-11-04T13:51:08Z"
  generation: 1
  labels:
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
# [...]
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - args:
        - -conf
        - /etc/coredns/Corefile
        image: k8s.gcr.io/coredns:1.7.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/coredns
          name: config-volume
          readOnly: true
      dnsPolicy: Default
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: coredns
      serviceAccountName: coredns
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: Corefile
            path: Corefile
          name: coredns
        name: config-volume
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2021-11-04T13:51:08Z"
    lastUpdateTime: "2021-11-04T13:51:52Z"
    message: ReplicaSet "coredns-74ff55c5b" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-11-05T10:22:57Z"
    lastUpdateTime: "2021-11-05T10:22:57Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

Relevant excerpt from cluster-autoscaler's logs:

I1108 11:37:12.118398       1 cluster.go:148] Fast evaluation: playground-stage-md-0-q6cd9 for removal
I1108 11:37:12.118532       1 cluster.go:169] Fast evaluation: node playground-stage-md-0-q6cd9 cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-74ff55c5b-664d5

Environment:

  • Cluster-api version: v0.4.4
  • Minikube/KIND version: n/a
  • Kubernetes version: v1.21.6 (workload) & v1.21.5-gke.1802 (management)
  • OS (e.g. from /etc/os-release): n/a

/kind bug

/area networking
/area upgrades
/area dependency

@k8s-ci-robot added the kind/bug, area/networking, area/upgrades and area/dependency labels on Nov 8, 2021.
@killianmuldoon (Contributor)

This can happen for sure - I just replicated it on a test cluster. This issue might be a better fit for the cluster autoscaler, though? I'm not sure what the approach would be from the Cluster API end for this problem.

@itspngu (Author)

itspngu commented Nov 8, 2021

I'm not involved with or even particularly familiar with either project, but given how elaborate the ruleset cluster-autoscaler operates on is, they surely a) have a reason for implementing the particular condition producing the reported behaviour, and b) removing or otherwise altering it might be a breaking change. Solving this on the KCP side, on the other hand, could be done in a number of ways and - based on my limited understanding - with relatively little effort (e.g. using the cluster-autoscaler.kubernetes.io/safe-to-evict annotation, running CoreDNS as a DaemonSet, or introducing a feature flag that would allow cluster admins to opt out of KCP provisioning it altogether).

If I'm totally off-track with that assessment, pardon me - but either way I do think this is a CAPI issue, not a CA one.

@killianmuldoon (Contributor)

killianmuldoon commented Nov 8, 2021

> cluster-autoscaler.kubernetes.io/safe-to-evict

I don't think this can/should be done in KCP as it adds an odd dependency on the functioning of the autoscaler.

Similarly, running coredns as a DaemonSet doesn't seem to make sense, as it would end up running on every node in the cluster, which is way over-provisioning unless there's additional logic. And this invisibly adds behaviour that depends on autoscaler implementation logic.

You can already skip installing coredns, and then install it in your preferred way, by using --skip-addons through CAPI - see kubernetes/kubeadm#2261.
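
For reference, the kubeadm-side shape of that skip looks roughly like the sketch below; the skipPhases field and the addon/coredns phase name assume a kubeadm version that supports phase skipping in InitConfiguration (older releases only expose --skip-phases on the CLI):

# Sketch: skipping the CoreDNS addon at the kubeadm level so it can be
# managed separately. Assumes InitConfiguration supports skipPhases.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
skipPhases:
  - addon/coredns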

Could you raise this issue on the Cluster Autoscaler side and see what the response is?

@itspngu (Author)

itspngu commented Nov 8, 2021

Thanks, I wasn't aware of the ability to skip stuff - will check that out. I'd agree that using a DaemonSet would be a bit of a crowbar approach to the problem.

I'm not quite sure what I'd report to the cluster-autoscaler people, as it basically works as designed and documented - do you have any suggestions in that regard?

@killianmuldoon (Contributor)

I'm not certain - but this is an interaction between the autoscaler and kubeadm's default behaviour, so they might want it on their radar. For your specific issue: would it be possible to set that annotation with some other mechanism when you're setting up your cluster? That seems to be what the autoscaler expects here.
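
As a sketch of "some other mechanism" (out-of-band, after the cluster is up; note KCP reconciles this Deployment during version upgrades, so the change may need re-applying):

$ kubectl -n kube-system patch deployment coredns \
    -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}}}}}'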

@itspngu (Author)

itspngu commented Nov 8, 2021

> this is an interaction between the autoscaler and Kubeadm's default behaviour so they might want it on their radar

That'll do. I didn't dig too deep, so I thought the way CoreDNS is set up was specific to cluster-api.

> set that annotation with some other mechanism when you're setting up your cluster

I couldn't think of anything that'd allow me to do that in an automation-friendly manner without it being an awful hack, which is why I opened this issue. I'll check out the kubeadm phase-skipping mechanism and how to deploy CoreDNS myself instead, and (hopefully) report back with a solution to this problem. Thanks for your help!

@neolit123 (Member)

kubeadm deploys two replicas by default to allow them to spread across nodes where possible.
Users can certainly patch the coredns deployment as they see fit.

Long term, kubeadm would like to stop hardcoding the coredns deployment and use an operator instead; that operator is currently alpha and located in the kubernetes-sigs/cluster-addons repository.

> ¹: "Kube-system pods that [...] are not run on the node by default, [...] unless the pod has the following annotation [...]: cluster-autoscaler.kubernetes.io/safe-to-evict: true"

It seems that you should be setting this annotation in the pod template if you want to use the autoscaler and expect this situation to occur.

Alternatively, as mentioned, just skip coredns from CAPI and manage it yourself.

@enxebre (Member)

enxebre commented Nov 8, 2021

There's nothing to do from the autoscaler's point of view; it's working as designed, and the mentioned annotations allow users to tweak eviction behaviour.

This all seems to be expected behaviour to satisfy the kubeadm control plane's CoreDNS deployment topology settings. Does CoreDNS need to run on a worker instance, though? Can't we enforce that it is spread only across control plane machines, so we reduce the "infra" pod surface on compute targeted for running workloads? cc @randomvariable

@sbueringer (Member)

sbueringer commented Nov 8, 2021

I think it's reasonable to run CoreDNS pods on workload nodes if your cluster only has one control plane node.

A compromise could be preferredDuringSchedulingIgnoredDuringExecution (in case it is not used already; I didn't check).
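
One possible shape of that, sketched against the coredns pod spec dumped earlier (the node-role.kubernetes.io/control-plane label key is an assumption, mirroring the toleration already present there):

      # Sketch: soft preference for control-plane nodes, so replicas only
      # land on workers when control-plane capacity doesn't suffice.
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists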

@k8s-ci-robot added the help wanted label on Nov 9, 2021.
@schrej (Member)

schrej commented Nov 9, 2021

I don't think this issue qualifies as help-wanted. While the initial issue is well written, the solution is still under discussion and there is no clear task laid out, which is a requirement for that label.
/remove-help

@k8s-ci-robot removed the help wanted label on Nov 9, 2021.
@killianmuldoon (Contributor)

> I have kicked of this work with https://kubernetes.io/docs/tasks/administer-cluster/dns-horizontal-autoscaling/
> & kubernetes/autoscaler#3196 but there is definitely room for other people helping in this effort, which hopefully in most cases it is also more straight forward than the linked PR (move, adjust import without additional changes).
> If someone want to help, it will be really useful to made a first pass on packages and propose a list of changes so than this issue can be labeled good-first-issue and the work federated.

This appears to be an unrelated comment, copied from #5455 (comment).

Related issue: #5627

@sbarikdev (comment minimized)

@killianmuldoon (Contributor)

killianmuldoon commented Nov 10, 2021

> The autoscaler has an option --skip-nodes-with-system-pods (default is true), which can be used to change this behavior. By default, pods in kube-system are considered as critical pods, hence non-daemonset pods are not drained.

Copied from #5606 (comment).
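
For completeness, that flag goes on the autoscaler binary itself, roughly as sketched below; note it relaxes the check for all non-DaemonSet kube-system pods, not just coredns:

      # Sketch: cluster-autoscaler container args including the flag above
      # (other args as per the cluster-api provider docs).
      containers:
      - name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --cloud-provider=clusterapi
        - --skip-nodes-with-system-pods=false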

@sbarikdev

sbarikdev commented Nov 11, 2021

@killianmuldoon Sorry for copy-pasting the text from the other issue; that was not my intent. I meant to say that autoscaler#3196 might be related to this issue, though we might need to do more updates as well, but I copied the text from another place, which created the confusion here. I am new to the k8s community and still learning things here. I will take care of such things in the future, or talk to the team on Slack for further questions/comments.

@killianmuldoon (Contributor)

Hi @sbarikdev, welcome to the community! Don't worry about the above! :) Feel free to ask questions or join in the conversation here. When you're copying from somewhere else, it's a good idea to post the link and comment on why you think it's important. You can also put it in a quote

> like this

by placing a '>' as the first character in a line of text.

@randomvariable (Member)

randomvariable commented Nov 11, 2021

> I think it's reasonable to run CoreDNS pods on workload nodes if your cluster only has one control plane node.

I think that's correct. For AWS, for example, to avoid the 500 QPS / node rate limit, we need to scale out CoreDNS quite a lot (or even do this).
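
For reference, the usual way to scale it with cluster size is the cluster-proportional-autoscaler from the Kubernetes dns-horizontal-autoscaling docs; a sketch of its tuning ConfigMap, with purely illustrative numbers and a name that has to match the autoscaler's --configmap flag:

# Sketch: ConfigMap consumed by the cluster-proportional-autoscaler that
# resizes the coredns Deployment with node/core count. Values illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":2}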

@itspngu (Author)

itspngu commented Dec 3, 2021

Following up on this: it seems that using preferredDuringSchedulingIgnoredDuringExecution has been discussed, and they likely want to externalize CoreDNS entirely eventually, so the corresponding issue regarding the node selection policy has been closed. I'll proceed with skipping the CoreDNS setup phase as suggested by @killianmuldoon and see what happens. I'm leaving the issue open for now so I can hopefully report back with the results of that workaround, in case anyone else happens to run into this problem.

@itspngu (Author)

itspngu commented Dec 3, 2021

I can't for the life of me figure out how to implement the skipPhases workaround. It's not part of the KubeadmControlPlane or KubeadmConfig{Template} specs, and server-side validation fails accordingly:

KubeadmControlPlane/infra-stage-cluster/infra-stage-control-plane dry-run failed, error: failed to create typed patch object: .spec.kubeadmConfigSpec.initConfiguration.skipPhases: field not declared in schema

After digging through the CAPI CRD definitions, the corresponding kubeadm commit enabling skips, and even the generated cloud-init config, I circled back to GitHub, and all I could find was these two issues:

#1782
#5357

Am I missing something or is skipping phases not actually supported yet?
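
For anyone finding this later: the shape being attempted was roughly the sketch below. It is only accepted by CAPI releases whose kubeadm bootstrap types actually include skipPhases (v0.4.4 did not, hence the validation error above), so treat it as a sketch against a newer apiVersion rather than a confirmed, working configuration:

# Sketch of the intended configuration; rejected on CAPI v0.4.4 because
# skipPhases was not part of the KubeadmConfig schema at that point.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: infra-stage-control-plane
spec:
  kubeadmConfigSpec:
    initConfiguration:
      skipPhases:
      - addon/coredns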

@fabriziopandini (Member)

/milestone v1.2

@k8s-ci-robot added this to the v1.2 milestone on Jan 26, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 26, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 26, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
>
>   • Reopen this issue or PR with /reopen
>   • Mark this issue or PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
