CAPI-managed CoreDNS deployment prevents cluster-autoscaler scale-down #5606

Closed
itspngu opened this issue Nov 8, 2021 · 23 comments
Labels
  • area/dependency: Issues or PRs related to dependency changes
  • area/networking: Issues or PRs related to networking
  • area/upgrades: Issues or PRs related to upgrades
  • kind/bug: Categorizes issue or PR as related to a bug
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed
Milestone: v1.2

Comments

@itspngu

itspngu commented Nov 8, 2021

What steps did you take and what happened:

Set up a workload cluster, installed metrics-server, and set up cluster-autoscaler with the cluster-api provider as per the docs.

Ran some tests for scaling up and down by adding a Deployment with an arbitrary number of replicas; everything went fine.
Currently, the spec.replicas field of the coredns deployment is set to 2 - I have a hunch this might be related to running a cluster version upgrade from 1.20 to 1.21 on Friday.

Because CoreDNS runs as a Deployment in the kube-system namespace (rather than a DaemonSet), I now have one pod on the control plane node and one on a worker node; the latter triggers cluster-autoscaler's condition that prevents node removal¹.

¹: "Kube-system pods that [...] are not run on the node by default, [...] unless the pod has the following annotation [...]: cluster-autoscaler.kubernetes.io/safe-to-evict: true"
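
For reference, the annotation in question goes on the pod itself, so for the coredns Deployment it would have to live in the pod template. A sketch of the fragment (nothing kubeadm or KCP sets by default; shown here only to illustrate the condition the autoscaler checks):

# Sketch (fragment of the Deployment dumped below): the pod-template
# annotation described in the cluster-autoscaler FAQ. Not set by default.
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"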

What did you expect to happen:

When the workload which triggered the scale-up is removed, the cluster should be able to scale down again. The way in which KCP deploys CoreDNS and/or handles Kubernetes version upgrades in that deployment prevents that.

Anything else you would like to add:

CoreDNS deployment info:

# $ kubectl -n kube-system get deployments/coredns -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-11-04T13:51:08Z"
  generation: 1
  labels:
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
# [...]
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - args:
        - -conf
        - /etc/coredns/Corefile
        image: k8s.gcr.io/coredns:1.7.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/coredns
          name: config-volume
          readOnly: true
      dnsPolicy: Default
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: coredns
      serviceAccountName: coredns
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: Corefile
            path: Corefile
          name: coredns
        name: config-volume
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2021-11-04T13:51:08Z"
    lastUpdateTime: "2021-11-04T13:51:52Z"
    message: ReplicaSet "coredns-74ff55c5b" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-11-05T10:22:57Z"
    lastUpdateTime: "2021-11-05T10:22:57Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

Relevant excerpt from cluster-autoscaler's logs:

I1108 11:37:12.118398       1 cluster.go:148] Fast evaluation: playground-stage-md-0-q6cd9 for removal
I1108 11:37:12.118532       1 cluster.go:169] Fast evaluation: node playground-stage-md-0-q6cd9 cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: coredns-74ff55c5b-664d5

Environment:

  • Cluster-api version: v0.4.4
  • Minikube/KIND version: n/a
  • Kubernetes version: v1.21.6 (workload) & v1.21.5-gke.1802 (management)
  • OS (e.g. from /etc/os-release): n/a

/kind bug

/area networking
/area upgrades
/area dependency

@k8s-ci-robot added the kind/bug, area/networking, area/upgrades and area/dependency labels on Nov 8, 2021.
@killianmuldoon (Contributor)

This can happen for sure - I just replicated it on a test cluster. This issue might be a better fit for the cluster autoscaler, though? I'm not sure what the approach would be from the Cluster API end for this problem.

@itspngu (Author)

itspngu commented Nov 8, 2021

I'm not involved with or even particularly familiar with either project, but given how elaborate the ruleset cluster-autoscaler operates on is, they surely a) have a reason for implementing the particular condition producing the reported behaviour, and b) removing or otherwise altering it might be a breaking change. Solving this on the KCP side, on the other hand, could be done in a number of ways and - based on my limited understanding - with relatively little effort (e.g. using the cluster-autoscaler.kubernetes.io/safe-to-evict annotation, running CoreDNS as a DaemonSet, or introducing a feature flag that would allow cluster admins to opt out of KCP provisioning it altogether).

If I'm totally off-track with that assessment, pardon me - but either way I do think this is a CAPI issue, not a CA one.

@killianmuldoon (Contributor)

killianmuldoon commented Nov 8, 2021

> cluster-autoscaler.kubernetes.io/safe-to-evict

I don't think this can/should be done in KCP as it adds an odd dependency on the functioning of the autoscaler.

Similarly, running coredns as a DaemonSet doesn't seem to make sense, as it would end up running on every node in the cluster, which is way over-provisioning unless there's additional logic. And this invisibly adds behaviour that depends on autoscaler implementation logic.

You can already skip installing coredns, and then install it in your preferred way, by using --skip-addons through CAPI - see kubernetes/kubeadm#2261.
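
For reference, the kubeadm-side shape of that skip looks roughly like the sketch below; the skipPhases field and the addon/coredns phase name assume a kubeadm version that supports phase skipping in InitConfiguration (older releases only expose --skip-phases on the CLI):

# Sketch: skipping the CoreDNS addon at the kubeadm level so it can be
# managed separately. Assumes InitConfiguration supports skipPhases.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
skipPhases:
  - addon/coredns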

Could you raise this issue on the Cluster Autoscaler side and see what the response is?

@itspngu (Author)

itspngu commented Nov 8, 2021

Thanks, I wasn't aware of the ability to skip stuff - will check that out. I'd agree that using a DaemonSet would be a bit of a crowbar approach to the problem.

I'm not quite sure what I'd report to the cluster-autoscaler people, as it basically works as designed and documented - do you have any suggestions in that regard?

@killianmuldoon (Contributor)

I'm not certain - but this is an interaction between the autoscaler and kubeadm's default behaviour, so they might want it on their radar. For your specific issue: would it be possible to set that annotation with some other mechanism when you're setting up your cluster? That seems to be what the autoscaler expects here.
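
As a sketch of "some other mechanism" (out-of-band, after the cluster is up; note KCP reconciles this Deployment during version upgrades, so the change may need re-applying):

$ kubectl -n kube-system patch deployment coredns \
    -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}}}}}'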

@itspngu (Author)

itspngu commented Nov 8, 2021

> this is an interaction between the autoscaler and Kubeadm's default behaviour so they might want it on their radar

That'll do. I didn't dig too deep, so I thought the way CoreDNS is set up was specific to cluster-api.

> set that annotation with some other mechanism when you're setting up your cluster

I couldn't think of anything that'd allow me to do that in an automation-friendly manner without it being an awful hack, which is why I opened this issue. I'll check out the kubeadm phase-skipping mechanism and how to deploy CoreDNS myself instead, and (hopefully) report back with a solution to this problem. Thanks for your help!

@neolit123 (Member)

kubeadm deploys two replicas by default to allow them to spread across nodes where possible.
Users can certainly patch the coredns deployment as they see fit.

Long term, kubeadm would like to stop hardcoding the coredns deployment and use an operator instead; that operator is currently alpha and located in the kubernetes-sigs/cluster-addons repository.

> ¹: "Kube-system pods that [...] are not run on the node by default, [...] unless the pod has the following annotation [...]: cluster-autoscaler.kubernetes.io/safe-to-evict: true"

It seems that you should be setting this annotation in the pod template if you want to use the autoscaler and expect this situation to occur.

Alternatively, as mentioned, just skip coredns from CAPI and manage it yourself.

@enxebre (Member)

enxebre commented Nov 8, 2021

There's nothing to do from the autoscaler's point of view; it's working as designed, and the mentioned annotations allow users to tweak eviction behaviour.

This all seems to be expected behaviour to satisfy the kubeadm control plane's CoreDNS deployment topology settings. Does CoreDNS need to run on a worker instance, though? Can't we enforce that it is spread only across control plane machines, so we reduce the "infra" pod surface on compute targeted for running workloads? cc @randomvariable

@sbueringer (Member)

sbueringer commented Nov 8, 2021

I think it's reasonable to run CoreDNS pods on workload nodes if your cluster only has one control plane node.

A compromise could be preferredDuringSchedulingIgnoredDuringExecution (in case it is not used already; I didn't check).
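
One possible shape of that, sketched against the coredns pod spec dumped earlier (the node-role.kubernetes.io/control-plane label key is an assumption, mirroring the toleration already present there):

      # Sketch: soft preference for control-plane nodes, so replicas only
      # land on workers when control-plane capacity doesn't suffice.
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists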

@k8s-ci-robot added the help wanted label on Nov 9, 2021.
@schrej (Member)

schrej commented Nov 9, 2021

I don't think this issue qualifies as help-wanted. While the initial issue is well written, the solution is still under discussion and there is no clear task laid out, which is a requirement for that label.
/remove-help

@k8s-ci-robot removed the help wanted label on Nov 9, 2021.
@killianmuldoon (Contributor)

> I have kicked of this work with https://kubernetes.io/docs/tasks/administer-cluster/dns-horizontal-autoscaling/
> & kubernetes/autoscaler#3196 but there is definitely room for other people helping in this effort, which hopefully in most cases it is also more straight forward than the linked PR (move, adjust import without additional changes).
> If someone want to help, it will be really useful to made a first pass on packages and propose a list of changes so than this issue can be labeled good-first-issue and the work federated.

This appears to be an unrelated comment, copied from #5455 (comment).

Related issue: #5627

@sbarikdev (comment minimized)

@killianmuldoon (Contributor)

killianmuldoon commented Nov 10, 2021

> The autoscaler has an option --skip-nodes-with-system-pods (default is true), which can be used to change this behavior. By default, pods in kube-system are considered as critical pods, hence non-daemonset pods are not drained.

Copied from #5606 (comment).
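
For completeness, that flag goes on the autoscaler binary itself, roughly as sketched below; note it relaxes the check for all non-DaemonSet kube-system pods, not just coredns:

      # Sketch: cluster-autoscaler container args including the flag above
      # (other args as per the cluster-api provider docs).
      containers:
      - name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --cloud-provider=clusterapi
        - --skip-nodes-with-system-pods=false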

@sbarikdev

sbarikdev commented Nov 11, 2021

@killianmuldoon Sorry for copy-pasting the text from the other issue; that was not my intent. I meant to say that autoscaler#3196 might be related to this issue, though we might need to do more updates as well, but I copied the text from another place, which created the confusion here. I am new to the k8s community and still learning things here. I will take care of such things in the future, or talk to the team on Slack for further questions/comments.

@killianmuldoon (Contributor)

Hi @sbarikdev, welcome to the community! Don't worry about the above! :) Feel free to ask questions or join in the conversation here. When you're copying from somewhere else, it's a good idea to post the link and comment on why you think it's important. You can also put it in a quote

> like this

by placing a '>' as the first character in a line of text.

@randomvariable (Member)

randomvariable commented Nov 11, 2021

> I think it's reasonable to run CoreDNS pods on workload nodes if your cluster only has one control plane node.

I think that's correct. For AWS, for example, to avoid the 500 QPS / node rate limit, we need to scale out CoreDNS quite a lot (or even do this).
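
For reference, the usual way to scale it with cluster size is the cluster-proportional-autoscaler from the Kubernetes dns-horizontal-autoscaling docs; a sketch of its tuning ConfigMap, with purely illustrative numbers and a name that has to match the autoscaler's --configmap flag:

# Sketch: ConfigMap consumed by the cluster-proportional-autoscaler that
# resizes the coredns Deployment with node/core count. Values illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":2}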

@itspngu (Author)

itspngu commented Dec 3, 2021

Following up on this: it seems that using preferredDuringSchedulingIgnoredDuringExecution has been discussed, and they likely want to externalize CoreDNS entirely eventually, so the corresponding issue regarding the node selection policy has been closed. I'll proceed with skipping the CoreDNS setup phase as suggested by @killianmuldoon and see what happens. I'm leaving the issue open for now so I can hopefully report back with the results of that workaround, in case anyone else happens to run into this problem.

@itspngu (Author)

itspngu commented Dec 3, 2021

I can't for the life of me figure out how to implement the skipPhases workaround. It's not part of the KubeadmControlPlane or KubeadmConfig{Template} specs, and server-side validation fails accordingly:

KubeadmControlPlane/infra-stage-cluster/infra-stage-control-plane dry-run failed, error: failed to create typed patch object: .spec.kubeadmConfigSpec.initConfiguration.skipPhases: field not declared in schema

After digging through the CAPI CRD definitions, the corresponding kubeadm commit enabling skips, and even the generated cloud-init config, I circled back to GitHub, and all I could find was these two issues:

#1782
#5357

Am I missing something or is skipping phases not actually supported yet?
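
For anyone finding this later: the shape being attempted was roughly the sketch below. It is only accepted by CAPI releases whose kubeadm bootstrap types actually include skipPhases (v0.4.4 did not, hence the validation error above), so treat it as a sketch against a newer apiVersion rather than a confirmed, working configuration:

# Sketch of the intended configuration; rejected on CAPI v0.4.4 because
# skipPhases was not part of the KubeadmConfig schema at that point.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: infra-stage-control-plane
spec:
  kubeadmConfigSpec:
    initConfiguration:
      skipPhases:
      - addon/coredns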

@fabriziopandini (Member)

/milestone v1.2

@k8s-ci-robot added this to the v1.2 milestone on Jan 26, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 26, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 26, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
>
>   • Reopen this issue or PR with /reopen
>   • Mark this issue or PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
