Daemonset Eviction during Scale down #4337

Closed
rangarb885 opened this issue Sep 16, 2021 · 8 comments
Labels: area/cluster-autoscaler, lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@rangarb885

Hello,
I want to understand how to handle the scenario below with CA.

  1. When the EKS CA decides to scale down a node (part of a managed node group) that runs DaemonSets such as fluent-bit (shipping logs from apps) and SignalFx (tracing and metrics), what configuration do I need on CA to make sure the DaemonSet pods are not evicted? Apps may still be using them while scaling down (within their graceful-termination window).

  2. Is there a CA configuration option to skip DaemonSet eviction and allow those pods to run until the node is terminated?
     I am fine with the DaemonSet pods not being stopped gracefully, as long as the apps using them shut down gracefully (with their own graceful-shutdown timeouts).

My current CA configuration (EKS 1.21):

Image : k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0

            "./cluster-autoscaler",
            "--v=2",
            "--stderrthreshold=2",
            "--cloud-provider=aws",
            "--scan-interval=10s",
            "--skip-nodes-with-local-storage=false",
            "--aws-use-static-instance-list=true",
            "--expander=least-waste",
            "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${var.cluster_name}"

Thank you
Balaji

@jim-barber-he

Hi.

If I understand it correctly, the new --daemonset-eviction-for-occupied-nodes=false parameter introduced in cluster-autoscaler 1.22 should handle the scenario where the DaemonSet pods are not evicted at all: they won't be stopped gracefully, but will be killed when the node terminates.
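For reference, a minimal sketch of what that could look like in the cluster-autoscaler Deployment's container spec, assuming an upgrade to a 1.22+ image (the image tag and the other flags shown are illustrative, carried over from the config above):

      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.0   # assumes a 1.22+ release
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${var.cluster_name}
        # Do not evict DaemonSet pods from non-empty nodes during scale-down;
        # they keep running until the instance itself is terminated.
        - --daemonset-eviction-for-occupied-nodes=false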

I'm adding to this issue because it is similar enough to the request I was about to make.
I would like a way to have the DaemonSet pods stop gracefully, but only after all the "normal" pods have completed.
As far as I can tell there is no way to do that?

We have application Deployments with a preStop lifecycle hook to make sure they complete their work before they shut down.
We also run supporting DaemonSets such as node-local-dns, kiam, and fluent-bit to provide DNS, AWS IAM access, and logging for the application pods.
However, when cluster-autoscaler chooses a node to scale in, the DaemonSet pods are terminated before the application pods have finished running, resulting in various errors (such as not being able to resolve hostnames).

To prove that the DaemonSet pods were being terminated too early, I set up a test.
I created a dedicated instance group and then deployed the following (a rough sketch of these manifests is shown after the list):

  • A DaemonSet called testing-daemonset.
  • A Deployment called testing-balloon configured like so:
    • 2 replicas;
    • pod affinity so the replicas are required to run together;
    • an annotation telling cluster-autoscaler it is not allowed to evict them;
    • resources tuned so that the 2 pods take up most of the memory on a node.
  • A Deployment called testing-app configured like so:
    • resources tuned so that it needs enough memory that it won't fit on the node with the 2 testing-balloon pods;
    • a preStop lifecycle hook that sleeps for 5 minutes.
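A rough sketch of the two Deployments, assuming the cluster-autoscaler.kubernetes.io/safe-to-evict annotation for the "do not evict" rule; the image, labels, and memory requests are illustrative and would need tuning to the actual node size:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: testing-balloon
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: testing-balloon
      template:
        metadata:
          labels:
            app: testing-balloon
          annotations:
            # cluster-autoscaler must not evict these pods
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        spec:
          affinity:
            podAffinity:
              # require the replicas to land on the same node
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: testing-balloon
                topologyKey: kubernetes.io/hostname
          containers:
          - name: balloon
            image: busybox:1.36              # placeholder workload
            command: ["sleep", "1000000"]
            resources:
              requests:
                memory: 3Gi                  # sized so 2 replicas fill most of a node
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: testing-app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: testing-app
      template:
        metadata:
          labels:
            app: testing-app
        spec:
          terminationGracePeriodSeconds: 360   # longer than the preStop sleep
          containers:
          - name: app
            image: busybox:1.36                # placeholder workload
            command: ["sleep", "1000000"]
            resources:
              requests:
                memory: 5Gi                    # too big to fit alongside the 2 balloon pods
            lifecycle:
              preStop:
                exec:
                  command: ["sleep", "300"]    # hold termination for 5 minutes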

After deploying the above it looks like so:

$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-xfdl8       1/1     Running   0          14s   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-w6sb8   1/1     Running   0          30s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running   0          30s   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Running   0          17m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running   0          97s   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

I then edited the testing-balloon Deployment so that there is only 1 replica:

NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-xfdl8       1/1     Running   0          26m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running   0          26m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Running   0          43m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running   0          27m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

I then waited for cluster autoscaler to start a scale in and caught it at this point:

NAME                               READY   STATUS        RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-dbmst       1/1     Running       0          30s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-app-75546f8c56-xfdl8       1/1     Terminating   0          28m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running       0          28m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Terminating   0          45m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running       0          29m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

Here you can see a new testing-app pod has started on the same node where the remaining testing-balloon pod is running.
The old testing-app pod is Terminating, but the testing-daemonset pod on that node is also Terminating.
So at this point cluster-autoscaler has evicted both the "normal" pods and the DaemonSet pods.

Then a bit later I see this:

NAME                               READY   STATUS        RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-dbmst       1/1     Running       0          39s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-app-75546f8c56-xfdl8       1/1     Terminating   0          28m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running       0          28m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running       0          30m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

The old testing-daemonset pod is gone, but the old testing-app pod is still there running its preStop hook.
At this point, an application pod that relied on those DaemonSets would be broken and unable to perform its shutdown tasks properly, causing problems.

The above was tested with cluster-autoscaler version 1.21 using the following command in the Deployment:

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=cluster-autoscaler
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/prod3.he0.io
        - --balance-similar-node-groups=true
        - --expander=least-waste
        - --logtostderr=true
        - --max-graceful-termination-sec=6000
        - --scale-down-delay-after-delete=10m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=info
        - --v=4

Is it possible to provide a way for the daemonset evictions to wait until all other pods are gone or in the Completed state?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RicardsRikmanis

Encountered the same issue.

We have pods with preStop hooks that run sleep commands. In our case, we have StatefulSets that depend on the aws-ebs-csi DaemonSet pods to detach and unmount their volumes.

When CA scales down nodes, all the pods are evicted, including the ebs-csi-node pods, while our StatefulSet pods are stuck in the Terminating state since they can't unmount the attached volumes without the ebs-csi-node pod.

From the previous comment I see mention of --daemonset-eviction-for-occupied-nodes=false. We will try it, but as that comment said, graceful shutdown of DaemonSets would be preferable to killing them.

If anyone has solved this issue, feel free to comment here; I would greatly appreciate it.

@x13n
Member

x13n commented May 7, 2024

This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming DaemonSets are higher priority, which is generally a reasonable setup).
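As an illustration, a minimal sketch of a lower PriorityClass for ordinary application pods (DaemonSets such as node-local-dns typically already run at a system priority class); the class name and value here are illustrative, and the exact format of the --drain-priority-config value should be taken from the cluster-autoscaler FAQ for your version:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: app-workload              # illustrative name
    value: 1000                       # lower than the DaemonSets' priority
    globalDefault: false
    description: "Ordinary application pods, drained before higher-priority DaemonSet pods."

Application pod specs would then reference it via priorityClassName: app-workload, so that cluster-autoscaler's priority-based evictor drains them before the DaemonSet pods.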

@jim-barber-he
Copy link

This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming DaemonSets are higher priority, which is generally a reasonable setup).

Thank you for pointing this out.
