Daemonset Eviction during Scale down #4337

Closed
rangarb885 opened this issue Sep 16, 2021 · 8 comments
Labels: area/cluster-autoscaler, lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@rangarb885

Hello,
I want to understand how to handle the scenario below with CA.

  1. When the EKS CA decides to scale down a node (part of a managed node group) that runs DaemonSets such as fluent-bit (shipping logs from apps) and SignalFx (tracing and metrics), what configuration do I need on CA to make sure the DaemonSet pods are not evicted? Apps may still be using them while scaling down (within their graceful-termination window).

  2. Is there a CA configuration option to skip DaemonSet eviction and allow those pods to run until the node is terminated?
     I am fine with the DaemonSet pods not being stopped gracefully, as long as the apps using them shut down gracefully (with their own graceful-shutdown timeouts).

My current CA configuration (EKS 1.21):

Image : k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0

            "./cluster-autoscaler",
            "--v=2",
            "--stderrthreshold=2",
            "--cloud-provider=aws",
            "--scan-interval=10s",
            "--skip-nodes-with-local-storage=false",
            "--aws-use-static-instance-list=true",
            "--expander=least-waste",
            "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${var.cluster_name}"

Thank you
Balaji

@jim-barber-he

Hi.

If I understand it correctly, the new --daemonset-eviction-for-occupied-nodes=false parameter introduced in cluster-autoscaler 1.22 should handle the scenario where the DaemonSet pods are not evicted at all: they won't be stopped gracefully, but will be killed when the node terminates.
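For reference, a minimal sketch of what that could look like in the cluster-autoscaler Deployment's container spec, assuming an upgrade to a 1.22+ image (the image tag and the other flags shown are illustrative, carried over from the config above):

      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.0   # assumes a 1.22+ release
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${var.cluster_name}
        # Do not evict DaemonSet pods from non-empty nodes during scale-down;
        # they keep running until the instance itself is terminated.
        - --daemonset-eviction-for-occupied-nodes=false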

I'm adding to this issue because it is similar enough to the request I was about to make.
I would like a way to have the DaemonSet pods stop gracefully, but only after all the "normal" pods have completed.
As far as I can tell there is no way to do that?

We have application Deployments with a preStop lifecycle hook to make sure they complete their work before they shut down.
We also run supporting DaemonSets such as node-local-dns, kiam, and fluent-bit to provide DNS, AWS IAM access, and logging for the application pods.
However, when cluster-autoscaler chooses a node to scale in, the DaemonSet pods are terminated before the application pods have finished running, resulting in various errors (such as not being able to resolve hostnames).

To prove that the DaemonSet pods were being terminated too early, I set up a test.
I created a dedicated instance group and then deployed the following (a rough sketch of these manifests is shown after the list):

  • A DaemonSet called testing-daemonset.
  • A Deployment called testing-balloon configured like so:
    • 2 replicas;
    • pod affinity so the replicas are required to run together;
    • an annotation telling cluster-autoscaler it is not allowed to evict them;
    • resources tuned so that the 2 pods take up most of the memory on a node.
  • A Deployment called testing-app configured like so:
    • resources tuned so that it needs enough memory that it won't fit on the node with the 2 testing-balloon pods;
    • a preStop lifecycle hook that sleeps for 5 minutes.
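A rough sketch of the two Deployments, assuming the cluster-autoscaler.kubernetes.io/safe-to-evict annotation for the "do not evict" rule; the image, labels, and memory requests are illustrative and would need tuning to the actual node size:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: testing-balloon
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: testing-balloon
      template:
        metadata:
          labels:
            app: testing-balloon
          annotations:
            # cluster-autoscaler must not evict these pods
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        spec:
          affinity:
            podAffinity:
              # require the replicas to land on the same node
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: testing-balloon
                topologyKey: kubernetes.io/hostname
          containers:
          - name: balloon
            image: busybox:1.36              # placeholder workload
            command: ["sleep", "1000000"]
            resources:
              requests:
                memory: 3Gi                  # sized so 2 replicas fill most of a node
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: testing-app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: testing-app
      template:
        metadata:
          labels:
            app: testing-app
        spec:
          terminationGracePeriodSeconds: 360   # longer than the preStop sleep
          containers:
          - name: app
            image: busybox:1.36                # placeholder workload
            command: ["sleep", "1000000"]
            resources:
              requests:
                memory: 5Gi                    # too big to fit alongside the 2 balloon pods
            lifecycle:
              preStop:
                exec:
                  command: ["sleep", "300"]    # hold termination for 5 minutes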

After deploying the above it looks like so:

$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-xfdl8       1/1     Running   0          14s   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-w6sb8   1/1     Running   0          30s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running   0          30s   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Running   0          17m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running   0          97s   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

I then edited the testing-balloon Deployment so that there is only 1 replica:

NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-xfdl8       1/1     Running   0          26m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running   0          26m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Running   0          43m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running   0          27m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

I then waited for cluster autoscaler to start a scale in and caught it at this point:

NAME                               READY   STATUS        RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-dbmst       1/1     Running       0          30s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-app-75546f8c56-xfdl8       1/1     Terminating   0          28m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running       0          28m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Terminating   0          45m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running       0          29m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

Here you can see a new testing-app pod has started on the same node where the remaining testing-balloon pod is running.
The old testing-app pod is Terminating, but the testing-daemonset pod on that node is also Terminating.
So at this point cluster-autoscaler has evicted both the "normal" pods and the DaemonSet pods.

Then a bit later I see this:

NAME                               READY   STATUS        RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-dbmst       1/1     Running       0          39s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-app-75546f8c56-xfdl8       1/1     Terminating   0          28m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running       0          28m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running       0          30m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

The old testing-daemonset pod is gone, but the old testing-app pod is still there running its preStop hook.
At this point, an application pod that relied on those DaemonSets would be broken and unable to perform its shutdown tasks properly, causing problems.

The above was tested with cluster-autoscaler version 1.21 using the following command in the Deployment:

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=cluster-autoscaler
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/prod3.he0.io
        - --balance-similar-node-groups=true
        - --expander=least-waste
        - --logtostderr=true
        - --max-graceful-termination-sec=6000
        - --scale-down-delay-after-delete=10m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=info
        - --v=4

Is it possible to provide a way for the daemonset evictions to wait until all other pods are gone or in the Completed state?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RicardsRikmanis

Encountered the same issue.

We have pods with preStop hooks that run sleep commands. In our case, we have StatefulSets that depend on the aws-ebs-csi DaemonSet pods to detach and unmount their volumes.

When CA scales down nodes, all the pods are evicted, including the ebs-csi-node pods, while our StatefulSet pods are stuck in the Terminating state since they can't unmount the attached volumes without the ebs-csi-node pod.

From the previous comment I see mention of --daemonset-eviction-for-occupied-nodes=false. We will try it, but as that comment said, graceful shutdown of DaemonSets would be preferable to killing them.

If anyone has solved this issue, feel free to comment here; I would greatly appreciate it.

@x13n
Member

x13n commented May 7, 2024

This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming DaemonSets are higher priority, which is generally a reasonable setup).
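As an illustration, a minimal sketch of a lower PriorityClass for ordinary application pods (DaemonSets such as node-local-dns typically already run at a system priority class); the class name and value here are illustrative, and the exact format of the --drain-priority-config value should be taken from the cluster-autoscaler FAQ for your version:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: app-workload              # illustrative name
    value: 1000                       # lower than the DaemonSets' priority
    globalDefault: false
    description: "Ordinary application pods, drained before higher-priority DaemonSet pods."

Application pod specs would then reference it via priorityClassName: app-workload, so that cluster-autoscaler's priority-based evictor drains them before the DaemonSet pods.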

@jim-barber-he
Copy link

This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming DaemonSets are higher priority, which is generally a reasonable setup).

Thank you for pointing this out.
