
Handle pod eviction errors correctly #5116

Merged · 10 commits · Apr 21, 2022

Conversation

hintofbasil
Contributor

Description

Currently, any eviction error causes the draining of a node to stop and a new
node to start draining. Eviction errors are common, expected occurrences,
especially when PDBs (PodDisruptionBudgets) are used in the cluster.
Because any error aborts the draining of a node, the entire node-draining
process is slowed down, even though many of the pods further down the list
could happily be drained.

This change separates recoverable and unrecoverable eviction errors and retries
only the recoverable ones. Unrecoverable errors fail the entire command.

An important aspect of this change is that the `evictPods` function becomes
blocking until the node is drained or the process times out. This is required
because the current implementation begins draining another node on the first
eviction error. We would rather keep retrying, and eventually time out, than
make a bad situation worse by draining another node.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@hintofbasil force-pushed the handle-pod-eviction-errors branch from adb467f to 20ac5db on April 12, 2022 at 18:43
@@ -204,7 +203,7 @@ func (d *Evictor) daemonSetFilter(pod corev1.Pod) PodDeleteStatus {
 if controllerRef.Name == ignoreDaemonSet.Name {
 switch ignoreDaemonSet.Namespace {
 case pod.Namespace, metav1.NamespaceAll:
-return makePodDeleteStatusWithWarning(false, daemonSetWarning)
+return makePodDeleteStatusSkip()
Contributor Author
This error is super noisy and adds next to zero value.

There may be a better way to disable it / filter it out but this was the best I found

Contributor

Yeah, that error came from some copied over eviction code from 3 years ago. I don't think we need it... :D

@@ -27,6 +29,13 @@ import (
// retryDelay is how long is slept before retry after an error occurs during drainage
const retryDelay = 5 * time.Second

var recoverablePodEvictionErrors = [...]string{
Contributor Author

This list may not be exhaustive. These are the ones I found through manual testing.

Contributor

Hmm, I wonder if we could check more reliably than this, using some Kubernetes const for the error message. I'm a bit reluctant to rely on error messages that are subject to change, trimming, extra spaces, etc...

Contributor

	"TooManyRequests",
	"NotFound",

These two are definitely metav1 status reasons: `StatusReasonNotFound StatusReason = "NotFound"` and `StatusReasonTooManyRequests StatusReason = "TooManyRequests"`.

The other two appear to be some aws specific things. How did you encounter them? Can you print the whole error so we can see what interface it implements at that point?

Contributor Author

I've updated this to use the checker functions built into apimachinery.

I've re-run a manual test and it works perfectly.

@hintofbasil
Contributor Author

Logs from the manual testing

2022-04-12 19:16:54 [ℹ]  eksctl version 0.93.0-dev+920c2f91.2022-04-12T18:32:04Z
2022-04-12 19:16:54 [ℹ]  using region eu-west-1
2022-04-12 19:16:58 [ℹ]  comparing 2 nodegroups defined in the given config ("/tmp/cellsdev-1-eu-west-1a-1.yaml") against remote state
2022-04-12 19:17:04 [ℹ]  6 nodegroup(s) present in the config file (mixed-instances-12xl-a0adaeb2,mixed-instances-4xl-002a6f2a,mixed-instances-8xl-722a76be,nodes-default-gateway-ingress-5a731ab7,nodes-ingress-3aa9f35a,nodes-public-ingress-24e77938) will be excluded
2022-04-12 19:17:04 [ℹ]  1 nodegroup (mixed-instances-4xl-002a6f2a-test) was included (based on the include/exclude rules)
2022-04-12 19:17:04 [ℹ]  will drain 1 nodegroup(s) in cluster "cellsdev-1-eu-west-1a-1"
2022-04-12 19:17:07 [ℹ]  starting parallel draining, max in-flight of 5
2022-04-12 19:17:08 [ℹ]  cordon node "ip-172-16-1-163.eu-west-1.compute.internal"
2022-04-12 19:17:09 [ℹ]  cordon node "ip-172-16-10-58.eu-west-1.compute.internal"
2022-04-12 19:17:09 [ℹ]  cordon node "ip-172-16-15-44.eu-west-1.compute.internal"
2022-04-12 19:17:09 [ℹ]  cordon node "ip-172-16-21-109.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-29-0.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-29-49.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-30-253.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-4-65.eu-west-1.compute.internal"
2022-04-12 19:18:11 [!]  2 pods are unevictable from node ip-172-16-4-65.eu-west-1.compute.internal
2022-04-12 19:19:10 [!]  1 pods are unevictable from node ip-172-16-29-49.eu-west-1.compute.internal
2022-04-12 19:20:16 [!]  1 pods are unevictable from node ip-172-16-29-49.eu-west-1.compute.internal
2022-04-12 19:20:26 [✔]  drained all nodes: [ip-172-16-15-44.eu-west-1.compute.internal ip-172-16-21-109.eu-west-1.compute.internal ip-172-16-30-253.eu-west-1.compute.internal ip-172-16-10-58.eu-west-1.compute.internal ip-172-16-29-0.eu-west-1.compute.internal ip-172-16-29-49.eu-west-1.compute.internal ip-172-16-1-163.eu-west-1.compute.internal ip-172-16-4-65.eu-west-1.compute.internal]
2022-04-12 19:20:26 [ℹ]  will delete 1 nodegroups from cluster "cellsdev-1-eu-west-1a-1"
2022-04-12 19:20:28 [ℹ]  1 task: { 1 task: { delete nodegroup "mixed-instances-4xl-002a6f2a-test" [async] } }
2022-04-12 19:20:28 [ℹ]  will delete stack "eksctl-cellsdev-1-eu-west-1a-1-nodegroup-mixed-instances-4xl-002a6f2a-test"
2022-04-12 19:20:28 [✔]  deleted 1 nodegroup(s) from cluster "cellsdev-1-eu-west-1a-1"

@Skarlso
Contributor

Skarlso commented Apr 13, 2022

I actually noticed this as well! Thank you for doing this. I will check the code out. :)

@Skarlso Skarlso self-assigned this Apr 13, 2022
@Skarlso Skarlso requested a review from a team April 13, 2022 05:56
pkg/drain/nodegroup.go — 6 review threads (outdated, resolved)
@hintofbasil
Contributor Author

I can see that all the CI checks have passed. Let me know if you think this PR is ready and I can rebase it into a single commit.

@Skarlso
Contributor

Skarlso commented Apr 17, 2022

Hi @hintofbasil. Will do. Please bear with us for a while as the whole team is off for Easter. :)

@Skarlso Skarlso requested a review from a team April 17, 2022 19:12
@Himangini (Collaborator) left a comment

LGTM 👍🏻 nicely done

@Himangini Himangini requested a review from a team April 21, 2022 09:23
@Skarlso Skarlso enabled auto-merge (squash) April 21, 2022 14:16
@Skarlso
Contributor

Skarlso commented Apr 21, 2022

@hintofbasil Thank you for your contribution! :)

@Skarlso Skarlso merged commit 7c7c092 into eksctl-io:main Apr 21, 2022
@hintofbasil
Contributor Author

Thank you for the reviews!

Hopefully there will be more contributions in the future.

@hspencer77 mentioned this pull request on Jul 8, 2022