
Handle pod eviction errors correctly #5116

Merged · 10 commits · Apr 21, 2022

Conversation

hintofbasil
Contributor

Description

Currently, any eviction error causes the draining of a node to stop and a new
node to start draining. Eviction errors are common, expected occurrences,
especially when PDBs (PodDisruptionBudgets) are used in the cluster.
Because any error aborts the draining of a node, the entire node-draining
process is slowed down, even though many of the pods further down the list
could happily be drained.

This change separates recoverable and unrecoverable eviction errors and retries
only the recoverable ones. Unrecoverable errors fail the entire command.

An important aspect of this change is that the `evictPods` function becomes
blocking until the node is drained or the process times out. This is required
because the current implementation begins draining another node on the first
eviction error. We would rather keep retrying, and eventually time out, than
make a bad situation worse by draining another node.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@hintofbasil force-pushed the handle-pod-eviction-errors branch from adb467f to 20ac5db on April 12, 2022 at 18:43
@@ -204,7 +203,7 @@ func (d *Evictor) daemonSetFilter(pod corev1.Pod) PodDeleteStatus {
 if controllerRef.Name == ignoreDaemonSet.Name {
 switch ignoreDaemonSet.Namespace {
 case pod.Namespace, metav1.NamespaceAll:
-return makePodDeleteStatusWithWarning(false, daemonSetWarning)
+return makePodDeleteStatusSkip()
Contributor Author
This error is super noisy and adds next to zero value.

There may be a better way to disable it / filter it out but this was the best I found

Contributor

Yeah, that error came from some copied over eviction code from 3 years ago. I don't think we need it... :D

@@ -27,6 +29,13 @@ import (
// retryDelay is how long is slept before retry after an error occurs during drainage
const retryDelay = 5 * time.Second

var recoverablePodEvictionErrors = [...]string{
Contributor Author

This list may not be exhaustive. These are the ones I found through manual testing.

Contributor

Hmm, I wonder if we could check more reliably than this, using some Kubernetes const for the error message. I'm a bit reluctant to rely on error messages that are subject to change, trimming, extra spaces, etc...

Contributor

	"TooManyRequests",
	"NotFound",

These two are definitely metav1 status reasons: `StatusReasonNotFound StatusReason = "NotFound"` and `StatusReasonTooManyRequests StatusReason = "TooManyRequests"`.

The other two appear to be some aws specific things. How did you encounter them? Can you print the whole error so we can see what interface it implements at that point?

Contributor Author

I've updated this to use the checker functions built into apimachinery.

I've re-run a manual test and it works perfectly.

@hintofbasil
Contributor Author

Logs from the manual testing

2022-04-12 19:16:54 [ℹ]  eksctl version 0.93.0-dev+920c2f91.2022-04-12T18:32:04Z
2022-04-12 19:16:54 [ℹ]  using region eu-west-1
2022-04-12 19:16:58 [ℹ]  comparing 2 nodegroups defined in the given config ("/tmp/cellsdev-1-eu-west-1a-1.yaml") against remote state
2022-04-12 19:17:04 [ℹ]  6 nodegroup(s) present in the config file (mixed-instances-12xl-a0adaeb2,mixed-instances-4xl-002a6f2a,mixed-instances-8xl-722a76be,nodes-default-gateway-ingress-5a731ab7,nodes-ingress-3aa9f35a,nodes-public-ingress-24e77938) will be excluded
2022-04-12 19:17:04 [ℹ]  1 nodegroup (mixed-instances-4xl-002a6f2a-test) was included (based on the include/exclude rules)
2022-04-12 19:17:04 [ℹ]  will drain 1 nodegroup(s) in cluster "cellsdev-1-eu-west-1a-1"
2022-04-12 19:17:07 [ℹ]  starting parallel draining, max in-flight of 5
2022-04-12 19:17:08 [ℹ]  cordon node "ip-172-16-1-163.eu-west-1.compute.internal"
2022-04-12 19:17:09 [ℹ]  cordon node "ip-172-16-10-58.eu-west-1.compute.internal"
2022-04-12 19:17:09 [ℹ]  cordon node "ip-172-16-15-44.eu-west-1.compute.internal"
2022-04-12 19:17:09 [ℹ]  cordon node "ip-172-16-21-109.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-29-0.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-29-49.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-30-253.eu-west-1.compute.internal"
2022-04-12 19:17:10 [ℹ]  cordon node "ip-172-16-4-65.eu-west-1.compute.internal"
2022-04-12 19:18:11 [!]  2 pods are unevictable from node ip-172-16-4-65.eu-west-1.compute.internal
2022-04-12 19:19:10 [!]  1 pods are unevictable from node ip-172-16-29-49.eu-west-1.compute.internal
2022-04-12 19:20:16 [!]  1 pods are unevictable from node ip-172-16-29-49.eu-west-1.compute.internal
2022-04-12 19:20:26 [✔]  drained all nodes: [ip-172-16-15-44.eu-west-1.compute.internal ip-172-16-21-109.eu-west-1.compute.internal ip-172-16-30-253.eu-west-1.compute.internal ip-172-16-10-58.eu-west-1.compute.internal ip-172-16-29-0.eu-west-1.compute.internal ip-172-16-29-49.eu-west-1.compute.internal ip-172-16-1-163.eu-west-1.compute.internal ip-172-16-4-65.eu-west-1.compute.internal]
2022-04-12 19:20:26 [ℹ]  will delete 1 nodegroups from cluster "cellsdev-1-eu-west-1a-1"
2022-04-12 19:20:28 [ℹ]  1 task: { 1 task: { delete nodegroup "mixed-instances-4xl-002a6f2a-test" [async] } }
2022-04-12 19:20:28 [ℹ]  will delete stack "eksctl-cellsdev-1-eu-west-1a-1-nodegroup-mixed-instances-4xl-002a6f2a-test"
2022-04-12 19:20:28 [✔]  deleted 1 nodegroup(s) from cluster "cellsdev-1-eu-west-1a-1"

@Skarlso
Contributor

Skarlso commented Apr 13, 2022

I actually noticed this as well! Thank you for doing this. I will check the code out. :)

@Skarlso Skarlso self-assigned this Apr 13, 2022
@Skarlso Skarlso requested a review from a team April 13, 2022 05:56
pkg/drain/nodegroup.go — 6 review threads (outdated, resolved)
@hintofbasil
Contributor Author

I can see that all the CI checks have passed. Let me know if you think this PR is ready and I can rebase it into a single commit.

@Skarlso
Contributor

Skarlso commented Apr 17, 2022

Hi @hintofbasil. Will do. Please bear with us for a while as the whole team is off for Easter. :)

@Skarlso Skarlso requested a review from a team April 17, 2022 19:12
@Himangini (Collaborator) left a comment

LGTM 👍🏻 nicely done

@Himangini Himangini requested a review from a team April 21, 2022 09:23
@Skarlso Skarlso enabled auto-merge (squash) April 21, 2022 14:16
@Skarlso
Contributor

Skarlso commented Apr 21, 2022

@hintofbasil Thank you for your contribution! :)

@Skarlso Skarlso merged commit 7c7c092 into eksctl-io:main Apr 21, 2022
@hintofbasil
Contributor Author

Thank you for the reviews!

Hopefully there will be more contributions in the future.

@hspencer77 mentioned this pull request on Jul 8, 2022