Handle pod eviction errors correctly #5116
Conversation
Currently any eviction error causes the draining of a node to stop and a new node to start draining. Eviction errors are common, expected occurrences, especially when PDBs are used in the cluster. By having any error abort the draining of a node we slow down the entire node draining process, as many of the pods further in the list could happily be drained. This change separates recoverable and unrecoverable eviction errors and retries only the recoverable ones. Unrecoverable errors fail the entire command. An important aspect of this is that the `evictPods` function becomes blocking until a node is drained or the process times out. This is required as the current implementation begins draining another node on the first eviction error. We would rather keep trying and eventually time out than make a bad situation worse by draining a new node.
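Roughly, the retry behaviour described above could look like the sketch below. This is illustrative only, not eksctl's actual code: `evictPod` is a hypothetical callback and the real `evictPods` signature differs. Recoverable errors are retried until the pod list empties or the context times out; anything else aborts the drain.

```go
package drain

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

const retryDelay = 5 * time.Second

// evictPods keeps retrying recoverable eviction errors (e.g. a PDB temporarily
// blocking the eviction) until every pod is gone or ctx times out. Any
// unrecoverable error aborts the whole drain instead of silently moving on.
func evictPods(ctx context.Context, evictPod func(corev1.Pod) error, pods []corev1.Pod) error {
	pending := pods
	for len(pending) > 0 {
		var stillPending []corev1.Pod
		for _, pod := range pending {
			err := evictPod(pod)
			switch {
			case err == nil, apierrors.IsNotFound(err):
				// evicted, or the pod is already gone
			case apierrors.IsTooManyRequests(err):
				// a PDB is blocking this eviction right now; retry later
				stillPending = append(stillPending, pod)
			default:
				// unrecoverable: fail the whole command rather than move on
				return fmt.Errorf("evicting pod %s/%s: %w", pod.Namespace, pod.Name, err)
			}
		}
		pending = stillPending
		if len(pending) == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %d pod(s) to be evicted: %w", len(pending), ctx.Err())
		case <-time.After(retryDelay):
		}
	}
	return nil
}
```

The key point is that the function only returns once the node is empty or the deadline passes, so the caller never starts draining another node on a transient PDB rejection.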
Force-pushed from adb467f to 20ac5db
@@ -204,7 +203,7 @@ func (d *Evictor) daemonSetFilter(pod corev1.Pod) PodDeleteStatus {
if controllerRef.Name == ignoreDaemonSet.Name {
switch ignoreDaemonSet.Namespace {
case pod.Namespace, metav1.NamespaceAll:
- return makePodDeleteStatusWithWarning(false, daemonSetWarning)
+ return makePodDeleteStatusSkip()
This error is super noisy and adds next to zero value.
There may be a better way to disable it / filter it out, but this was the best I found.
Yeah, that error came from some copied over eviction code from 3 years ago. I don't think we need it... :D
pkg/drain/nodegroup.go
Outdated
@@ -27,6 +29,13 @@ import (
// retryDelay is how long is slept before retry after an error occurs during drainage
const retryDelay = 5 * time.Second

+ var recoverablePodEvictionErrors = [...]string{
This list may not be exhaustive. These are the ones I found through manual testing.
Hmm, I wonder if we could check more reliably than this, using some Kubernetes const
for the error message. I'm a bit reluctant to rely on error messages that are subject to change, trimming, extra spaces, etc...
"TooManyRequests",
"NotFound",
These two are definitely apimachinery meta constants: `StatusReasonNotFound StatusReason = "NotFound"` and `StatusReasonTooManyRequests StatusReason = "TooManyRequests"`.
The other two appear to be some aws specific things. How did you encounter them? Can you print the whole error so we can see what interface it implements at that point?
I've updated this to use the checker functions built into apimachinery.
I've re-run a manual test and it works perfectly.
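For reference, a minimal sketch of what using apimachinery's typed checks instead of the string list could look like (`isRecoverable` is just an illustrative helper name, not necessarily what the PR uses):

```go
package drain

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRecoverable is a hypothetical helper: instead of matching error message
// substrings such as "TooManyRequests" or "NotFound", it relies on the typed
// StatusReason carried by the API error.
func isRecoverable(err error) bool {
	// 429 TooManyRequests: the eviction is currently blocked, typically by a
	// PodDisruptionBudget; retrying later is expected to succeed.
	// 404 NotFound: the pod is already gone, which is fine when draining.
	return apierrors.IsTooManyRequests(err) || apierrors.IsNotFound(err)
}
```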
Logs from the manual testing
I actually noticed this as well! Thank you for doing this. I will check the code out. :)
Co-authored-by: Gergely Brautigam <[email protected]>
I can see that all the CI checks have passed. Let me know if you think this PR is ready and I can rebase it into a single commit.
Hi @hintofbasil. Will do. Please bear with us for a while as the whole team is off for Easter. :)
LGTM 👍🏻 nicely done
@hintofbasil Thank you for your contribution! :)
Thank you for the reviews! Hopefully there will be more contributions in the future.
Description
Currently any eviction error causes the draining of a node to stop and a new
node to start draining. Eviction errors are common, expected occurrences
especially when PDBs are used in the cluster.
By having any error abort the draining of a node we slow down the entire node
draining process as many of the pods further in the list could happily be
drained.
This change separates recoverable and unrecoverable eviction errors and retries
only the recoverable ones. Unrecoverable errors fail the entire command.
An important aspect of this is that the `evictPods` function becomes blocking
until a node is drained or the process times out. This is required as the
current implementation begins draining another node on the first eviction
error. We would rather keep trying and eventually time out than make a bad
situation worse by draining a new node.
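As a rough sketch of that node-level behaviour (hypothetical names, not the actual eksctl implementation; `drainNode` stands in for whatever performs the per-node drain), the next node is only touched once the current one is fully drained, and a timeout fails the command instead of moving on:

```go
package drain

import (
	"context"
	"fmt"
	"time"
)

// drainNodes drains nodes one at a time. If a node cannot be drained within
// its timeout, the whole operation fails rather than starting on another node.
func drainNodes(nodes []string, perNodeTimeout time.Duration, drainNode func(context.Context, string) error) error {
	for _, node := range nodes {
		ctx, cancel := context.WithTimeout(context.Background(), perNodeTimeout)
		err := drainNode(ctx, node)
		cancel()
		if err != nil {
			return fmt.Errorf("draining node %q: %w", node, err)
		}
	}
	return nil
}
```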
Checklist
- Added/updated documentation (such as the `README.md`, or the `userdocs` directory)
- Added labels for the area (e.g. `area/nodegroup`) and kind (e.g. `kind/improvement`)
BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯