Incorrect polling logic in handleRebootUncordon for Scheduled Event Draining #1059

xabinapal · 2024-08-29T21:06:24Z

Description of changes:

I identified an issue with the polling logic in the handleRebootUncordon function within the IMDS mode of NTH. Currently, the polling continues regardless of whether the uncordon checks and requests succeed, because the wrapped function always returns false. This behavior results in repeated retries until the context timeout is reached, ultimately causing a misleading "context deadline exceeded" error.

Here is the error message that is always displayed:

2024/08/29 13:42:42 WRN All retries failed, unable to complete the uncordon after reboot workflow error="context deadline exceeded"

I’ve modified the wrapper function to return true when handleRebootUncordon completes without errors. Polling now stops once all calls to the Kubernetes API have succeeded, which prevents unnecessary retries and removes the misleading timeout error from the logs.

For reference, this code uses the PollUntilContextCancel function from the k8s.io/apimachinery/pkg/util/wait package. According to the documentation:

PollUntilContextCancel tries a condition func until it returns true, an error, or the context is cancelled or hits a deadline. condition will be invoked after the first interval if the context is not cancelled first. The returned error will be from ctx.Err(), the condition's err return value, or nil. If invoking condition takes longer than interval the next condition will be invoked immediately. When using very short intervals, condition may be invoked multiple times before a context cancellation is detected. If immediate is true, condition will be invoked before waiting and guarantees that condition is invoked at least once, regardless of whether the context has been cancelled.

You can find the full documentation here.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Xabier Napal <[email protected]>

cmd/node-termination-handler.go

xabinapal requested a review from a team as a code owner August 29, 2024 21:06

don't retry handling reboot uncordon on successful execution

4633de9

Signed-off-by: Xabier Napal <[email protected]>

xabinapal force-pushed the retry-logic branch from cd1465c to 4633de9 Compare August 29, 2024 21:11

LikithaVemulapalli reviewed Sep 5, 2024

cmd/node-termination-handler.go (review thread resolved)

LikithaVemulapalli approved these changes Sep 12, 2024

LikithaVemulapalli merged commit 498fc02 into aws:main Sep 12, 2024
14 checks passed
