Incorrect polling logic in handleRebootUncordon for Scheduled Event Draining #1059

Merged: 1 commit, Sep 12, 2024

xabinapal
Contributor

Description of changes:

I identified an issue with the polling logic in the handleRebootUncordon function within the IMDS mode of NTH. Currently, the polling continues regardless of whether the uncordon checks and requests succeed, because the wrapped function always returns false. This behavior results in repeated retries until the context timeout is reached, ultimately causing a misleading "context deadline exceeded" error.

Here is the error message that is always displayed:

2024/08/29 13:42:42 WRN All retries failed, unable to complete the uncordon after reboot workflow error="context deadline exceeded"

I’ve modified the wrapper function to return true when handleRebootUncordon completes without errors. This ensures that polling stops once all calls to the Kubernetes API have succeeded, which prevents unnecessary retries and produces clearer logs. This should help avoid confusion.

For reference, this code utilizes the PollUntilContextCancel function from the k8s.io/apimachinery package. According to the documentation:

PollUntilContextCancel tries a condition func until it returns true, an error, or the context is cancelled or hits a deadline. condition will be invoked after the first interval if the context is not cancelled first. The returned error will be from ctx.Err(), the condition's err return value, or nil. If invoking condition takes longer than interval the next condition will be invoked immediately. When using very short intervals, condition may be invoked multiple times before a context cancellation is detected. If immediate is true, condition will be invoked before waiting and guarantees that condition is invoked at least once, regardless of whether the context has been cancelled.

You can find the full documentation here.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@xabinapal xabinapal requested a review from a team as a code owner August 29, 2024 21:06
@LikithaVemulapalli LikithaVemulapalli merged commit 498fc02 into aws:main Sep 12, 2024
14 checks passed