Fix buggy retry logic in syncIngress() #7086
Conversation
This code looks like it was intended to do retries with exponential backoff, but due to a bug it seems to have done no retries at all. Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.

The new implementation instead checks whether there are retries left (according to the intended policy of 15 retries). If there are retries left, it logs the error but returns "false, nil", which triggers a retry from wait.ExponentialBackoff.

We've had issues with this "retry" loop on a real deployment on multiple occasions: the nginx-controller fails to get healthy in time and gets stuck in a crashloop for minutes or hours with an error akin to this:

Unexpected failure reconfiguring NGINX: "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"

The hope is that this bugfix will resolve this existing crash-on-startup bug.
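For illustration, here is a minimal, self-contained sketch (not the controller code) of the wait.ExponentialBackoff contract described above: returning true or a non-nil error stops immediately, while returning (false, nil) asks for another attempt.

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 1.2, Steps: 5}

	attempts := 0
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		attempts++
		// Returning (false, err) here would stop the loop immediately -- that is
		// what the old condition did on every failure. Returning (false, nil)
		// instead asks ExponentialBackoff to sleep and try again.
		return false, nil
	})
	// After Steps failed attempts, ExponentialBackoff gives up and returns
	// wait.ErrWaitTimeout.
	fmt.Println(attempts, errors.Is(err, wait.ErrWaitTimeout)) // 5 true
}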
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @steinarvk-kolonial!
Hi @steinarvk-kolonial. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
uhm @steinarvk-kolonial did you notice that, in addition to never retrying, the backoff factor is smaller than 1 🙈 I don't think it's supposed to be like that. It basically tries 15 times in 6 seconds and it speeds up near the end. For this backoff behavior to actually kick in in any sensible way you probably want to put that at 1.1 or something. As another suggestion: maybe set a cap and drop the jitter? The jitter seems kinda useless in this case and a cap of 1 or 2 minutes sounds very sensible? https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff
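A quick sketch of that suggestion (hypothetical values, not what the PR ships): wait.Backoff has a Cap field that limits how large the individual delays can grow.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Hypothetical settings for illustration: a factor above 1 so delays grow,
	// a cap on individual delays, and no jitter.
	b := wait.Backoff{Duration: time.Second, Factor: 1.5, Steps: 10, Cap: 5 * time.Second}
	for i := 0; i < 10; i++ {
		fmt.Println(b.Step()) // delays grow by Factor until they hit Cap
	}
}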
Factor 0.8 means that the retry delays get shorter and shorter with each attempt, which is almost certainly not intentional. A more sensible value is 1.2.
1.2 was somewhat arbitrarily chosen. In my stress-testing I observed this taking 67 seconds, so I suggest putting the factor at a point that at least comfortably allows for that. Really, though -- perhaps it'd be better if this were somehow configurable.
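Back-of-the-envelope numbers for the factors being discussed (a rough sketch that assumes a 1 second base duration and 15 sleeps between attempts, and ignores jitter):

package main

import "fmt"

// total sums the geometric series of sleep durations (in seconds) for a given
// factor; jitter is ignored, so treat these as rough estimates only.
func total(factor float64, sleeps int) float64 {
	sum, d := 0.0, 1.0
	for i := 0; i < sleeps; i++ {
		sum += d
		d *= factor
	}
	return sum
}

func main() {
	fmt.Printf("factor 0.8: ~%.0fs total\n", total(0.8, 15)) // ~5s, delays shrink
	fmt.Printf("factor 1.2: ~%.0fs total\n", total(1.2, 15)) // ~72s
	fmt.Printf("factor 1.3: ~%.0fs total\n", total(1.3, 15)) // ~167s
}

That lines up with the "15 times in 6 seconds" observation above, and suggests 1.3 leaves comfortable headroom over the 67 seconds observed in the stress test.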
Since the appropriate number of retries could depend on the configuration size, we might as well make this a flag.
/ok-to-test
Codecov Report
@@ Coverage Diff @@
## master #7086 +/- ##
==========================================
+ Coverage 55.82% 55.98% +0.15%
==========================================
Files 94 94
Lines 6588 6593 +5
==========================================
+ Hits 3678 3691 +13
+ Misses 2449 2441 -8
Partials 461 461
Continue to review full report at Codecov.
Gentle ping, would be great to get some feedback on this. Are the automatically assigned reviewers the appropriate ones (and not overloaded)?
I'll take a look
Overall looks good to me, nice catch.
Please just take a look into the suggestion of changing the logic, but it's mostly a nit.
Also, I'm wondering: doesn't the wait.Backoff plus the declaration of Steps already do that? So instead of implementing the internal counter, shouldn't we just be capturing the error and returning it later?
Factor: 0.8,
Steps: 1 + n.cfg.DynamicConfigurationRetries,
Duration: time.Second,
Factor: 1.3,
Why change the Factor here?
A factor less than 1 means the delays get shorter and shorter: e.g. first 1 second, then 800ms, then 640ms, and so on.
That's not usually what people mean by "exponential backoff", and it doesn't really achieve the same thing of adaptively polling for a process that may take a long time.
Since this was effectively dead code previously, and by the logic above the previous value doesn't seem to have been tested, I set a new value based on how long it took the controller to come up while stress-testing it.
retriesRemaining := retry.Steps
err := wait.ExponentialBackoff(retry, func() (bool, error) {
	err := n.configureDynamically(pcfg)
	if err == nil {
		klog.V(2).Infof("Dynamic reconfiguration succeeded.")
		return true, nil
	}

	retriesRemaining--
	if retriesRemaining > 0 {
		klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
		return false, nil
	}

	klog.Warningf("Dynamic reconfiguration failed: %v", err)
	return false, err
})
Overall this looks good.
I would maybe go into a simpler logic like:
retriesRemaining := retry.Steps
err := wait.ExponentialBackoff(retry, func() (bool, error) {
	// We should early fail
	if retriesRemaining > 0 {
		retriesRemaining--
		err := n.configureDynamically(pcfg)
		if err == nil {
			klog.V(2).Infof("Dynamic reconfiguration succeeded.")
			return true, nil
		}
		klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
		return false, nil
	}
	klog.Warningf("Dynamic reconfiguration failed: %v", err)
	return false, err
})
As I read the code, this is not quite equivalent because of the behaviour when the retries are exhausted. Suppose that retriesRemaining=1 and all attempts to configure will be failing.
The old code does: attempt to configure (and fail), decrement retriesRemaining to zero, don't enter branch, return the error from earlier (won't be retried).
The proposed code does: check retriesRemaining above zero, decrement retriesRemaining to zero, attempt to configure (and fail), return nil (will be retried), go back to the retry mechanics and sleep, then on next call: retriesRemaining is zero, return error [actually it's a little unclear which err you propose to return in this branch, because there's been no attempt to configure, so that "err" isn't in scope].
Besides the question of which error to return, we don't want to sleep for another backoff period (the final and longest one!) once it's already clear that the next call is going to immediately fail.
Open to suggestions for improvement here, but as you can see it's a little subtle.
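For what it's worth, here is one shape the "capture the error and return it later" idea from the earlier comment could take -- a self-contained sketch under assumed names, not the PR code; configure() is a hypothetical stand-in for n.configureDynamically(pcfg):

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// configure is a hypothetical stand-in for n.configureDynamically(pcfg).
func configure() error { return errors.New("connection refused") }

func main() {
	retry := wait.Backoff{Steps: 16, Duration: 10 * time.Millisecond, Factor: 1.3, Jitter: 0.1}

	// Let Steps bound the number of attempts; remember the last real error so
	// it can be surfaced instead of the generic wait.ErrWaitTimeout.
	var lastErr error
	err := wait.ExponentialBackoff(retry, func() (bool, error) {
		lastErr = configure()
		if lastErr == nil {
			return true, nil
		}
		return false, nil // keep retrying until Steps is exhausted
	})
	if errors.Is(err, wait.ErrWaitTimeout) && lastErr != nil {
		err = lastErr
	}
	fmt.Println(err)
}

As far as I can tell from the wait package source, ExponentialBackoff skips the sleep after the final attempt, so this variant would also avoid the extra (longest) backoff period mentioned above.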
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: steinarvk-kolonial. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Thanks so much for taking a look! Responded to your comments in the suggestion threads. This actually hit us again yesterday, so it would still be great to get this merged.
@steinarvk-kolonial just to let you know I didn't forget about this one, I've just been pretty busy the last few days with the v1.0.0 release. Will come back to this PR probably this weekend.
@steinarvk-kolonial: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@steinarvk-kolonial thanks again for your patience. This is next on my queue (promise) and I expect to review again by the end of the week. You can rebase it so it will be easier to merge if everything is ok! :) Thanks
/kind bug
Any update on this? I'm seeing this problem as well.
If a rebased PR would help, let me know -- I've fixed this in my own version today.
@steinarvk-kolonial gentle ping about whether you still want to follow up on it. /assign
@iamNoah1, not sure about @steinarvk-kolonial, but I've been using my own rebased version of this patch and it seems to work. I don't think I can submit that here, though; I would need to open another PR? Let me know if I should do that.
/close
@rikatz: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
NGINXController.syncIngress in internal/ingress/controller/controller.go is clearly intended to do retries with exponential backoff when calling configureDynamically(). However, due to what seems to be a bug, it won't do any retries. This bugfix contribution fixes that bug, so NGINXController.syncIngress will start doing retries according to the retry policy that seems to have been the original intent.
Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.
We've had issues with this "retry" loop on a real deployment on multiple occasions. This manifests as the nginx-controller failing to get healthy and getting stuck in a crashloop for minutes or hours with an error akin to this:

Unexpected failure reconfiguring NGINX: "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"
The hope is that this bugfix will resolve this existing crash-on-startup bug by allowing more time (as much time as originally intended) for port 10246 to start listening.
Since the amount of time required will depend on configuration size, as noted in the pre-existing comment, I've also added a flag to control the number of retries.
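A minimal sketch of what such a flag could look like -- the flag name and default here are hypothetical, for illustration only; the diff only confirms the DynamicConfigurationRetries field:

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag name and default; the PR's actual flag may differ.
	retries := flag.Int("dynamic-configuration-retries", 15,
		"number of times to retry updating the dynamic configuration before giving up")
	flag.Parse()
	fmt.Println("dynamic configuration retries:", *retries)
}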
Which issue/s this PR fixes
might fix #4742
Also referenced in the comments of this one: #4629 (comment)
How Has This Been Tested?
I ran "make test", and everything seemed to pass. (On my development Linux laptop.)
I don't have enough context in the code to know how to easily write a unit test for this. If the reviewers want to require one, pointers on how to set up a test to call a callback function from configureDynamically() would be appreciated.
Update: I attempted to test on a testing cluster. I added lots of ingresses (164 ingresses) and restarted the nginx-controller pods -- first with a control version, then with an image including this patch. I was able to reproduce the "connection refused" error as noted above and as such got to see the retries working. However, in the stress test conditions, even 15 retries were not enough and I ultimately got "Unexpected failure reconfiguring NGINX" again. I was also unable to reproduce the crash loop (both with the control version and the patch) -- both the control version and the patch came up after a few minutes on the second attempt.
The control version I used was: k8s.gcr.io/ingress-nginx/controller:v0.45.0@sha256:c4390c53f348c3bd4e60a5dd6a11c35799ae78c49388090140b9d72ccede1755
This testing attempt also uncovered a secondary bug: 15 retries take around 5 seconds because the Factor is set to 0.8, meaning this is a rarely-seen example of exponential backoff where the retry delay gets shorter and shorter. That's almost certainly not what was intended. I'll set it to 1.2 instead.
Update 2: tried again with the fixed backoff factor and overall more lax settings. In my stress test conditions, the controller came up after 67 seconds. I've suggested bumping the factor to 1.3 so that this apparently-realistic-if-overloaded scenario can finish successfully. Really, though -- perhaps the retry budget here should be somehow configurable? (Added a flag.)