Fix for buggy ingress sync with retries #8325
Conversation
@davideshay: This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @davideshay!
Hi @davideshay. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This is an alternate PR for #7086, which has been updated to 1.1.2. I have tried to fix/create the CLA but may need more guidance there on how to do that as an individual contributor.
Looks good to me, @rikatz please take a look at this small change |
Thanks
/assign
/ok-to-test
At this point I don't know how to test this with a unit test or other test. On my cluster, the unpatched code would fail maybe 40% of the time, and instead of retrying in code, the pod would fail maybe 7-8 times or more, then finally work.
I don't think so. The retry code is not new; due to a logic error, it just was not being executed previously.
I don't think this is the case (port 10246 being occupied). I think sometimes my cluster might just be slower. When this code now activates, it may get down to 9 or 10 retries, typically no further. If it does get through all 15 retries, the pod will fail and restart.
I personally never experienced the problem this PR is trying to solve, and I have not seen a large number of users report it. Hence I am not really sure why we should make a change in main when nothing is broken or being reported as a problem by a quorum of users.
There is no assurance that this change will not introduce any race condition(s) during the 15 retries for users other than your use case. If the problem is limited to your use case, then you ought to be maintaining a fork with your 15-retries changes.
But others may have a different opinion, so wait for comments and labels. Apologies for not being OK with this, but making this sort of change when there is no large user base reporting a problem is reason enough to "not fix what is not broken".
Understood. I know that @steinarvk-kolonial, who made this original change, had the problem as well. In many cases the ingress will EVENTUALLY sync, but it may crash the pod many times before it finally works. Others may have that symptom and not report it.
Just for clarity, the code was always meant to retry 15 times (and most of that code was already there), but due to some incorrect logic it would not do that retry. In addition, the back-off factor in the original code was less than 1, so even if it had executed, the delays would have been getting progressively shorter rather than progressively longer (see the short numeric illustration after this comment).
I am maintaining that fork and use it every day. I would certainly rather get the fix into the main code base, but I'll keep doing that if I have to. If you really don't want ANY retries, which is the effect of the current code, then in my mind you should either remove all of that inactive code or go forward with some kind of fix, rather than leaving code which doesn't serve its intended purpose. Happy to do whatever is needed and have more discussion.
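To make the back-off-factor point above concrete, here is a tiny numeric illustration with assumed values (1s starting delay, five attempts); it is not the controller's actual configuration, it only shows that a factor below 1 shrinks the delay between attempts while a factor above 1 grows it.

```go
package main

import "fmt"

func main() {
	// Assumed values for illustration only: starting delay of 1s, five
	// attempts, one factor below 1 and one above 1.
	for _, factor := range []float64{0.8, 1.3} {
		delay := 1.0 // seconds
		fmt.Printf("factor %.1f:", factor)
		for i := 0; i < 5; i++ {
			fmt.Printf(" %.2fs", delay)
			delay *= factor
		}
		fmt.Println()
	}
}
```

With a factor of 0.8 the waits shrink (1.00s, 0.80s, 0.64s, ...), while with 1.3 they grow (1.00s, 1.30s, 1.69s, ...), which is the behavior the retry loop was presumably meant to have.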
Hi David,
Those were my opinions.
@theunrealgeek already said ok.
Let's wait for other comments.
Thanks,
Long Wu Yuan
/assign
Thanks
The branch was force-pushed from 5c4ab21 to de619cb, and then from de619cb to e1b5599.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davideshay, rikatz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
This is a rebased version of PR 7086, against 1.1.2. All code was developed originally by steinarvk-kolonial.
What this PR does / why we need it:
NGINXController.syncIngress in internal/ingress/controller/controller.go is clearly intended to do retries with exponential backoff when calling configureDynamically(). However, due to what seems to be a bug, it won't do any retries. This bugfix contribution fixes that bug, so NGINXController.syncIngress will start doing retries according to the retry policy that seems to have been the original intent.
Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.
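For readers unfamiliar with wait.ExponentialBackoff, below is a minimal, self-contained sketch of the condition-function pattern it expects; it is not the controller's exact code, and the backoff values and the configureDynamically stub are illustrative assumptions. Returning a non-nil error aborts the loop on the first attempt (the buggy shape described above), while returning (false, nil) lets the backoff keep retrying (the fixed shape).

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// configureDynamically is a stand-in for the real call in the controller;
// assume it can fail transiently while port 10246 is not yet listening.
func configureDynamically() error {
	return fmt.Errorf("dial tcp 127.0.0.1:10246: connect: connection refused")
}

func main() {
	// Illustrative values only; Factor must be > 1 for the delays to grow.
	backoff := wait.Backoff{
		Steps:    15,
		Duration: 1 * time.Second,
		Factor:   1.3,
		Jitter:   0.1,
	}

	// Buggy shape: a failure is surfaced as an error, so ExponentialBackoff
	// stops after the first attempt and never retries.
	_ = wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := configureDynamically(); err != nil {
			return false, err // aborts the loop immediately
		}
		return true, nil
	})

	// Fixed shape: transient failures return (false, nil), so the loop keeps
	// retrying until Steps is exhausted, then returns a timeout error.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := configureDynamically(); err != nil {
			return false, nil // retry on the next step
		}
		return true, nil
	})
	if err != nil {
		fmt.Println("still failing after retries:", err)
	}
}
```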
We've had issues with this "retry" loop on a real deployment on multiple occasions. This manifests as the nginx-controller failing to get healthy and getting stuck in a crashloop for minutes or hours with errors akin to this:
Unexpected failure reconfiguring NGINX: "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"
The hope is that this bugfix will resolve this existing crash-on-startup bug by allowing more time (as much time as originally intended) for port 10246 to start listening.
Since the amount of time required will depend on configuration size, as noted in the pre-existing comment, I've also added a flag to control the number of retries.
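As a rough sketch of how such a retry-count flag could be wired into the backoff (the flag name, default, and backoff values below are assumptions for illustration, not necessarily what this PR uses):

```go
package main

import (
	"flag"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical flag; the actual name and default used by the PR may differ.
var dynamicConfigRetries = flag.Int("dynamic-configuration-retries", 15,
	"number of attempts to apply the dynamic configuration before giving up")

// newBackoff builds the backoff used around configureDynamically, taking the
// number of steps from the flag so larger configurations can allow more time.
func newBackoff() wait.Backoff {
	return wait.Backoff{
		Steps:    *dynamicConfigRetries,
		Duration: 1 * time.Second,
		Factor:   1.3,
		Jitter:   0.1,
	}
}

func main() {
	flag.Parse()
	_ = newBackoff() // would be passed to wait.ExponentialBackoff in the sync path
}
```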
Types of changes
Which issue/s this PR fixes
possibly #4629 and/or #4742
How Has This Been Tested?
steinarvk-kolonial previously tested.
Using a GitHub Action, I rebuilt it based on 1.1.2 and tested it on my mixed arm64/amd64 cluster; it comes up without the previous failures.
Checklist: