Fix for buggy ingress sync with retries #8325
Conversation
@davideshay: This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @davideshay!
Hi @davideshay. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This is an alternate PR for #7086, which has been updated to 1.1.2. I have tried to fix/create the CLA but may need more guidance there on how to do that as an individual contributor.
Looks good to me, @rikatz please take a look at this small change |
Thanks
/assign
/ok-to-test
At this point I don't know how to test this with a unit test or other test. On my cluster, the unpatched code would fail maybe 40% of the time, and instead of retrying in code, the pod would fail maybe 7-8 times or more, then finally work.
I don't think so. The retry code is not new; due to a logic error, it just was not being executed previously.
I don't think this is the case (port 10246 being occupied). I think sometimes my cluster might just be slower. When this code now activates, it may get down to 9 or 10 retries, typically no further. If it does get through all 15 retries, the pod will fail and restart.
I personally never experienced the problem this PR is trying to solve, and I have not seen a large number of users report it. Hence I am not really sure why we should make a change in main when nothing is broken or being reported as a problem by a quorum of users.
There is no assurance that this change will not introduce any race condition(s) during the 15 retries for users other than your use case. If the problem is limited to your use case, then you ought to be maintaining a fork with your 15-retries changes.
But others may have a different opinion, so wait for comments and labels. Apologies for not being OK with this, but making this sort of change when there is no large user base reporting a problem is reason enough to "not fix what is not broken".
Understood. I know that @steinarvk-kolonial, who made this original change, had the problem as well. In many cases the ingress will EVENTUALLY sync, but it may crash the pod many times before it finally works. Others may have that symptom and not report it.
Just for clarity, the code was always meant to retry 15 times (and most of that code was already there), but due to some incorrect logic it would not do that retry. In addition, the back-off factor in the original code was less than 1, so even if it had executed, the delays would have been getting progressively shorter rather than progressively longer (see the short numeric illustration after this comment).
I am maintaining that fork and use it every day. I would certainly rather get the fix into the main code base, but I'll keep doing that if I have to. If you really don't want ANY retries, which is the effect of the current code, then in my mind you should either remove all of that inactive code or go forward with some kind of fix, rather than leaving code which doesn't serve its intended purpose. Happy to do whatever is needed and have more discussion.
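To make the back-off-factor point above concrete, here is a tiny numeric illustration with assumed values (1s starting delay, five attempts); it is not the controller's actual configuration, it only shows that a factor below 1 shrinks the delay between attempts while a factor above 1 grows it.

```go
package main

import "fmt"

func main() {
	// Assumed values for illustration only: starting delay of 1s, five
	// attempts, one factor below 1 and one above 1.
	for _, factor := range []float64{0.8, 1.3} {
		delay := 1.0 // seconds
		fmt.Printf("factor %.1f:", factor)
		for i := 0; i < 5; i++ {
			fmt.Printf(" %.2fs", delay)
			delay *= factor
		}
		fmt.Println()
	}
}
```

With a factor of 0.8 the waits shrink (1.00s, 0.80s, 0.64s, ...), while with 1.3 they grow (1.00s, 1.30s, 1.69s, ...), which is the behavior the retry loop was presumably meant to have.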
Hi David,
Those were my opinions.
@theunrealgeek already said ok.
Let's wait for other comments.
Thanks,
Long Wu Yuan
/assign
Thanks
The branch was force-pushed from 5c4ab21 to de619cb, and then from de619cb to e1b5599.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davideshay, rikatz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
This is a rebased version of PR 7086, against 1.1.2. All code was developed originally by steinarvk-kolonial.
What this PR does / why we need it:
NGINXController.syncIngress in internal/ingress/controller/controller.go is clearly intended to do retries with exponential backoff when calling configureDynamically(). However, due to what seems to be a bug, it won't do any retries. This bugfix contribution fixes that bug, so NGINXController.syncIngress will start doing retries according to the retry policy that seems to have been the original intent.
Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.
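For readers unfamiliar with wait.ExponentialBackoff, below is a minimal, self-contained sketch of the condition-function pattern it expects; it is not the controller's exact code, and the backoff values and the configureDynamically stub are illustrative assumptions. Returning a non-nil error aborts the loop on the first attempt (the buggy shape described above), while returning (false, nil) lets the backoff keep retrying (the fixed shape).

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// configureDynamically is a stand-in for the real call in the controller;
// assume it can fail transiently while port 10246 is not yet listening.
func configureDynamically() error {
	return fmt.Errorf("dial tcp 127.0.0.1:10246: connect: connection refused")
}

func main() {
	// Illustrative values only; Factor must be > 1 for the delays to grow.
	backoff := wait.Backoff{
		Steps:    15,
		Duration: 1 * time.Second,
		Factor:   1.3,
		Jitter:   0.1,
	}

	// Buggy shape: a failure is surfaced as an error, so ExponentialBackoff
	// stops after the first attempt and never retries.
	_ = wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := configureDynamically(); err != nil {
			return false, err // aborts the loop immediately
		}
		return true, nil
	})

	// Fixed shape: transient failures return (false, nil), so the loop keeps
	// retrying until Steps is exhausted, then returns a timeout error.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := configureDynamically(); err != nil {
			return false, nil // retry on the next step
		}
		return true, nil
	})
	if err != nil {
		fmt.Println("still failing after retries:", err)
	}
}
```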
We've had issues with this "retry" loop on a real deployment on multiple occasions. This manifests as the nginx-controller failing to get healthy and getting stuck in a crashloop for minutes or hours with errors akin to this:
Unexpected failure reconfiguring NGINX: "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"
The hope is that this bugfix will resolve this existing crash-on-startup bug by allowing more time (as much time as originally intended) for port 10246 to start listening.
Since the amount of time required will depend on configuration size, as noted in the pre-existing comment, I've also added a flag to control the number of retries.
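As a rough sketch of how such a retry-count flag could be wired into the backoff (the flag name, default, and backoff values below are assumptions for illustration, not necessarily what this PR uses):

```go
package main

import (
	"flag"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical flag; the actual name and default used by the PR may differ.
var dynamicConfigRetries = flag.Int("dynamic-configuration-retries", 15,
	"number of attempts to apply the dynamic configuration before giving up")

// newBackoff builds the backoff used around configureDynamically, taking the
// number of steps from the flag so larger configurations can allow more time.
func newBackoff() wait.Backoff {
	return wait.Backoff{
		Steps:    *dynamicConfigRetries,
		Duration: 1 * time.Second,
		Factor:   1.3,
		Jitter:   0.1,
	}
}

func main() {
	flag.Parse()
	_ = newBackoff() // would be passed to wait.ExponentialBackoff in the sync path
}
```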
Types of changes
Which issue/s this PR fixes
possibly #4629 and/or #4742
How Has This Been Tested?
steinarvk-kolonial previously tested.
Using a GitHub Action, I rebuilt it based on 1.1.2 and tested it on my mixed arm64/amd64 cluster; it comes up without the previous failures.
Checklist: