
Fix for buggy ingress sync with retries #8325

Merged
merged 1 commit into kubernetes:main on Apr 11, 2022

Conversation

davideshay
Contributor

This is a rebased version of PR 7086, against 1.1.2. All code was developed originally by steinarvk-kolonial.

What this PR does / why we need it:

NGINXController.syncIngress in internal/ingress/controller/controller.go is clearly intended to retry with exponential backoff when calling configureDynamically(). However, due to what seems to be a bug, it never retries at all. This contribution fixes that bug, so NGINXController.syncIngress will retry according to the policy that appears to have been the original intent.

Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.
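For illustration only, here is a minimal sketch of the retry-condition pattern the fix applies. The helper name configureWithRetries and the concrete backoff values are assumptions made for this sketch; the real logic lives in NGINXController.syncIngress and may differ in detail:

```go
package controller

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// configureWithRetries is a stand-alone sketch of the fixed retry loop; the
// real code calls configureDynamically directly from syncIngress.
func configureWithRetries(configure func() error) error {
	var lastErr error
	backoff := wait.Backoff{
		Steps:    15,              // number of attempts (the PR makes this configurable)
		Duration: 1 * time.Second, // illustrative initial delay
		Factor:   1.3,             // greater than 1, so the delay grows between attempts
		Jitter:   0.1,
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if cfgErr := configure(); cfgErr != nil {
			lastErr = cfgErr
			// Returning (false, nil) tells ExponentialBackoff to retry.
			// The buggy version returned (false, err), which makes
			// ExponentialBackoff stop immediately with that error.
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		if lastErr != nil {
			return lastErr // surface the last reconfiguration error, not ErrWaitTimeout
		}
		return err
	}
	return nil
}
```

The key point is the condition function's return values: only (false, nil) lets ExponentialBackoff sleep and try again; returning any non-nil error ends the loop on the spot.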

We've had issues with this "retry" loop on a real deployment on multiple occasions. This manifests as the nginx-controller failing to get healthy and getting stuck in a crashloop for minutes or hours with errors akin to this:

Unexpected failure reconfiguring NGINX: "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"

The hope is that this bugfix will resolve this existing crash-on-startup bug by allowing more time (as much time as originally intended) for port 10246 to start listening.

Since the amount of time required will depend on configuration size, as noted in the pre-existing comment, I've also added a flag to control the number of retries.
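As a sketch of what a configurable retry count could look like, assuming a hypothetical flag name (the flag actually added by this PR may be named and wired differently):

```go
package main

import (
	"flag"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// The flag name below is hypothetical; it only illustrates an operator-tunable
// retry count and may not match the flag the PR actually adds.
var dynamicConfigurationRetries = flag.Int("dynamic-configuration-retries", 15,
	"Number of attempts for the dynamic NGINX reconfiguration retry loop.")

func main() {
	flag.Parse()
	backoff := wait.Backoff{
		Steps:    *dynamicConfigurationRetries, // retry count driven by the flag
		Duration: 1 * time.Second,
		Factor:   1.3,
		Jitter:   0.1,
	}
	fmt.Printf("retry policy: %+v\n", backoff)
}
```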

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

Which issue/s this PR fixes

possibly #4629 and/or #4742

How Has This Been Tested?

steinarvk-kolonial tested this previously.
Using a GitHub Action, I rebuilt the image based on 1.1.2 and tested it on my mixed arm64/amd64 cluster; it comes up without the previous failures.

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 10, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: davideshay / name: David Shay (5c4ab211c2e388c99544e10e7cb37d4ef4760b1c)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 10, 2022
@k8s-ci-robot
Contributor

@davideshay: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

Welcome @davideshay!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @davideshay. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 10, 2022
@davideshay
Contributor Author

This is an alternate PR for #7086, updated to 1.1.2. I have tried to fix/create the CLA but may need more guidance on how to do that as an individual contributor.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 11, 2022
@theunrealgeek
Contributor

Looks good to me, @rikatz please take a look at this small change

Member

@tao12345666333 tao12345666333 left a comment


Thanks

/assign
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 22, 2022
@longwuyuan
Contributor

longwuyuan commented Mar 22, 2022

  • Are there no tests needed ?
  • Is there any chance for a race to occur during those 15 retries ?
  • Port 10246 is documented as the Lua HTTP endpoint. Is there any chance that something higher up is broken (for example a broken CNI/host-network or accidental/undesired packet filtering), causing port 10246 to be persistently unavailable? In that case all 15 retries would be guaranteed to run, and failure after 15 retries would also be guaranteed.

@davideshay
Contributor Author

  • Are there no tests needed ?

At this point I don't know how to cover this with a unit test or other automated test. On my cluster, the unpatched code would fail maybe 40% of the time; instead of retrying in code, the pod would crash maybe 7-8 times or more before it finally worked.

  • Is there any chance for a race to occur during those 15 retries ?

I don't think so. The retry code is not new; due to a logic error it simply was not being executed previously.

  • Port 10246 is documented as the Lua HTTP endpoint. Is there any chance that something higher up is broken (for example a broken CNI/host-network or accidental/undesired packet filtering), causing port 10246 to be persistently unavailable? In that case all 15 retries would be guaranteed to run, and failure after 15 retries would also be guaranteed.

I don't think this is the case (port 10246 being occupied). I think sometimes my cluster might just be slower. When this code now activates, it may get down to 9 or 10 retries, typically no further. If it does get through all 15 retries, the pod will fail and restart.

@longwuyuan
Contributor

I personally never experienced the problem this PR is trying to solve. Also, I have not seen a large number of users report the problem you are trying to solve. Hence I am not really sure why we should make a change in main, when nothing is broken or being reported as a problem by a quorum of users.

There is no assurance that this change will not introduce any race-condition(s) during the 15 retries for other users, besides your use-case.

If the problem is limited to your use case, then you ought to be maintaining a fork with your 15 retries changes.

But others may have a different opinion, so wait for comments and labels.

Apologies for not being OK with this, but making this sort of change when no large user base is reporting a problem seems like a case of "don't fix what isn't broken".

@davideshay
Contributor Author

I personally never experienced the problem this PR is trying to solve. Also, I have not seen a large number of users report the problem you are trying to solve. Hence I am not really sure why we should make a change in main, when nothing is broken or being reported as a problem by a quorum of users.

Understood. I know that @steinarvk-kolonial, who made the original change, had the problem as well. In many cases the ingress will EVENTUALLY sync, but the pod may crash many times before it finally works. Others may have that symptom and not report it.

There is no assurance that this change will not introduce any race-condition(s) during the 15 retries for other users, besides your use-case.

Just for clarity, the code was always meant to retry 15 times (most of that code was already there), but due to incorrect logic it would never actually retry. In addition, the back-off factor in the original code was less than 1, so even if the loop had run, the delays would have grown progressively shorter rather than progressively longer.
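A small self-contained sketch, not code from the repository, showing why a factor below 1 defeats the purpose: it approximates how wait.Backoff scales the delay by its Factor after each attempt (jitter and cap ignored):

```go
package main

import (
	"fmt"
	"time"
)

// delays mimics how wait.Backoff multiplies its delay by Factor after each
// attempt (ignoring jitter and cap), purely to show the effect of the factor.
func delays(initial time.Duration, factor float64, steps int) []time.Duration {
	out := make([]time.Duration, 0, steps)
	d := initial
	for i := 0; i < steps; i++ {
		out = append(out, d)
		d = time.Duration(float64(d) * factor)
	}
	return out
}

func main() {
	fmt.Println("factor 0.8:", delays(time.Second, 0.8, 5)) // shrinking: what a factor below 1 would have produced
	fmt.Println("factor 1.3:", delays(time.Second, 1.3, 5)) // growing: genuine exponential backoff
}
```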

If the problem is limited to your use case, then you ought to be maintaining a fork with your 15 retries changes.

But others may have a different opinion, so wait for comments and labels.

Apologies for not being OK with this, but making this sort of change when no large user base is reporting a problem seems like a case of "don't fix what isn't broken".

I am maintaining that fork and use it every day. I would certainly rather get the fix into the main code base, but I'll keep doing that if I have to. If you really don't want ANY retries, which is the effect of the current code, then in my mind you should either remove all of that inactive code or go forward with some kind of fix, rather than leaving code which doesn't serve its intended purpose.

Happy to do whatever is needed and have more discussion.

@longwuyuan
Contributor

longwuyuan commented Mar 22, 2022 via email

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2022
@rikatz
Contributor

rikatz commented Apr 10, 2022

/assign
Please rebase, and ping me in Slack for a review. Sorry, I've been focused on another major delivery and didn't have time to check this before.

Thanks

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 11, 2022
@rikatz
Contributor

rikatz commented Apr 11, 2022

/lgtm
/approve
This should go to v1.2 beta
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davideshay, rikatz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 11, 2022
@k8s-ci-robot k8s-ci-robot merged commit 47a266d into kubernetes:main Apr 11, 2022
@davideshay davideshay deleted the fixingresssync branch April 11, 2022 18:47
@davideshay davideshay restored the fixingresssync branch April 11, 2022 18:47
rchshld pushed a commit to joomcode/ingress-nginx that referenced this pull request May 19, 2023
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • needs-kind: Indicates a PR lacks a `kind/foo` label and requires one.
  • needs-priority
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.