
Fix buggy retry logic in syncIngress() #7086

Closed

Conversation


@steinarvk-oda steinarvk-oda commented Apr 28, 2021

What this PR does / why we need it:

NGINXController.syncIngress in internal/ingress/controller/controller.go is clearly intended to do retries with exponential backoff when calling configureDynamically(). However, due to what seems to be a bug, it won't do any retries. This bugfix contribution fixes that bug, so NGINXController.syncIngress will start doing retries according to the retry policy that seems to have been the original intent.

Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.
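For illustration, here is a minimal, self-contained sketch of the two condition-function shapes (simplified stand-in names, not the actual controller code; the short base Duration is only there to keep the demo fast):

    package main

    import (
        "errors"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Same shape of policy as the controller's, but with a short base
        // Duration so the demo finishes quickly.
        retry := wait.Backoff{Steps: 15, Duration: 10 * time.Millisecond, Factor: 1.3, Jitter: 0.1}

        // Stand-in for n.configureDynamically(pcfg) that always fails.
        attempt := func() error { return errors.New("connection refused") }

        // Broken shape: every outcome is either (true, nil) or (false, err),
        // so ExponentialBackoff returns after the very first failed attempt.
        errBroken := wait.ExponentialBackoff(retry, func() (bool, error) {
            if err := attempt(); err != nil {
                return false, err // returning an error stops the backoff loop
            }
            return true, nil
        })
        fmt.Println("broken shape:", errBroken) // fails immediately, no retries

        // Retrying shape: (false, nil) tells ExponentialBackoff to sleep and
        // call the condition again, up to Steps attempts.
        errRetrying := wait.ExponentialBackoff(retry, func() (bool, error) {
            if err := attempt(); err != nil {
                return false, nil // ask for another attempt
            }
            return true, nil
        })
        fmt.Println("retrying shape:", errRetrying) // times out only after 15 attempts
    }

The actual fix (quoted further down in the review) keeps the retrying shape but returns the underlying error once retries are exhausted, so the caller still sees why reconfiguration failed instead of a generic timeout.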

We've had issues with this "retry" loop on a real deployment on multiple occasions. It manifests as the nginx-controller failing to become healthy and getting stuck in a crashloop for minutes or hours, with errors akin to this:

Unexpected failure reconfiguring NGINX:
"requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"

The hope is that this bugfix will resolve this existing crash-on-startup bug by allowing more time (as much time as originally intended) for port 10246 to start listening.

Since the amount of time required will depend on configuration size, as noted in the pre-existing comment, I've also added a flag to control the number of retries.
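As a rough illustration of how such a flag could be wired up with pflag: the flag name, default, and parsing below are my own placeholders based on the n.cfg.DynamicConfigurationRetries field visible in the review diff further down, not necessarily what this PR ships.

    package main

    import (
        "fmt"

        "github.com/spf13/pflag"
    )

    func main() {
        flags := pflag.NewFlagSet("nginx-ingress-controller", pflag.ExitOnError)

        // Hypothetical flag; name and default are placeholders.
        dynamicConfigurationRetries := flags.Int("dynamic-configuration-retries", 15,
            "Number of times to retry a failed dynamic configuration before giving up.")

        _ = flags.Parse([]string{"--dynamic-configuration-retries=20"})

        // The parsed value would then size the backoff policy, e.g.
        //   retry.Steps = 1 + cfg.DynamicConfigurationRetries
        fmt.Println("retries:", *dynamicConfigurationRetries)
    }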

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Which issue/s this PR fixes

might fix #4742

Also referenced in the comments of this one: #4629 (comment)

How Has This Been Tested?

I ran "make test", and everything seemed to pass. (On my development Linux laptop.)

I don't have enough context in the code to know how to easily write a unit test for this. If the reviewers want to require one, pointers on how to set up a test to call a callback function from configureDynamically() would be appreciated.

Update: I attempted to test on a testing cluster. I added lots of ingresses (164 ingresses) and restarted the nginx-controller pods -- first with a control version, then with an image including this patch. I was able to reproduce the "connection refused" error as noted above and as such got to see the retries working. However, in the stress test conditions, even 15 retries were not enough and I ultimately got "Unexpected failure reconfiguring NGINX" again. I was also unable to reproduce the crash loop (both with the control version and the patch) -- both the control version and the patch came up after a few minutes on the second attempt.

The control version I used was: k8s.gcr.io/ingress-nginx/controller:v0.45.0@sha256:c4390c53f348c3bd4e60a5dd6a11c35799ae78c49388090140b9d72ccede1755

This testing attempt also uncovered a secondary bug: the 15 retries take only around 5 seconds in total because the Factor is set to 0.8, making this a rarely-seen example of "exponential backoff" where the retry delay gets shorter with each attempt. That's almost certainly not what was intended. I'll set it to 1.2 instead.
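To make the numbers concrete, here is a rough back-of-the-envelope of the total sleep budget for 15 steps at Duration = 1s, ignoring jitter (the ~5 seconds for factor 0.8 matches the observation above; the other rows show roughly what the proposed factors would allow):

    package main

    import "fmt"

    // totalBackoff sums the successive delays d, d*f, d*f^2, ... over n steps.
    func totalBackoff(d, factor float64, steps int) float64 {
        total, delay := 0.0, d
        for i := 0; i < steps; i++ {
            total += delay
            delay *= factor
        }
        return total
    }

    func main() {
        fmt.Printf("factor 0.8: ~%.1fs (delays shrink each attempt)\n", totalBackoff(1, 0.8, 15)) // ~4.8s
        fmt.Printf("factor 1.2: ~%.1fs\n", totalBackoff(1, 1.2, 15))                              // ~72s
        fmt.Printf("factor 1.3: ~%.1fs\n", totalBackoff(1, 1.3, 15))                              // ~167s
    }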

Update 2: tried again with the fixed backoff factor and overall more lax settings. In my stress test conditions, the controller came up after 67 seconds. I've suggested bumping the factor to 1.3 so that this apparently-realistic-if-overloaded scenario can finish successfully. Really, though -- perhaps the retry budget here should be somehow configurable? (Added a flag.)

Checklist:

  • My change requires a change to the documentation. (If we're adding a flag, should document it.)
  • I have updated the documentation accordingly. (Waiting to see whether reviewers think this warrants a flag. If so, please point me to the docs that should be updated.)
  • I've read the CONTRIBUTION guide
  • I have added tests to cover my changes.
  • All new and existing tests passed.

This code looks like it was intended to do retries with exponential
backoff, but due to a bug it seems like it did no retries at all.

Per the documentation for wait.ExponentialBackoff, this function
"stops and returns as soon as [...] the condition check returns true
or an error".

The previous implementation of the condition check always returned
either true (success) or an error (failure) -- as such, retries
would never actually trigger.

The new implementation instead checks whether there are retries
left (according to the intended policy of 15 retries). If there
are retries left, it logs the error but instead returns "false, nil",
which should trigger a retry from wait.ExponentialBackoff.

We've had issues with this "retry" loop on a real deployment on
multiple occasions; the nginx-controller failing to become healthy
in time and getting stuck in a crashloop for minutes or hours, with
errors akin to this:

    Unexpected failure reconfiguring NGINX:
    "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"

The hope is that this bugfix will resolve this existing
crash-on-startup bug.
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2021
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 28, 2021
@k8s-ci-robot
Contributor

Welcome @steinarvk-kolonial!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 28, 2021
@k8s-ci-robot
Contributor

Hi @steinarvk-kolonial. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 28, 2021
@steinarvk-oda steinarvk-oda marked this pull request as ready for review April 28, 2021 19:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2021
@RmStorm

RmStorm commented Apr 28, 2021

uhm @steinarvk-kolonial did you notice that, in addition to never retrying, the backoff factor is smaller than 1? 🙈 I don't think it's supposed to be like that. It basically tries 15 times in 6 seconds, and it speeds up near the end. For this backoff behavior to actually kick in in any sensible way you probably want to set that to 1.1 or something.

As another suggestion... maybe set a cap and drop the jitter? The jitter seems kinda useless in this case, and a cap of 1 or 2 minutes sounds very sensible? https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff
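Concretely, something like the sketch below, using wait.Backoff's Cap and Jitter fields (the values are illustrative, not a tested recommendation):

    package main

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Capped, jitter-free variant of the retry policy (illustrative values).
        retry := wait.Backoff{
            Steps:    15,
            Duration: time.Second,
            Factor:   1.3,
            Jitter:   0,               // drop the jitter
            Cap:      2 * time.Minute, // bound the growing delay
        }
        fmt.Printf("%+v\n", retry)
    }

One caveat worth double-checking against the wait package docs: once a step would exceed Cap, the remaining Steps are zeroed, so with wait.ExponentialBackoff the cap effectively ends the retry loop rather than just clamping each delay.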

Factor 0.8 means that the retry delays get
shorter and shorter with each attempt, which
is almost certainly not intentional. A more
sensible value is 1.2.

1.2 was somewhat arbitrarily chosen. In my
stress-testing I observed this taking 67 seconds,
so I suggest putting the factor at a point that
at least comfortably allows for that.

Really, though -- perhaps it'd be better if
this were somehow configurable.

Since the appropriate number of retries could
depend on the configuration size, might as well
make this a flag.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 28, 2021
@rikatz
Contributor

rikatz commented Apr 29, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 29, 2021
@codecov-commenter

Codecov Report

Merging #7086 (981c52c) into master (bacd735) will increase coverage by 0.15%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7086      +/-   ##
==========================================
+ Coverage   55.82%   55.98%   +0.15%     
==========================================
  Files          94       94              
  Lines        6588     6593       +5     
==========================================
+ Hits         3678     3691      +13     
+ Misses       2449     2441       -8     
  Partials      461      461              
Impacted Files                                          Coverage Δ
internal/ingress/controller/nginx.go                    28.71% <ø> (ø)
internal/ingress/controller/store/store.go              58.94% <16.66%> (ø)
internal/ingress/controller/controller.go               46.53% <27.27%> (-0.15%) ⬇️
cmd/nginx/flags.go                                      81.90% <100.00%> (+0.27%) ⬆️
internal/ingress/types_equals.go                        17.96% <100.00%> (+3.80%) ⬆️
...ternal/ingress/annotations/globalratelimit/main.go   64.70% <0.00%> (-0.30%) ⬇️
internal/ingress/controller/template/template.go        76.82% <0.00%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ada2300...981c52c. Read the comment docs.

@steinarvk-oda
Author

Gentle ping, would be great to get some feedback on this. Are the automatically assigned reviewers the appropriate ones (and not overloaded)?

@rikatz
Contributor

rikatz commented Jun 27, 2021

I'll take a look
/cc

@k8s-ci-robot k8s-ci-robot requested a review from rikatz June 27, 2021 22:25
Contributor

@rikatz rikatz left a comment


Overall looks good to me, nice catch.

Please just take a look at the suggestion of changing the logic, but it's mostly a nit.

Also, I'm thinking: doesn't wait.Backoff plus the declared Steps already do that? So instead of implementing the internal counter, shouldn't we just be capturing the error and returning it later?

Factor: 0.8,
Steps: 1 + n.cfg.DynamicConfigurationRetries,
Duration: time.Second,
Factor: 1.3,
Contributor


Why change the Factor here?

Author


A factor less than 1 means the delays get shorter and shorter: e.g. first 1 second, then 800ms, then 640ms, and so on.

That's not usually what people mean by "exponential backoff", and it doesn't really achieve the same thing of adaptively polling for a process that may take a long time.

Since this was effectively dead code previously, and by the logic above the previous value doesn't seem to have been tested, I set a new value based on how long it took the controller to come up while stress-testing it.

Comment on lines +187 to 203
    retriesRemaining := retry.Steps
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        err := n.configureDynamically(pcfg)
        if err == nil {
            klog.V(2).Infof("Dynamic reconfiguration succeeded.")
            return true, nil
        }

        retriesRemaining--
        if retriesRemaining > 0 {
            klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
            return false, nil
        }

        klog.Warningf("Dynamic reconfiguration failed: %v", err)
        return false, err
    })
Contributor


Overall this looks good.

I would maybe go into a simpler logic like:

Suggested change
(current:)

    retriesRemaining := retry.Steps
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        err := n.configureDynamically(pcfg)
        if err == nil {
            klog.V(2).Infof("Dynamic reconfiguration succeeded.")
            return true, nil
        }
        retriesRemaining--
        if retriesRemaining > 0 {
            klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
            return false, nil
        }
        klog.Warningf("Dynamic reconfiguration failed: %v", err)
        return false, err
    })

(suggested:)

    retriesRemaining := retry.Steps
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        # We should early fail
        if retriesRemaining > 0 {
            retriesRemaining--
            err := n.configureDynamically(pcfg)
            if err == nil {
                klog.V(2).Infof("Dynamic reconfiguration succeeded.")
                return true, nil
            }
            klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
            return false, nil
        }
        klog.Warningf("Dynamic reconfiguration failed: %v", err)
        return false, err
    }

Author


As I read the code, this is not quite equivalent because of the behaviour when the retries are exhausted. Suppose that retriesRemaining=1 and all attempts to configure will be failing.

The old code does: attempt to configure (and fail), decrement retriesRemaining to zero, don't enter branch, return the error from earlier (won't be retried).

The proposed code does: check retriesRemaining above zero, decrement retriesRemaining to zero, attempt to configure (and fail), return nil (will be retried), go back to the retry mechanics and sleep, then on next call: retriesRemaining is zero, return error [actually it's a little unclear which err you propose to return in this branch, because there's been no attempt to configure, so that "err" isn't in scope].

Besides the question of which error to return, we don't want to sleep for another backoff period (the final and longest one!) once it's already clear that the next call is going to immediately fail.

Open to suggestions for improvement here, but as you can see it's a little subtle.
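For comparison, here is a sketch of the "capture the error and return it later" shape, relying on retry.Steps instead of a manual counter (an illustration of the alternative being discussed, not code from the PR; it reuses retry, n and pcfg from the hunk above):

    var lastErr error
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        lastErr = n.configureDynamically(pcfg)
        if lastErr == nil {
            klog.V(2).Infof("Dynamic reconfiguration succeeded.")
            return true, nil
        }
        klog.Warningf("Dynamic reconfiguration failed: %v", lastErr)
        return false, nil // always ask to retry; ExponentialBackoff enforces retry.Steps
    })
    if err == wait.ErrWaitTimeout && lastErr != nil {
        err = lastErr // surface the real error rather than the generic timeout
    }

The trade-off is that the log line no longer says how many retries are left, and the failure warning is logged even on the final attempt.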

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: steinarvk-kolonial
To complete the pull request process, please assign rikatz after the PR has been reviewed.
You can assign the PR to them by writing /assign @rikatz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@steinarvk-oda steinarvk-oda requested a review from rikatz July 9, 2021 10:35
@steinarvk-oda
Author

Thanks so much for taking a look! Responded to your comments in the suggestion threads.

This actually hit us again yesterday, so would still be great to get merged.

@rikatz rikatz changed the base branch from master to main July 16, 2021 13:01
@rikatz
Contributor

rikatz commented Jul 28, 2021

@steinarvk-kolonial just to let you know I didn't forget about this one, I've just been pretty busy the last few days with the v1.0.0 release.

Will come back to this PR probably this weekend

@rikatz rikatz added this to the v1.0.1 milestone Aug 21, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2021
@k8s-ci-robot
Contributor

@steinarvk-kolonial: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rikatz
Contributor

rikatz commented Sep 7, 2021

@steinarvk-kolonial thanks again for your patience. This is next in my queue (promise) and I expect to review it again by the end of the week.

You can rebase it so it would be easier to merge if everything is ok! :)

Thanks

@rikatz rikatz modified the milestones: v1.0.1, v1.0.2 Sep 14, 2021
@rikatz
Contributor

rikatz commented Sep 24, 2021

/kind bug
@steinarvk-kolonial let's put this one on the queue for v1.0.2? :D

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 24, 2021
@rikatz rikatz modified the milestones: v1.0.2, v1.0.3 Sep 26, 2021
@rikatz rikatz removed this from the v1.0.3 milestone Oct 10, 2021
@davideshay
Contributor

Any update on this? I'm seeing this problem as well.

@davideshay
Contributor

If a rebased PR would help, let me know -- I've fixed this in my own version today.

@iamNoah1
Contributor

iamNoah1 commented Feb 1, 2022

@steinarvk-kolonial gentle ping about whether you still want to follow up on this.

/assign
/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Feb 1, 2022
@davideshay
Contributor

@iamNoah1, not sure about steinarvk, but I've been using my own rebased version of this patch and it seems to work. I don't think I can submit that here, though -- would I need to open another PR? Let me know if I should do that.

@rikatz
Contributor

rikatz commented Apr 12, 2022

/close

@k8s-ci-robot
Contributor

@rikatz: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Development

Successfully merging this pull request may close these issues.

Nginx pod crashed when scaling up
7 participants