
Fix buggy retry logic in syncIngress() #7086

Closed

Conversation


@steinarvk-oda steinarvk-oda commented Apr 28, 2021

What this PR does / why we need it:

NGINXController.syncIngress in internal/ingress/controller/controller.go is clearly intended to do retries with exponential backoff when calling configureDynamically(). However, due to what seems to be a bug, it won't do any retries. This bugfix contribution fixes that bug, so NGINXController.syncIngress will start doing retries according to the retry policy that seems to have been the original intent.

Per the documentation for wait.ExponentialBackoff, this function "stops and returns as soon as [...] the condition check returns true or an error". The previous implementation of the condition check always returned either true (success) or an error (failure) -- as such, retries would never actually trigger.
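For illustration, here is a minimal, self-contained sketch of the two condition-function shapes (simplified stand-in names, not the actual controller code; the short base Duration is only there to keep the demo fast):

    package main

    import (
        "errors"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Same shape of policy as the controller's, but with a short base
        // Duration so the demo finishes quickly.
        retry := wait.Backoff{Steps: 15, Duration: 10 * time.Millisecond, Factor: 1.3, Jitter: 0.1}

        // Stand-in for n.configureDynamically(pcfg) that always fails.
        attempt := func() error { return errors.New("connection refused") }

        // Broken shape: every outcome is either (true, nil) or (false, err),
        // so ExponentialBackoff returns after the very first failed attempt.
        errBroken := wait.ExponentialBackoff(retry, func() (bool, error) {
            if err := attempt(); err != nil {
                return false, err // returning an error stops the backoff loop
            }
            return true, nil
        })
        fmt.Println("broken shape:", errBroken) // fails immediately, no retries

        // Retrying shape: (false, nil) tells ExponentialBackoff to sleep and
        // call the condition again, up to Steps attempts.
        errRetrying := wait.ExponentialBackoff(retry, func() (bool, error) {
            if err := attempt(); err != nil {
                return false, nil // ask for another attempt
            }
            return true, nil
        })
        fmt.Println("retrying shape:", errRetrying) // times out only after 15 attempts
    }

The actual fix (quoted further down in the review) keeps the retrying shape but returns the underlying error once retries are exhausted, so the caller still sees why reconfiguration failed instead of a generic timeout.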

We've had issues with this "retry" loop on a real deployment on multiple occasions. It manifests as the nginx-controller failing to become healthy and getting stuck in a crashloop for minutes or hours, with errors akin to this:

Unexpected failure reconfiguring NGINX:
"requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"

The hope is that this bugfix will resolve this existing crash-on-startup bug by allowing more time (as much time as originally intended) for port 10246 to start listening.

Since the amount of time required will depend on configuration size, as noted in the pre-existing comment, I've also added a flag to control the number of retries.
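As a rough illustration of how such a flag could be wired up with pflag: the flag name, default, and parsing below are my own placeholders based on the n.cfg.DynamicConfigurationRetries field visible in the review diff further down, not necessarily what this PR ships.

    package main

    import (
        "fmt"

        "github.com/spf13/pflag"
    )

    func main() {
        flags := pflag.NewFlagSet("nginx-ingress-controller", pflag.ExitOnError)

        // Hypothetical flag; name and default are placeholders.
        dynamicConfigurationRetries := flags.Int("dynamic-configuration-retries", 15,
            "Number of times to retry a failed dynamic configuration before giving up.")

        _ = flags.Parse([]string{"--dynamic-configuration-retries=20"})

        // The parsed value would then size the backoff policy, e.g.
        //   retry.Steps = 1 + cfg.DynamicConfigurationRetries
        fmt.Println("retries:", *dynamicConfigurationRetries)
    }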

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Which issue/s this PR fixes

might fix #4742

Also referenced in the comments of this one: #4629 (comment)

How Has This Been Tested?

I ran "make test", and everything seemed to pass. (On my development Linux laptop.)

I don't have enough context in the code to know how to easily write a unit test for this. If the reviewers want to require one, pointers on how to set up a test to call a callback function from configureDynamically() would be appreciated.

Update: I attempted to test on a testing cluster. I added lots of ingresses (164 ingresses) and restarted the nginx-controller pods -- first with a control version, then with an image including this patch. I was able to reproduce the "connection refused" error as noted above and as such got to see the retries working. However, in the stress test conditions, even 15 retries were not enough and I ultimately got "Unexpected failure reconfiguring NGINX" again. I was also unable to reproduce the crash loop (both with the control version and the patch) -- both the control version and the patch came up after a few minutes on the second attempt.

The control version I used was: k8s.gcr.io/ingress-nginx/controller:v0.45.0@sha256:c4390c53f348c3bd4e60a5dd6a11c35799ae78c49388090140b9d72ccede1755

This testing attempt also uncovered a secondary bug: the 15 retries take only around 5 seconds in total because the Factor is set to 0.8, making this a rarely-seen example of "exponential backoff" where the retry delay gets shorter with each attempt. That's almost certainly not what was intended. I'll set it to 1.2 instead.
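To make the numbers concrete, here is a rough back-of-the-envelope of the total sleep budget for 15 steps at Duration = 1s, ignoring jitter (the ~5 seconds for factor 0.8 matches the observation above; the other rows show roughly what the proposed factors would allow):

    package main

    import "fmt"

    // totalBackoff sums the successive delays d, d*f, d*f^2, ... over n steps.
    func totalBackoff(d, factor float64, steps int) float64 {
        total, delay := 0.0, d
        for i := 0; i < steps; i++ {
            total += delay
            delay *= factor
        }
        return total
    }

    func main() {
        fmt.Printf("factor 0.8: ~%.1fs (delays shrink each attempt)\n", totalBackoff(1, 0.8, 15)) // ~4.8s
        fmt.Printf("factor 1.2: ~%.1fs\n", totalBackoff(1, 1.2, 15))                              // ~72s
        fmt.Printf("factor 1.3: ~%.1fs\n", totalBackoff(1, 1.3, 15))                              // ~167s
    }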

Update 2: tried again with the fixed backoff factor and overall more lax settings. In my stress test conditions, the controller came up after 67 seconds. I've suggested bumping the factor to 1.3 so that this apparently-realistic-if-overloaded scenario can finish successfully. Really, though -- perhaps the retry budget here should be somehow configurable? (Added a flag.)

Checklist:

  • My change requires a change to the documentation. (If we're adding a flag, should document it.)
  • I have updated the documentation accordingly. (Waiting to see whether reviewers think this warrants a flag. If so, please point me to the docs that should be updated.)
  • I've read the CONTRIBUTION guide
  • I have added tests to cover my changes.
  • All new and existing tests passed.

This code looks like it was intended to do retries with exponential
backoff, but due to a bug it seems like it did no retries at all.

Per the documentation for wait.ExponentialBackoff, this function
"stops and returns as soon as [...] the condition check returns true
or an error".

The previous implementation of the condition check always returned
either true (success) or an error (failure) -- as such, retries
would never actually trigger.

The new implementation instead checks whether there are retries
left (according to the intended policy of 15 retries). If there
are retries left, it logs the error but instead returns "false, nil",
which should trigger a retry from wait.ExponentialBackoff.

We've had issues with this "retry" loop on a real deployment on
multiple occasions; the nginx-controller failing to become healthy
in time and getting stuck in a crashloop for minutes or hours, with
errors akin to this:

    Unexpected failure reconfiguring NGINX:
    "requeuing" err="Post \"http://127.0.0.1:10246/configuration/backends\": dial tcp 127.0.0.1:10246: connect: connection refused" key="initial-sync"

The hope is that this bugfix will resolve this existing
crash-on-startup bug.
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2021
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 28, 2021
@k8s-ci-robot
Contributor

Welcome @steinarvk-kolonial!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 28, 2021
@k8s-ci-robot
Contributor

Hi @steinarvk-kolonial. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 28, 2021
@steinarvk-oda steinarvk-oda marked this pull request as ready for review April 28, 2021 19:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2021
@RmStorm

RmStorm commented Apr 28, 2021

uhm @steinarvk-kolonial did you notice that, in addition to never retrying, the backoff factor is smaller than 1? 🙈 I don't think it's supposed to be like that. It basically tries 15 times in 6 seconds, and it speeds up near the end. For this backoff behavior to actually kick in in any sensible way you probably want to set that to 1.1 or something.

As another suggestion... maybe set a cap and drop the jitter? The jitter seems kinda useless in this case, and a cap of 1 or 2 minutes sounds very sensible? https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff
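Concretely, something like the sketch below, using wait.Backoff's Cap and Jitter fields (the values are illustrative, not a tested recommendation):

    package main

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Capped, jitter-free variant of the retry policy (illustrative values).
        retry := wait.Backoff{
            Steps:    15,
            Duration: time.Second,
            Factor:   1.3,
            Jitter:   0,               // drop the jitter
            Cap:      2 * time.Minute, // bound the growing delay
        }
        fmt.Printf("%+v\n", retry)
    }

One caveat worth double-checking against the wait package docs: once a step would exceed Cap, the remaining Steps are zeroed, so with wait.ExponentialBackoff the cap effectively ends the retry loop rather than just clamping each delay.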

Factor 0.8 means that the retry delays get
shorter and shorter with each attempt, which
is almost certainly not intentional. A more
sensible value is 1.2.

1.2 was somewhat arbitrarily chosen. In my
stress-testing I observed this taking 67 seconds,
so I suggest putting the factor at a point that
at least comfortably allows for that.

Really, though -- perhaps it'd be better if
this were somehow configurable.

Since the appropriate number of retries could
depend on the configuration size, might as well
make this a flag.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 28, 2021
@rikatz
Contributor

rikatz commented Apr 29, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 29, 2021
@codecov-commenter

Codecov Report

Merging #7086 (981c52c) into master (bacd735) will increase coverage by 0.15%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7086      +/-   ##
==========================================
+ Coverage   55.82%   55.98%   +0.15%     
==========================================
  Files          94       94              
  Lines        6588     6593       +5     
==========================================
+ Hits         3678     3691      +13     
+ Misses       2449     2441       -8     
  Partials      461      461              
Impacted Files                                          Coverage Δ
internal/ingress/controller/nginx.go                    28.71% <ø> (ø)
internal/ingress/controller/store/store.go              58.94% <16.66%> (ø)
internal/ingress/controller/controller.go               46.53% <27.27%> (-0.15%) ⬇️
cmd/nginx/flags.go                                      81.90% <100.00%> (+0.27%) ⬆️
internal/ingress/types_equals.go                        17.96% <100.00%> (+3.80%) ⬆️
...ternal/ingress/annotations/globalratelimit/main.go   64.70% <0.00%> (-0.30%) ⬇️
internal/ingress/controller/template/template.go        76.82% <0.00%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ada2300...981c52c. Read the comment docs.

@steinarvk-oda
Author

Gentle ping, would be great to get some feedback on this. Are the automatically assigned reviewers the appropriate ones (and not overloaded)?

@rikatz
Contributor

rikatz commented Jun 27, 2021

I'll take a look
/cc

@k8s-ci-robot k8s-ci-robot requested a review from rikatz June 27, 2021 22:25
Contributor

@rikatz rikatz left a comment


Overall looks good to me, nice catch.

Please just take a look at the suggestion of changing the logic, but it's mostly a nit.

Also, I'm thinking: doesn't wait.Backoff plus the declared Steps already do that? So instead of implementing the internal counter, shouldn't we just be capturing the error and returning it later?

Factor: 0.8,
Steps: 1 + n.cfg.DynamicConfigurationRetries,
Duration: time.Second,
Factor: 1.3,
Contributor


Why change the Factor here?

Author


A factor less than 1 means the delays get shorter and shorter: e.g. first 1 second, then 800ms, then 640ms, and so on.

That's not usually what people mean by "exponential backoff", and it doesn't really achieve the same thing of adaptively polling for a process that may take a long time.

Since this was effectively dead code previously, and by the logic above the previous value doesn't seem to have been tested, I set a new value based on how long it took the controller to come up while stress-testing it.

Comment on lines +187 to 203
    retriesRemaining := retry.Steps
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        err := n.configureDynamically(pcfg)
        if err == nil {
            klog.V(2).Infof("Dynamic reconfiguration succeeded.")
            return true, nil
        }

        retriesRemaining--
        if retriesRemaining > 0 {
            klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
            return false, nil
        }

        klog.Warningf("Dynamic reconfiguration failed: %v", err)
        return false, err
    })
Contributor


Overall this looks good.

I would maybe go into a simpler logic like:

Suggested change
(current:)

    retriesRemaining := retry.Steps
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        err := n.configureDynamically(pcfg)
        if err == nil {
            klog.V(2).Infof("Dynamic reconfiguration succeeded.")
            return true, nil
        }
        retriesRemaining--
        if retriesRemaining > 0 {
            klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
            return false, nil
        }
        klog.Warningf("Dynamic reconfiguration failed: %v", err)
        return false, err
    })

(suggested:)

    retriesRemaining := retry.Steps
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        # We should early fail
        if retriesRemaining > 0 {
            retriesRemaining--
            err := n.configureDynamically(pcfg)
            if err == nil {
                klog.V(2).Infof("Dynamic reconfiguration succeeded.")
                return true, nil
            }
            klog.Warningf("Dynamic reconfiguration failed (retrying; %d retries left): %v", retriesRemaining, err)
            return false, nil
        }
        klog.Warningf("Dynamic reconfiguration failed: %v", err)
        return false, err
    }

Author


As I read the code, this is not quite equivalent because of the behaviour when the retries are exhausted. Suppose that retriesRemaining=1 and all attempts to configure will be failing.

The old code does: attempt to configure (and fail), decrement retriesRemaining to zero, don't enter branch, return the error from earlier (won't be retried).

The proposed code does: check retriesRemaining above zero, decrement retriesRemaining to zero, attempt to configure (and fail), return nil (will be retried), go back to the retry mechanics and sleep, then on next call: retriesRemaining is zero, return error [actually it's a little unclear which err you propose to return in this branch, because there's been no attempt to configure, so that "err" isn't in scope].

Besides the question of which error to return, we don't want to sleep for another backoff period (the final and longest one!) once it's already clear that the next call is going to immediately fail.

Open to suggestions for improvement here, but as you can see it's a little subtle.
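For comparison, here is a sketch of the "capture the error and return it later" shape, relying on retry.Steps instead of a manual counter (an illustration of the alternative being discussed, not code from the PR; it reuses retry, n and pcfg from the hunk above):

    var lastErr error
    err := wait.ExponentialBackoff(retry, func() (bool, error) {
        lastErr = n.configureDynamically(pcfg)
        if lastErr == nil {
            klog.V(2).Infof("Dynamic reconfiguration succeeded.")
            return true, nil
        }
        klog.Warningf("Dynamic reconfiguration failed: %v", lastErr)
        return false, nil // always ask to retry; ExponentialBackoff enforces retry.Steps
    })
    if err == wait.ErrWaitTimeout && lastErr != nil {
        err = lastErr // surface the real error rather than the generic timeout
    }

The trade-off is that the log line no longer says how many retries are left, and the failure warning is logged even on the final attempt.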

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: steinarvk-kolonial
To complete the pull request process, please assign rikatz after the PR has been reviewed.
You can assign the PR to them by writing /assign @rikatz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@steinarvk-oda steinarvk-oda requested a review from rikatz July 9, 2021 10:35
@steinarvk-oda
Author

Thanks so much for taking a look! Responded to your comments in the suggestion threads.

This actually hit us again yesterday, so would still be great to get merged.

@rikatz rikatz changed the base branch from master to main July 16, 2021 13:01
@rikatz
Contributor

rikatz commented Jul 28, 2021

@steinarvk-kolonial just to let you know I didn't forget about this one, I've just been pretty busy the last few days with the v1.0.0 release.

Will come back to this PR probably this weekend

@rikatz rikatz added this to the v1.0.1 milestone Aug 21, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2021
@k8s-ci-robot
Contributor

@steinarvk-kolonial: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rikatz
Contributor

rikatz commented Sep 7, 2021

@steinarvk-kolonial thanks again for your patience. This is next in my queue (promise) and I expect to review it again by the end of the week.

You can rebase it so it would be easier to merge if everything is ok! :)

Thanks

@rikatz rikatz modified the milestones: v1.0.1, v1.0.2 Sep 14, 2021
@rikatz
Contributor

rikatz commented Sep 24, 2021

/kind bug
@steinarvk-kolonial let's put this one on the queue for v1.0.2? :D

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 24, 2021
@rikatz rikatz modified the milestones: v1.0.2, v1.0.3 Sep 26, 2021
@rikatz rikatz removed this from the v1.0.3 milestone Oct 10, 2021
@davideshay
Contributor

Any update on this? I'm seeing this problem as well.

@davideshay
Contributor

If a rebased PR would help, let me know -- I've fixed this in my own version today.

@iamNoah1
Contributor

iamNoah1 commented Feb 1, 2022

@steinarvk-kolonial gentle ping about whether you still want to follow up on this.

/assign
/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Feb 1, 2022
@davideshay
Contributor

@iamNoah1, not sure about steinarvk, but I've been using my own rebased version of this patch and it seems to work. I don't think I can submit that here, though -- would I need to open another PR? Let me know if I should do that.

@rikatz
Contributor

rikatz commented Apr 12, 2022

/close

@k8s-ci-robot
Contributor

@rikatz: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Development

Successfully merging this pull request may close these issues.

Nginx pod crashed when scaling up
7 participants