[WIP] fix potential issue with shutdown #4006

Closed
wants to merge 1 commit into from

Conversation

ElvinEfendi
Member

What this PR does / why we need it:

We have two layers of reverse proxies, where the second layer is ingress-nginx instances. Every time we deploy ingress-nginx, the first-layer proxies see a significant number of 502s (I should mention that we handle millions of requests per minute).

I think the problem is that we fail readiness probes only after Nginx exits. Only once the readiness probe fails do the K8s components proxying packets to ingress-nginx replicas start removing that particular replica from their upstream pool, and this takes time. During that window there is no Nginx running, but the pod can still receive requests, which fail with connection refused because nothing is listening on the ports after Nginx has shut down. We currently sleep after Nginx exits, but this does not help with anything and is redundant.

Inline with the code, I described what I'd like to do in this PR.
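To make the intended ordering concrete, here is a minimal sketch of the shutdown sequence described above. The helper names and the 30-second value are illustrative assumptions, not the actual ingress-nginx code:

package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// markNotReady and stopNginx are hypothetical stand-ins for "start failing the
// readiness probe" and "gracefully stop the NGINX process".
func markNotReady() { log.Println("readiness probe now failing") }
func stopNginx()    { log.Println("stopping NGINX") }

func handleSigterm(gracePeriod time.Duration) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan
	log.Println("Received SIGTERM, shutting down")

	// 1. Fail the readiness probe immediately so kube-proxy and upstream
	//    load balancers stop sending new traffic to this replica.
	markNotReady()

	// 2. Sleep for a configurable period so they have time to remove the
	//    replica from their upstream pools.
	time.Sleep(gracePeriod)

	// 3. Only then stop NGINX, letting in-flight requests drain.
	stopNginx()
	os.Exit(0)
}

func main() {
	handleSigterm(30 * time.Second)
}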

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 12, 2019
@k8s-ci-robot k8s-ci-robot requested review from aledbf and bowei April 12, 2019 19:46
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 12, 2019
@ElvinEfendi ElvinEfendi changed the title [WIP] nothing to see yet [WIP] fix potential issue with shutdown Apr 12, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2019
@@ -170,16 +170,23 @@ func handleSigterm(ngx *controller.NGINXController, exit exiter) {
signal.Notify(signalChan, syscall.SIGTERM)
<-signalChan
klog.Info("Received SIGTERM, shutting down")
// 1. TODO(elvinefendi) at this point we should immediately fail readiness probe and sleep for some configurable time
Member

@aledbf aledbf Apr 12, 2019

For this, we should add a check in

func (n *NGINXController) Check(_ *http.Request) error {
checking n.isShuttingDown so that the health check starts failing during shutdown
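Something along these lines, assuming the controller already tracks an isShuttingDown flag that is set when SIGTERM arrives (a sketch of the suggested change, not the actual implementation; the error message is illustrative and the usual imports such as fmt and net/http from the existing file are assumed):

func (n *NGINXController) Check(_ *http.Request) error {
	if n.isShuttingDown {
		// Start failing the health/readiness check as soon as shutdown begins,
		// so the endpoint is drained from load-balancer pools before NGINX exits.
		return fmt.Errorf("the ingress controller is shutting down")
	}

	// ... the existing health checks against the local NGINX status endpoint
	// would remain here unchanged ...
	return nil
}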

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 10, 2019
@yvespp

yvespp commented Oct 30, 2019

@ElvinEfendi is this actually fixed in 0.26? In my tests the upstream proxy still sees errors because nginx refuses new connections immediately when the pod is terminated.
The upstream proxy calls the nginx health check; this check has to fail 3 times before the proxy removes that nginx instance from the pool and stops routing new connections to it.

As far as I can see, there is no defined window during which the health check fails but nginx still accepts new connections.

@yvespp

yvespp commented Oct 30, 2019

@aledbf yes I did; even with the hook, nginx stops accepting new connections immediately.

@aledbf
Member

aledbf commented Oct 30, 2019

even with the hook, nginx stops accepting new connections immediately.

Yes, that is the intent. NGINX should not accept new connections, only drain the existing ones.

The upstream proxy calls the nginx health check; this check has to fail 3 times before the proxy removes that nginx instance from the pool and stops routing new connections to it.

You mean the cloud LB, right?

As far as I can see, there is no defined window during which the health check fails but nginx still accepts new connections.

You mean the cloud LB, right? If that's the case, you are right: we don't set additional annotations. Please check https://gist.github.com/mgoodness/1a2926f3b02d8e8149c224d25cc57dc1

@aledbf
Member

aledbf commented Oct 30, 2019

@yvespp not sure I understand what exactly the issue is

@yvespp

yvespp commented Oct 30, 2019

@aledbf We are on-prem and use an F5 as a TCP load balancer in front of the nginx pods:
client -> F5 TCP LB -> nginx Pods

  1. The F5 checks the nginx health check every 5 seconds.
  2. An nginx pod starts terminating and immediately stops accepting new connections.
  3. The F5 still sends new connections to that nginx pod for up to 5 seconds because the health check has not failed yet.

I think what would help here is a configurable sleep in the wait-shutdown hook between stopping the controller (which fails the health check) and shutting down nginx. Nginx would still accept new connections until the F5 realizes that it is down and stops sending new connections to it.

@aledbf
Member

aledbf commented Oct 30, 2019

First, thanks for the context.

Nginx would still accept new connections until the F5 realizes that it is down and stops sending new connections to it.

What you are proposing makes sense, i.e., a new flag for the time to sleep before shutting down NGINX, defaulting to the readiness probe failure value (30 seconds). That said, such a change also requires increasing the default 300s to 330s.

Please open a new issue with the comment you just posted and a link to this comment.

Edit: the sleep goes here

klog.Info("Stopping NGINX process")
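In other words, something like the following, where the configuration field name is hypothetical (a sketch of where a flag-backed delay could sit, not the actual code):

// Just before the existing log line in the controller's shutdown path:
// sleep long enough for external health checks to notice the failing probe.
time.Sleep(n.cfg.ShutdownGracePeriod) // hypothetical flag-backed value, e.g. 30s

klog.Info("Stopping NGINX process")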

@aledbf
Member

aledbf commented Oct 30, 2019

@yvespp also, pull requests are welcome :)

@yvespp

yvespp commented Oct 30, 2019

@ElvinEfendi done, thanks!
