[WIP] fix potential issue with shutdown #4006

Closed
wants to merge 1 commit into from

Conversation

ElvinEfendi
Member

What this PR does / why we need it:

We have two layers of reverse proxies, where the second layer is ingress-nginx instances. Every time we deploy ingress-nginx, the first-layer proxies see a significant number of 502s (I should mention that we handle millions of requests per minute).

I think the problem is that we fail readiness probes only after Nginx exits. Only once the readiness probe fails do the K8s components proxying packets to ingress-nginx replicas start removing that particular replica from their upstream pool, and this takes time. During that window there is no Nginx running, but the pod can still receive requests, which fail with connection refused because nothing is listening on the ports after Nginx has shut down. We currently sleep after Nginx exits, but this does not help with anything and is redundant.

Inline with the code, I described what I'd like to do in this PR.
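To make the intended ordering concrete, here is a minimal sketch of the shutdown sequence described above. The helper names and the 30-second value are illustrative assumptions, not the actual ingress-nginx code:

package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// markNotReady and stopNginx are hypothetical stand-ins for "start failing the
// readiness probe" and "gracefully stop the NGINX process".
func markNotReady() { log.Println("readiness probe now failing") }
func stopNginx()    { log.Println("stopping NGINX") }

func handleSigterm(gracePeriod time.Duration) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan
	log.Println("Received SIGTERM, shutting down")

	// 1. Fail the readiness probe immediately so kube-proxy and upstream
	//    load balancers stop sending new traffic to this replica.
	markNotReady()

	// 2. Sleep for a configurable period so they have time to remove the
	//    replica from their upstream pools.
	time.Sleep(gracePeriod)

	// 3. Only then stop NGINX, letting in-flight requests drain.
	stopNginx()
	os.Exit(0)
}

func main() {
	handleSigterm(30 * time.Second)
}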

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 12, 2019
@k8s-ci-robot k8s-ci-robot requested review from aledbf and bowei April 12, 2019 19:46
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 12, 2019
@ElvinEfendi ElvinEfendi changed the title [WIP] nothing to see yet [WIP] fix potential issue with shutdown Apr 12, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2019
@@ -170,16 +170,23 @@ func handleSigterm(ngx *controller.NGINXController, exit exiter) {
signal.Notify(signalChan, syscall.SIGTERM)
<-signalChan
klog.Info("Received SIGTERM, shutting down")
// 1. TODO(elvinefendi) at this point we should immediately fail readiness probe and sleep for some configurable time
Member

@aledbf aledbf Apr 12, 2019

For this, we should add a check in

func (n *NGINXController) Check(_ *http.Request) error {
checking n.isShuttingDown so that the health check starts failing during shutdown
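Something along these lines, assuming the controller already tracks an isShuttingDown flag that is set when SIGTERM arrives (a sketch of the suggested change, not the actual implementation; the error message is illustrative and the usual imports such as fmt and net/http from the existing file are assumed):

func (n *NGINXController) Check(_ *http.Request) error {
	if n.isShuttingDown {
		// Start failing the health/readiness check as soon as shutdown begins,
		// so the endpoint is drained from load-balancer pools before NGINX exits.
		return fmt.Errorf("the ingress controller is shutting down")
	}

	// ... the existing health checks against the local NGINX status endpoint
	// would remain here unchanged ...
	return nil
}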

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 10, 2019
@yvespp

yvespp commented Oct 30, 2019

@ElvinEfendi is this actually fixed in 0.26? In my tests the upstream proxy still sees errors because nginx refuses new connections immediately when the pod is terminated.
The upstream proxy calls the nginx health check; this check has to fail 3 times before the proxy removes that nginx instance from the pool and stops routing new connections to it.

As far as I can see, there is no defined window during which the health check fails but nginx still accepts new connections.

@yvespp

yvespp commented Oct 30, 2019

@aledbf yes I did; even with the hook, nginx stops accepting new connections immediately.

@aledbf
Member

aledbf commented Oct 30, 2019

even with the hook, nginx stops accepting new connections immediately.

Yes, that is the intent. NGINX should not accept new connections, only drain the existing ones.

The upstream proxy calls the nginx health check; this check has to fail 3 times before the proxy removes that nginx instance from the pool and stops routing new connections to it.

You mean the cloud LB, right?

As far as I can see, there is no defined window during which the health check fails but nginx still accepts new connections.

You mean the cloud LB, right? If that's the case, you are right: we don't set additional annotations. Please check https://gist.github.com/mgoodness/1a2926f3b02d8e8149c224d25cc57dc1

@aledbf
Member

aledbf commented Oct 30, 2019

@yvespp not sure I understand what exactly the issue is

@yvespp

yvespp commented Oct 30, 2019

@aledbf We are on-prem and use an F5 as a TCP load balancer in front of the nginx pods:
client -> F5 TCP LB -> nginx Pods

  1. The F5 checks the nginx health check every 5 seconds.
  2. An nginx pod starts terminating and immediately stops accepting new connections.
  3. The F5 still sends new connections to that nginx pod for up to 5 seconds because the health check has not failed yet.

I think what would help here is a configurable sleep in the wait-shutdown hook between stopping the controller (which fails the health check) and shutting down nginx. Nginx would still accept new connections until the F5 realizes that it is down and stops sending new connections to it.

@aledbf
Member

aledbf commented Oct 30, 2019

First, thanks for the context.

Nginx would still accept new connections until the F5 realizes that it is down and stops sending new connections to it.

What you are proposing makes sense, i.e., a new flag for the time to sleep before shutting down NGINX, defaulting to the readiness probe failure value (30 seconds). That said, such a change also requires increasing the default 300s to 330s.

Please open a new issue with the comment you just posted and a link to this comment.

Edit: the sleep goes here

klog.Info("Stopping NGINX process")
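In other words, something like the following, where the configuration field name is hypothetical (a sketch of where a flag-backed delay could sit, not the actual code):

// Just before the existing log line in the controller's shutdown path:
// sleep long enough for external health checks to notice the failing probe.
time.Sleep(n.cfg.ShutdownGracePeriod) // hypothetical flag-backed value, e.g. 30s

klog.Info("Stopping NGINX process")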

@aledbf
Member

aledbf commented Oct 30, 2019

@yvespp also, pull requests are welcome :)

@yvespp

yvespp commented Oct 30, 2019

@ElvinEfendi done, thanks!
