
controller can't shutdown gracefully on >=0.18.0 #3533

Closed

nikopen opened this issue Dec 6, 2018 · 3 comments

Comments

nikopen commented Dec 6, 2018

Bug report:

There appears to be a stray SIGKILL during the post-SIGTERM cleanup tasks that prevents the controller from shutting down gracefully with exit code 0.

This was discovered after upgrading from 0.10 to 0.21. We run a custom controller that reconfigures an external on-premise load balancer whenever an ingress controller receives a SIGTERM, so that the instance stops receiving traffic until it comes back up again; this stopped working correctly after the upgrade.

Via git bisect and manual testing, we found that the commit responsible for the SIGKILL, introduced between 0.17.1 and 0.18, is 7210518.
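Roughly, the bisect loop looked like the following (the release tag names nginx-0.17.1 / nginx-0.18.0 and the build/redeploy step are assumptions here, not an exact transcript):

git bisect start
git bisect bad nginx-0.18.0      # first release showing the stray SIGKILL
git bisect good nginx-0.17.1     # last release that exits cleanly with 0
# build and deploy the controller at the checked-out commit, delete the pod,
# check whether the cleanup log lines are cut short, then mark it:
git bisect good    # or: git bisect bad
git bisect reset   # once git reports the first bad commit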

--- difference in log lines and strace after SIGTERM ---

0.17.1:

I1206 14:46:32.274258      13 main.go:158] Received SIGTERM, shutting down
I1206 14:46:32.274324      13 nginx.go:340] Shutting down controller queues
I1206 14:46:32.274416      13 status.go:119] updating status of Ingress rules (remove)
I1206 14:46:32.291974      13 status.go:129] leaving status update for next leader (2)
I1206 14:46:32.292018      13 nginx.go:348] Stopping NGINX process

(various leaderelection lines in between)

I1206 14:47:23.463667      13 nginx.go:361] NGINX process has stopped
I1206 14:47:23.463715      13 main.go:166] Handled quit, awaiting Pod deletion
I1206 14:47:24.200190      13 leaderelection.go:209] successfully renewed lease olympus/ingress-controller-leader-internal
I1206 14:47:28.052480      13 nginx.go:720] Handling connection from remote address  to local 
---pod exits---

>=0.18 (tested latest as well):

I1206 13:33:01.729053      14 main.go:163] Received SIGTERM, shutting down
I1206 13:33:01.729091      14 nginx.go:343] Shutting down controller queues
I1206 13:33:01.729114      14 status.go:200] updating status of Ingress rules (remove)
--- pod dies ---

Regarding the SIGKILL: it sometimes also happens on <=0.17.1, but there the controller mostly exits with 0. On >=0.18 it always gets a SIGKILL and only these three log lines.

On <=0.17.1 a SIGQUIT is sent to nginx itself, while on >=0.18 it is absent: the controller dies before that.

What you expected to happen:
ingress controller shutting down gracefully

How to reproduce it:
Shut down the ingress controller (delete the pod). It can be reproduced most of the time, but not always.
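A rough way to observe it (namespace and pod name below are placeholders):

kubectl -n ingress-nginx logs -f <controller-pod>       # terminal 1: follow the shutdown
kubectl -n ingress-nginx delete pod <controller-pod>    # terminal 2: trigger SIGTERM
# On affected versions the followed log stops right after
# "updating status of Ingress rules (remove)" instead of reaching
# "NGINX process has stopped" / "Handled quit, awaiting Pod deletion".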

NGINX Ingress controller version: >=0.18.0

Kubernetes version: 1.10.11

Environment:

  • hardware: VMware, on-premise
  • OS: Ubuntu 16.04
  • Kernel: 4.4.0-135-generic

nikopen commented Dec 6, 2018

/cc @aledbf @breunigs

breunigs commented Dec 7, 2018

Using the most recent version (ea89d27) with this patch

diff --git a/images/nginx/rootfs/entrypoint.sh b/images/nginx/rootfs/entrypoint.sh
index 46bb5e076..e8317012c 100755
--- a/images/nginx/rootfs/entrypoint.sh
+++ b/images/nginx/rootfs/entrypoint.sh
@@ -16,4 +16,4 @@
 
 set -e
 
-authbind --deep $@
+exec authbind --deep $@

I cannot reproduce the issue anymore. This is the non-verbose log output after deleting the pod, with a long-running connection still open:

I1207 10:36:21.341023      13 main.go:168] Received SIGTERM, shutting down
I1207 10:36:21.341050      13 nginx.go:349] Shutting down controller queues
I1207 10:36:21.341069      13 status.go:200] updating status of Ingress rules (remove)
I1207 10:36:21.356367      13 status.go:210] leaving status update for next leader (2)
I1207 10:36:21.356387      13 nginx.go:357] Stopping NGINX process
2018/12/07 10:36:21 [notice] 225#225: ModSecurity-nginx v1.0.0
2018/12/07 10:36:21 [notice] 225#225: signal process started
W1207 10:36:30.002950      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
W1207 10:36:45.025834      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
W1207 10:37:00.019484      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
I1207 10:37:14.443987      13 nginx.go:370] NGINX process has stopped
I1207 10:37:14.444015      13 main.go:176] Handled quit, awaiting Pod deletion
W1207 10:37:15.024281      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
rpc error: code = Unknown desc = Error: No such container: 23c87326cb0ced03f58a9042d22f5f5d4aed0275b94fc82ba175c9d388590665

The long-running connection gets closed at around 10:37:14, after which nginx exits.
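For context on why the exec matters: this is the usual exec-the-final-command guidance for container entrypoints. Without exec the wrapper shell stays in the process tree between the container runtime and the controller, so termination is not handled (or waited out) by the controller itself and whatever is still running can be SIGKILLed mid-cleanup; with exec the controller replaces the shell, receives SIGTERM directly, and its exit code becomes the container's. A generic sketch of the pattern (not the repo's full entrypoint script):

#!/bin/sh
set -e

# without exec: the shell remains the tracked main process, and the controller
# can be killed before its post-SIGTERM cleanup finishes
#authbind --deep "$@"

# with exec: the controller replaces the shell, gets SIGTERM directly,
# and its exit status is the container's exit status
exec authbind --deep "$@"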

aledbf commented Dec 10, 2018

Closing. Fixed in master (#3536)
