
controller can't shutdown gracefully on >=0.18.0 #3533

Closed

nikopen opened this issue Dec 6, 2018 · 3 comments

Comments

nikopen commented Dec 6, 2018

Bug report:

There appears to be a stray SIGKILL during the post-SIGTERM cleanup tasks that prevents the controller from shutting down gracefully with exit code 0.

This was discovered after upgrading from 0.10 to 0.21. We run a custom controller that reconfigures an external on-premise load balancer whenever an ingress controller receives a SIGTERM, so that the instance stops receiving traffic until it comes back up again; this stopped working correctly after the upgrade.

Via git bisect and manual testing, we found that the commit responsible for the SIGKILL, introduced between 0.17.1 and 0.18, is 7210518.
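Roughly, the bisect loop looked like the following (the release tag names nginx-0.17.1 / nginx-0.18.0 and the build/redeploy step are assumptions here, not an exact transcript):

git bisect start
git bisect bad nginx-0.18.0      # first release showing the stray SIGKILL
git bisect good nginx-0.17.1     # last release that exits cleanly with 0
# build and deploy the controller at the checked-out commit, delete the pod,
# check whether the cleanup log lines are cut short, then mark it:
git bisect good    # or: git bisect bad
git bisect reset   # once git reports the first bad commit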

--- difference in log lines and strace after SIGTERM ---

0.17.1:

I1206 14:46:32.274258      13 main.go:158] Received SIGTERM, shutting down
I1206 14:46:32.274324      13 nginx.go:340] Shutting down controller queues
I1206 14:46:32.274416      13 status.go:119] updating status of Ingress rules (remove)
I1206 14:46:32.291974      13 status.go:129] leaving status update for next leader (2)
I1206 14:46:32.292018      13 nginx.go:348] Stopping NGINX process

(various leaderelection lines in between)

I1206 14:47:23.463667      13 nginx.go:361] NGINX process has stopped
I1206 14:47:23.463715      13 main.go:166] Handled quit, awaiting Pod deletion
I1206 14:47:24.200190      13 leaderelection.go:209] successfully renewed lease olympus/ingress-controller-leader-internal
I1206 14:47:28.052480      13 nginx.go:720] Handling connection from remote address  to local 
---pod exits---

>=0.18 (tested latest as well):

I1206 13:33:01.729053      14 main.go:163] Received SIGTERM, shutting down
I1206 13:33:01.729091      14 nginx.go:343] Shutting down controller queues
I1206 13:33:01.729114      14 status.go:200] updating status of Ingress rules (remove)
--- pod dies ---

Regarding the SIGKILL: it sometimes also happens on <=0.17.1, but there the controller mostly exits with 0. On >=0.18 it always gets a SIGKILL and only these three log lines.

On <=0.17.1 a SIGQUIT is sent to nginx itself, while on >=0.18 it is absent: the controller dies before that.

What you expected to happen:
ingress controller shutting down gracefully

How to reproduce it:
Shut down the ingress controller (delete the pod). It can be reproduced most of the time, but not always.
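A rough way to observe it (namespace and pod name below are placeholders):

kubectl -n ingress-nginx logs -f <controller-pod>       # terminal 1: follow the shutdown
kubectl -n ingress-nginx delete pod <controller-pod>    # terminal 2: trigger SIGTERM
# On affected versions the followed log stops right after
# "updating status of Ingress rules (remove)" instead of reaching
# "NGINX process has stopped" / "Handled quit, awaiting Pod deletion".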

NGINX Ingress controller version: >=0.18.0

Kubernetes version: 1.10.11

Environment:

  • hardware: VMware, on-premise
  • OS: Ubuntu 16.04
  • Kernel: 4.4.0-135-generic

nikopen commented Dec 6, 2018

/cc @aledbf @breunigs

breunigs commented Dec 7, 2018

Using the most recent version (ea89d27) with this patch

diff --git a/images/nginx/rootfs/entrypoint.sh b/images/nginx/rootfs/entrypoint.sh
index 46bb5e076..e8317012c 100755
--- a/images/nginx/rootfs/entrypoint.sh
+++ b/images/nginx/rootfs/entrypoint.sh
@@ -16,4 +16,4 @@
 
 set -e
 
-authbind --deep $@
+exec authbind --deep $@

I cannot reproduce the issue anymore. This is the non-verbose log output after deleting the pod, with a long-running connection still open:

I1207 10:36:21.341023      13 main.go:168] Received SIGTERM, shutting down
I1207 10:36:21.341050      13 nginx.go:349] Shutting down controller queues
I1207 10:36:21.341069      13 status.go:200] updating status of Ingress rules (remove)
I1207 10:36:21.356367      13 status.go:210] leaving status update for next leader (2)
I1207 10:36:21.356387      13 nginx.go:357] Stopping NGINX process
2018/12/07 10:36:21 [notice] 225#225: ModSecurity-nginx v1.0.0
2018/12/07 10:36:21 [notice] 225#225: signal process started
W1207 10:36:30.002950      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
W1207 10:36:45.025834      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
W1207 10:37:00.019484      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
I1207 10:37:14.443987      13 nginx.go:370] NGINX process has stopped
I1207 10:37:14.444015      13 main.go:176] Handled quit, awaiting Pod deletion
W1207 10:37:15.024281      13 nginx_status.go:207] unexpected error obtaining nginx status info: unexpected error scraping nginx status page: unexpected error scraping nginx : Get http://0.0.0.0:18080/nginx_status: dial tcp 0.0.0.0:18080: connect: connection refused
rpc error: code = Unknown desc = Error: No such container: 23c87326cb0ced03f58a9042d22f5f5d4aed0275b94fc82ba175c9d388590665

The long-running connection gets closed at around 10:37:14, after which nginx exits.
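For context on why the exec matters: this is the usual exec-the-final-command guidance for container entrypoints. Without exec the wrapper shell stays in the process tree between the container runtime and the controller, so termination is not handled (or waited out) by the controller itself and whatever is still running can be SIGKILLed mid-cleanup; with exec the controller replaces the shell, receives SIGTERM directly, and its exit code becomes the container's. A generic sketch of the pattern (not the repo's full entrypoint script):

#!/bin/sh
set -e

# without exec: the shell remains the tracked main process, and the controller
# can be killed before its post-SIGTERM cleanup finishes
#authbind --deep "$@"

# with exec: the controller replaces the shell, gets SIGTERM directly,
# and its exit status is the container's exit status
exec authbind --deep "$@"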

aledbf commented Dec 10, 2018

Closing. Fixed in master (#3536)
