[WIP] fix potential issue with shutdown #4006
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ElvinEfendi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
```diff
@@ -170,16 +170,23 @@ func handleSigterm(ngx *controller.NGINXController, exit exiter) {
 	signal.Notify(signalChan, syscall.SIGTERM)
 	<-signalChan
 	klog.Info("Received SIGTERM, shutting down")
+	// 1. TODO(elvinefendi) at this point we should immediately fail readiness probe and sleep for some configurable time
```
For this, we should add a check on n.isShuttingDown in func (n *NGINXController) Check(_ *http.Request) error to start failing the health check.
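A minimal sketch of what that could look like, assuming an isShuttingDown field that is set once SIGTERM is received; the type stub and error message are illustrative, not the actual controller code:

```go
package shutdown

import (
	"fmt"
	"net/http"
)

// NGINXController is a stand-in with only the field this sketch needs.
type NGINXController struct {
	isShuttingDown bool
}

// Check is the healthz/readiness handler. Once the controller is shutting
// down it returns an error, so Kubernetes removes the pod from Service
// endpoints while NGINX keeps serving the connections it already has.
func (n *NGINXController) Check(_ *http.Request) error {
	if n.isShuttingDown {
		return fmt.Errorf("the ingress controller is shutting down")
	}
	// ...the existing checks (e.g. querying the NGINX status endpoint)
	// would remain here.
	return nil
}
```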
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@ElvinEfendi is this actually fixed in 0.26? In my tests the upstream proxy still sees errors because nginx refuses new connections immediately when the pod is terminated. As far as I can see, there is no window during which the health check fails but nginx still accepts new connections.
@aledbf yes I did; even with the hook, nginx stops accepting new connections immediately.
Yes, that is the intent. NGINX should not accept new connections, only drain the existing ones.
You mean the cloud LB right?
If that's the case, you are right, we don't set additional annotations. Please check https://gist.github.com/mgoodness/1a2926f3b02d8e8149c224d25cc57dc1
@yvespp not sure I understand what exactly the issue is.
@aledbf We are on prem and use an F5 as a TCP load balancer in front of the nginx pods.
I think what would help here is a configurable sleep in the wait-shutdown hook between stopping the controller (which fails the health check) and shutting down nginx. Nginx would still accept new connections until the F5 realizes the pod is down and stops sending new connections to it.
First, thanks for the context.
What you are proposing makes sense, i.e., a new flag for the time to sleep before shutting down NGINX, using the default readiness probe failure value (30 seconds). That said, such a change also requires increasing the default 300s to 330s. Please open a new issue with the comment you just posted and a link to this comment. Edit: the sleep goes here
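A rough sketch of where such a sleep could go, assuming a hypothetical flag plumbed into handleSigterm as a duration; the interface, flag, and method names below are illustrative, not the actual ingress-nginx code:

```go
package shutdown

import (
	"os"
	"os/signal"
	"syscall"
	"time"

	"k8s.io/klog"
)

// ingressController stands in for *controller.NGINXController in this sketch.
type ingressController interface {
	StopHealthz() // hypothetical: makes Check() start returning an error
	Stop() error  // stops NGINX, draining existing connections
}

// handleSigterm sketches the proposed ordering: fail the readiness probe
// first, wait for load balancers to notice, then shut NGINX down.
// gracePeriod would come from the new flag (e.g. ~30s, matching the default
// readiness probe failure window).
func handleSigterm(ngx ingressController, gracePeriod time.Duration, exit func(int)) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan
	klog.Info("Received SIGTERM, shutting down")

	// 1. Start failing the health/readiness check immediately.
	ngx.StopHealthz()

	// 2. Keep accepting traffic while upstream load balancers (kube-proxy,
	// an F5, a cloud LB, ...) react to the failing probe and stop sending
	// new connections to this pod.
	time.Sleep(gracePeriod)

	// 3. Only now stop NGINX and drain the remaining connections.
	exitCode := 0
	if err := ngx.Stop(); err != nil {
		klog.Errorf("Error during shutdown: %v", err)
		exitCode = 1
	}
	exit(exitCode)
}
```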
@yvespp also, pull requests are welcome :)
@ElvinEfendi done, thanks!
What this PR does / why we need it:
We have two layers of reverse proxies, where the second layer is ingress-nginx instances. Every time we deploy ingress-nginx, the first-layer proxies see a significant number of 502s (we handle millions of requests per minute).
I think the problem is that we fail readiness probes only after Nginx exits. Only once the readiness probe fails do the Kubernetes components proxying packets to ingress-nginx replicas start removing that particular replica from their upstream pool, and this takes time. During that window there is no Nginx running, but the pod can still receive requests, which fail with connection refused because nothing is listening on the ports after Nginx has shut down. We currently sleep after Nginx exits, but that does not help with anything and is redundant.
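For illustration, a rough sketch of that current ordering; the interface, sleep duration, and messages are approximations rather than the actual code:

```go
package shutdown

import (
	"os"
	"os/signal"
	"syscall"
	"time"

	"k8s.io/klog"
)

// stopper stands in for the controller in this sketch.
type stopper interface{ Stop() error }

// handleSigtermToday sketches the current ordering described above: NGINX is
// stopped first, and the sleep only happens afterwards, so it buys no
// draining time.
func handleSigtermToday(ngx stopper, exit func(int)) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan
	klog.Info("Received SIGTERM, shutting down")

	// NGINX stops here: from this point the pod refuses new connections,
	// even though its readiness probe is still passing.
	exitCode := 0
	if err := ngx.Stop(); err != nil {
		klog.Errorf("Error during shutdown: %v", err)
		exitCode = 1
	}

	// Sleeping only after NGINX has exited does not help: requests routed
	// to the pod during this window get "connection refused".
	time.Sleep(10 * time.Second)
	exit(exitCode)
}
```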
I've described inline in the code what I'd like to do in this PR.
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
Special notes for your reviewer: