Make graceful shut-down keep-alive behavior consistent #1236
@tilgovi do you need anything from me there?
If you want to "describe the intended behavior" that would be helpful, otherwise I'll propose it.
Sorry I missed your answer. Graceful shutdown only means we allow some time for requests to finish.
Speaking of keep-alive connections, I think we should stop the request loop when the signal is received instead of accepting any new requests. Thoughts?
@benoitc Considering a sync worker without threads, I believe no connections are aborted/lost other than the request currently in process, because connections are queued at the master process and not at the worker process level? If so, can you clarify what you mean by "all still running client connections are closed"? I assume you refer to threaded/async workers here (where multiple requests may be processed concurrently, compared to a sync worker without threads)?
@tuukkamustonen The master doesn't queue any connections; each worker is responsible for accepting a connection. Afaik connections are queued at the system level. When the master receives the HUP signal it notifies the workers, and they stop accepting new connections. Running connections (those already accepted) then get the graceful time to finish or be forcefully closed.
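(For illustration, a minimal pre-fork sketch of the model described above; this is not gunicorn's actual code. The kernel's listen backlog holds pending connections until a worker calls `accept()` itself.)

```python
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 8000))
sock.listen(128)  # pending connections queue here, at the system level

for _ in range(2):                 # the "master" forks two workers
    if os.fork() == 0:             # child: each worker accepts for itself
        while True:                # on HUP it would simply stop accepting
            conn, _ = sock.accept()
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
            conn.close()

for _ in range(2):
    os.wait()                      # the master just supervises the workers
```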
Ah, I wonder how that works / where it's instructed... well, no need to go that deep :)
Ok. This summarizes it nicely.
@tilgovi we probably should close that issue? |
I would like to keep this one open. I'm not convinced we have consistent behavior here yet. |
How to reproduce the problem:
Run apache benchmark:
... See > 4% failed requests, just due to restarted workers (in this case restarted by max-requests)
... See no failed requests (tried on gunicorn up to 20.0.0)
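(A hypothetical reconstruction of the elided commands, assuming a gthread worker and Apache Bench's keep-alive mode; the original flags were not shown.)

```shell
# Worker restarted periodically by --max-requests, hit with keep-alive traffic.
gunicorn --worker-class gthread --workers 2 --max-requests 100 app:app &

# ab with -k reuses connections; requests racing a worker restart can fail.
ab -k -n 10000 -c 20 http://127.0.0.1:8000/
```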
It's probably worth resolving the problem in the following way: in case of graceful shutdown on a keep-alive connection, try to serve one more request after the graceful shutdown request and send Connection: close in the response to force the sender not to reuse this socket for the next request; if no request arrives in a reasonable timeframe (i.e. 1s), just close the connection. Yes, there is a small possibility of a race (when the server decides to close just as the client sends a request),
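(A minimal sketch of this proposal; `handle_request` and `shutting_down` are hypothetical stand-ins, not gunicorn's API.)

```python
import select
import socket

GRACE_WINDOW = 1.0  # the "reasonable timeframe" suggested above

def drain_keepalive(conn: socket.socket, handle_request, shutting_down) -> None:
    # Normal keep-alive loop: serve requests until shutdown is requested.
    while not shutting_down():
        ready, _, _ = select.select([conn], [], [], 1.0)
        if ready:
            handle_request(conn, extra_headers=[])

    # Graceful shutdown: allow at most one more request, answered with
    # "Connection: close" so the client stops reusing this socket.
    ready, _, _ = select.select([conn], [], [], GRACE_WINDOW)
    if ready:
        handle_request(conn, extra_headers=[("Connection", "close")])
    conn.close()  # either answered the last request or timed out idle;
                  # the small race described above remains possible
```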
cc @tilgovi ^^ In fact there are two schools of thought here, imo.
I'm in favour of 2, which may be safer. Thoughts?
I am very much in favor of Option 2. That was the behavior I had assumed, and I've made some changes to this end, linked from #922. I don't know which workers implement all of these behaviors, but we should check:
Any other behaviors we should describe before we audit?
This is #922. I think it is done for all workers and the arbiter.
This is this ticket. We should make sure all workers do this. |
I think this is still done, but we have a new issue due to this at #1725. The same issue might exist for worker types other than eventlet.
I think this is now done for the threaded worker and the async workers in #2288, ebb41da and 4ae2a05.
I'm going to close this issue because I think it's mostly addressed now. I don't think the tornado worker is implementing graceful shutdown, but that can be a separate ticket. |
I've opened #2317 for Tornado and I'll close this. |
Probably I was not clear enough ... for a keep-alive connection there is no way to close the connection "safely". So, the only "safe" way will be either
The gevent and eventlet workers do not have any logic to close keep-alive connections during graceful shutdown. Instead, they have logic to force "Connection: close" on requests that happen during graceful shutdown. So, I believe it is already the case that they will send a "Connection: close" before actually closing the connection. There is always a possibility that a long request ends close enough to the graceful timeout deadline that the client never gets to send another request and discover "Connection: close" before the server closes the connection forcefully. I don't see any way to avoid that. Set a longer graceful timeout to handle this.
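(A sketch of that forced "Connection: close" behavior, assuming a `headers` list of name/value pairs; this is not the workers' actual code.)

```python
def finalize_headers(headers, shutting_down):
    # During graceful shutdown, override any keep-alive negotiation so the
    # client learns this socket will not be reused for further requests.
    if shutting_down:
        headers = [(name, value) for name, value in headers
                   if name.lower() != "connection"]
        headers.append(("Connection", "close"))
    return headers
```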
Re-opening until I can address issues in the eventlet and threaded workers. Just to reiterate, on graceful shutdown a worker should:
Right now, the eventlet worker cannot handle existing keep-alive connections because it fails on ... I'll work to get both PRs submitted this week. I apologize for not being able to do so sooner.
gunicorn = "20.0.4" In my case, when gunicorn master receives SIGHUP signal (sent by consul-template to reload refreshed secrets written in a file on local disk), it creates a new worker and gracefully shuts down old worker. However, during the transition from old to the new worker, http connections cached b/w client and old worker (keep-alive connections) are stale now and any request sent by client to the server that happen to use stale socket will hung and eventually timeout. Essentially, the threaded worker is not able to handle existing keep-alive requests. |
Hi @tilgovi, can this issue be closed?
There are still issues as documented in my last comment. |
Hi, |
No activity in a while, so closing. @tilgovi feel free to reopen if you still want to work on it :)
Following on from #922, the handling of keep-alive connections during graceful shutdown is not really specified anywhere and may not be consistent among workers.