Graceful shutdown for tornado worker #2317
The proposed solution will not help in the case of keep-alive sockets: see proof in comment #1236 (comment). To shut down a keep-alive socket gracefully, we should do either one of:

1. first respond to a request with Connection: close (downgrade keep-alive to a close connection), then close the socket, or
2. catch a window when the client has not yet sent its next request (to finish closing before the client fires the next request, otherwise that request will fail, which is bad).
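To make option 1 concrete, here is a minimal raw-socket sketch of the header downgrade; `respond_and_close` and its arguments are illustrative, not gunicorn internals:

```python
import socket

def respond_and_close(sock: socket.socket, body: bytes) -> None:
    # finish the in-flight request, but tell the client not to reuse
    # the connection (downgrade keep-alive to close)
    sock.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"Connection: close\r\n"
        b"\r\n" + body
    )
    # only now is it safe to close without failing the client's next request
    sock.close()
```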
I don't think the spec expects the server to gracefully close such a socket if no request has started; the client should normally expect a timeout. Do you have any example of a client that breaks with the current behavior?
How to reproduce the problem: run gunicorn with keepalive enabled (a trivial app returning some data), then:

```
$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/
```

More than 4% of requests fail, purely due to restarted workers (in this case restarted by max-requests). Run gunicorn without keepalive (same app) and no requests fail. The errors in the first case are caused by the server closing a keep-alive socket (due to max-requests) while the next request is in progress. A hedged sketch of such a setup follows below.
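For concreteness, a minimal sketch of a reproduction setup under the assumptions above; the module name, worker class, thread count, and max-requests value are illustrative, not taken from the thread:

```python
# app.py - trivial WSGI app returning some data
def application(environ, start_response):
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

```
# keepalive-capable worker, with workers recycled often enough to trigger failures
$ gunicorn -k gthread --threads 4 --keep-alive 5 --max-requests 100 app:application
$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/
```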
As I answered in #1236 (comment), the design is to send "Connection: close" once the server is shutting down and to wait for the graceful timeout before forcing the connection closed. I'm able to reproduce what you see. Based on the distribution of request times, it seems impossible that connections are getting closed during the keepalive timeout; instead they must be getting closed by the shutdown before they receive "Connection: close". I'm not able to reproduce this with gevent, but I can reproduce it with eventlet and the threaded worker. I'll see if I can narrow down a bug in either of those.
I've narrowed down the issue in eventlet to a failing …
Alright, I've isolated the bug for the threaded worker as well. I'll file issues or PRs to fix both tomorrow.
Here's the gist if you want to check my work and try it out before I get a chance to look at it with eyes that have slept:

```diff
diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index d531811..ed70dcb 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -220,13 +220,37 @@ class ThreadWorker(base.Worker):
             # handle keepalive timeouts
             self.murder_keepalived()
 
-        self.tpool.shutdown(False)
-        self.poller.close()
-
+        # stop accepting new connections
         for s in self.sockets:
+            self.poller.unregister(s)
             s.close()
 
-        futures.wait(self.futures, timeout=self.cfg.graceful_timeout)
+        # handle existing connections until the graceful timeout
+        deadline = time.time() + self.cfg.graceful_timeout
+        while self.nr_conns:
+            # notify the arbiter we are alive
+            self.notify()
+
+            timeout = deadline - time.time()
+            if timeout <= 0:
+                break
+
+            # clean up finished requests
+            result = futures.wait(self.futures, timeout=0)
+            for fut in result.done:
+                self.futures.remove(fut)
+
+            # handle keepalive requests
+            events = self.poller.select(1.0)
+            for key, _ in events:
+                callback = key.data
+                callback(key.fileobj)
+
+            # handle keepalive timeouts
+            self.murder_keepalived()
+
+        self.tpool.shutdown(False)
+        self.poller.close()
 
     def finish_request(self, fs):
         if fs.cancelled():
@@ -238,7 +262,7 @@ class ThreadWorker(base.Worker):
             (keepalive, conn) = fs.result()
             # if the connection should be kept alived add it
             # to the eventloop and record it
-            if keepalive and self.alive:
+            if keepalive:
                 # flag the socket as non blocked
                 conn.sock.setblocking(False)
```

The first change ensures that the worker continues to handle in-progress keepalive requests while shutting down. The second chunk fixes a bug that prevented keepalive requests started before the shutdown began from handling their next request and receiving the "Connection: close" they need to shut down gracefully.
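As an aside, here is a minimal sketch of the poller/callback pattern the shutdown loop above relies on, using the stdlib selectors module (the gthread worker uses the same idea); the names `on_readable` and `srv` are illustrative, not the worker's actual attributes:

```python
import selectors
import socket

poller = selectors.DefaultSelector()

def on_readable(sock: socket.socket) -> None:
    # in the worker, this would re-enqueue a keepalive connection for handling
    print("readable:", sock)

# register a listening socket with the callback stored in the key's data slot
srv = socket.create_server(("127.0.0.1", 0))
poller.register(srv, selectors.EVENT_READ, data=on_readable)

# one iteration of the loop: wait up to 1 second, then dispatch callbacks
for key, _mask in poller.select(1.0):
    key.data(key.fileobj)  # key.data holds the callback, as in the diff
```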
Following from #1236, the tornado worker should behave like the other workers. On graceful shutdown it should:

- stop accepting new connections,
- send "Connection: close" on open keep-alive connections so clients stop reusing them, and
- wait up to the graceful timeout before forcing connections closed.

I think that the call to `server.stop()` already stops it from accepting, but I don't know if it closes the listener socket, and I don't see any graceful timeout handling. We should also double-check that calling `server.stop()` prevents open connections from being reused for keep-alive requests.
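For illustration, a hypothetical sketch of the sequence this issue asks for, assuming Tornado's `HTTPServer.stop()` stops the listener and that `close_all_connections()` can be awaited to drain in-flight connections; this is not the worker's actual code:

```python
# Hypothetical sketch, not gunicorn's tornado worker: drain a
# tornado.httpserver.HTTPServer within a graceful timeout.
import asyncio
from tornado.httpserver import HTTPServer

async def graceful_stop(server: HTTPServer, graceful_timeout: float) -> None:
    # stop accepting new connections (whether this closes the listener
    # socket is exactly what the issue asks us to verify)
    server.stop()
    try:
        # wait for open connections to finish, bounded by the same
        # graceful timeout the other workers honor
        await asyncio.wait_for(server.close_all_connections(), graceful_timeout)
    except asyncio.TimeoutError:
        # deadline reached: remaining connections get forced closed
        pass
```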