Shutting down Mercure server takes indefinitely long #986

Open
bobvandevijver opened this issue Nov 21, 2024 · 3 comments

Comments

@bobvandevijver

I was looking into a very slow shutdown of the Mercure instance I am running on an Ubuntu server, using a systemd unit.

Unit file
[Unit]
Description=Mercure server
After=network.target

[Service]
Type=simple
EnvironmentFile=/opt/application/mercure/env
ExecStart=/opt/application/mercure/mercure run
Restart=always
RestartSec=5
RuntimeMaxSec=7d
WorkingDirectory=/opt/application/mercure
User=www-data

[Install]
WantedBy=multi-user.target
Logs
Nov 21 20:15:40 linux645 mercure[1197]: {"level":"info","ts":1732216540.629172,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
Nov 21 20:15:40 linux645 mercure[1197]: {"level":"warn","ts":1732216540.6292357,"msg":"exiting; byeee!! �<9F><91><8B>","signal":"SIGTERM"}
Nov 21 20:15:40 linux645 mercure[1197]: {"level":"info","ts":1732216540.6292503,"logger":"http","msg":"servers shutting down with eternal grace period"}
Nov 21 20:15:40 linux645 systemd[1]: Stopping mercure.service - Mercure server...
Nov 21 20:16:19 linux645 mercure[1197]: {"level":"info","ts":1732216579.8925092,"logger":"http.handlers.mercure","msg":"Subscriber disconnected","subscriber":{"id":"urn:uuid:adce74f5-1f27-4912-91b6-83cd4442f0a>
Nov 21 20:16:19 linux645 mercure[1197]: {"level":"info","ts":1732216579.89257,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"55950","client_ip":"127.0>
Nov 21 20:16:31 linux645 mercure[1197]: {"level":"info","ts":1732216591.1713917,"logger":"http.handlers.mercure","msg":"Subscriber disconnected","subscriber":{"id":"urn:uuid:5db2a139-2a36-4f4d-a7e9-e4b1ed7d6fb>
Nov 21 20:16:31 linux645 mercure[1197]: {"level":"info","ts":1732216591.17145,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"37544","client_ip":"127.0>
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: State 'stop-sigterm' timed out. Killing.
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Killing process 1197 (mercure) with signal SIGKILL.
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Failed to kill control group /system.slice/mercure.service, ignoring: Invalid argument
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Main process exited, code=killed, status=9/KILL
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Failed with result 'timeout'.

From the logs it became clear that the default shutdown action (SIGTERM), while received perfectly fine by caddy/mercure, didn't really do anything except deny new connections. After 90 seconds, systemd had had enough and used SIGKILL to get rid of the process.
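
For reference, those 90 seconds correspond to systemd's default stop timeout. It could be raised per unit, but that would only postpone the SIGKILL rather than fix the hang. A sketch (TimeoutStopSec=120 is just an illustrative value, not part of my actual unit):

[Service]
# Give the process more time before SIGKILL; this does not solve the
# eternal grace period, it only delays the kill.
TimeoutStopSec=120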

One line caught my attention:

servers shutting down with eternal grace period

This is described in the Caddy manual under the grace_period option and allows clients to complete their requests before the server is actually shut down. However, with SSE and long-running connections this does not really make sense, as those clients will keep the connection open.

I have now added an additional env variable to my configuration:

GLOBAL_OPTIONS=grace_period 1s
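
For context, the bundled Caddyfile expands this environment variable inside Caddy's global options block, so the effective configuration should be roughly the following (a sketch, not the full bundled file):

{
    grace_period 1s
}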

With this in place Mercure shuts down immediately, although with some errors in the log:

Logs with 1s `grace_period`
Nov 21 20:41:43 linux645 mercure[46014]: {"level":"info","ts":1732218103.2779043,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
Nov 21 20:41:43 linux645 mercure[46014]: {"level":"warn","ts":1732218103.2779415,"msg":"exiting; byeee!! 👋","signal":"SIGTERM"}
Nov 21 20:41:43 linux645 mercure[46014]: {"level":"info","ts":1732218103.277979,"logger":"http","msg":"servers shutting down; grace period initiated","duration":1}
Nov 21 20:41:43 linux645 systemd[1]: Stopping nl-mercure.service - Mercure server...
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"error","ts":1732218104.278682,"logger":"http","msg":"server shutdown","error":"context deadline exceeded","addresses":[":3000"]}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"warn","ts":1732218104.2788424,"logger":"http.handlers.mercure","msg":"Failed to remove subscriber on shutdown","subscriber":{},"error":"hub: read/write on closed Transport"}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2788796,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"warn","ts":1732218104.2788115,"logger":"http.handlers.mercure","msg":"Failed to remove subscriber on shutdown","subscriber":{},"error":"hub: read/write on closed Transport"}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.27891,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.27891,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"46232","client_ip":"127.0.0.1","proto":"HTTP/1.1","method":"GET","host":"localhost:3000","uri":"/.well-known/mercure?lastEventID=urn%3Auuid%3A74003c94-cdc2-4fdc-8948-d4db509702ff&topic=%2A","headers":{REDACTED}},"bytes_read":0,"user_id":"","duration":38.964192544,"size":407,"status":200,"resp_headers":{REDACTED}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2789543,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"53440","client_ip":"127.0.0.1","proto":"HTTP/1.1","method":"GET","host":"localhost:3000","uri":"/.well-known/mercure?lastEventID=urn%3Auuid%3Add133490-ed99-4d32-968a-40ba36492738&topic=%2A","headers":{REDACTED},"bytes_read":0,"user_id":"","duration":41.440676106,"size":4,"status":200,"resp_headers":{REDACTED}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.279367,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2793803,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0}
Nov 21 20:41:44 linux645 systemd[1]: mercure.service: Deactivated successfully.
Nov 21 20:41:44 linux645 systemd[1]: Stopped mercure.service - Mercure server.

As I did not find any note about this in the documentation, I am wondering what the best course of action is. Is this something that should only be added to the documentation, or is it maybe even something that could be configured by default in the Caddyfile that is bundled with the tar.gz releases?

@dunglas
Owner

dunglas commented Nov 21, 2024

Good catch. IMHO, we should detect that Caddy is shutting down (we may use the context for that) and cleanly close the SSE connection on our side. This should be done in the Caddy module.
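
Something along these lines, just a sketch to illustrate the idea (not the actual module code; the hub field and helper names are made up):

package mercure

import (
    "fmt"
    "net/http"
)

type hub struct {
    // closed from the Caddy module's Stop/Cleanup hook when Caddy shuts down
    shutdown chan struct{}
}

func (h *hub) subscribe(w http.ResponseWriter, r *http.Request) {
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "text/event-stream")

    updates := h.updatesFor(r) // hypothetical helper returning the subscriber's channel
    for {
        select {
        case <-r.Context().Done():
            return // client closed the connection
        case <-h.shutdown:
            return // Caddy is shutting down: release the SSE connection immediately
        case u := <-updates:
            fmt.Fprintf(w, "data: %s\n\n", u)
            flusher.Flush()
        }
    }
}

// updatesFor is a stand-in for the real subscriber plumbing.
func (h *hub) updatesFor(_ *http.Request) <-chan string { return make(chan string) }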

In the meantime, your workaround looks good and we could at least update the default config.

bobvandevijver added a commit to bobvandevijver/mercure that referenced this issue Nov 25, 2024
As per dunglas#986, adds a default grace period of 1s. If the same global option has been set using $GLOBAL_OPTIONS, it will override the default one.
@dunglas
Owner

dunglas commented Nov 25, 2024

Thanks for the PR @bobvandevijver.

I dived into this issue, and it looks like the current behavior is "expected". When I try to reproduce locally, the server closes all connections and finally terminates:

{"level":"info","ts":1732559992.2463958,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
{"level":"warn","ts":1732559992.2464871,"msg":"exiting; byeee!! 👋","signal":"SIGTERM"}
{"level":"info","ts":1732559992.2465448,"logger":"http","msg":"servers shutting down with eternal grace period"}
{"level":"debug","ts":1732559992.247116,"logger":"http.handlers.mercure","msg":"Connection closed by the client","subscriber":{"id":"urn:uuid:e1d0b3df-f2a8-43c9-b19a-b82b59737c44","last_event_id":"","remote_addr":"127.0.0.1:51060","topic_selectors":["https://example.com/my-private-topic","{scheme}://{+host}/demo/books/{id}.jsonld","/.well-known/mercure/subscriptions{/topic}{/subscriber}"],"topics":["https://example.com/my-private-topic"]}}
{"level":"info","ts":1732559992.256612,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{"id":"urn:uuid:e1d0b3df-f2a8-43c9-b19a-b82b59737c44","last_event_id":"","remote_addr":"127.0.0.1:51060","topic_selectors":["https://example.com/my-private-topic","{scheme}://{+host}/demo/books/{id}.jsonld","/.well-known/mercure/subscriptions{/topic}{/subscriber}"],"topics":["https://example.com/my-private-topic"]}}
{"level":"info","ts":1732559992.256854,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"51060","client_ip":"127.0.0.1","proto":"HTTP/3.0","method":"GET","host":"localhost","uri":"/.well-known/mercure?topic=https%3A%2F%2Fexample.com%2Fmy-private-topic","headers":{"Sec-Gpc":["1"],"Priority":["u=0"],"Accept":["text/event-stream"],"Alt-Used":["localhost"],"Dnt":["1"],"Cookie":["REDACTED"],"Sec-Fetch-Site":["same-origin"],"Authorization":["REDACTED"],"Referer":["https://localhost/.well-known/mercure/ui/"],"Sec-Fetch-Dest":["empty"],"Sec-Fetch-Mode":["cors"],"Pragma":["no-cache"],"Cache-Control":["no-cache"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0"],"Accept-Language":["fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3"],"Accept-Encoding":["gzip, deflate, br, zstd"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h3","server_name":"localhost"}},"bytes_read":0,"user_id":"","duration":40.463308167,"size":4,"status":200,"resp_headers":{"Expire":["0"],"Server":["Caddy"],"Content-Type":["text/event-stream"],"Pragma":["no-cache"],"Content-Security-Policy":["default-src 'self' mercure.rocks cdn.jsdelivr.net"],"Connection":["keep-alive"],"Cache-Control":["private, no-cache, no-store, must-revalidate, max-age=0"],"X-Accel-Buffering":["no"],"Date":["Mon, 25 Nov 2024 18:39:11 GMT"],"X-Frame-Options":["DENY"],"X-Content-Type-Options":["nosniff"],"X-Xss-Protection":["1; mode=block"]}}
{"level":"info","ts":1732559992.25741,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
{"level":"info","ts":1732559992.257416,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0}

I think the issue you're getting is that some of your connections with the clients are "stuck", so the underlying goroutine hangs and https://github.com/dunglas/mercure/blob/main/subscribe.go#L125 is never called.

Could you show me your timeout configs (if any)? Setting Mercure's write_timeout may not be enough. Setting the write timeout here in addition to (or in place of) the grace period can help: https://caddyserver.com/docs/caddyfile/options#timeouts
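
Something like this in the global options, for instance (example values, adjust to your setup):

{
    grace_period 1s
    servers {
        timeouts {
            write 30s
        }
    }
}

Note that a write timeout also applies to healthy SSE streams, so it effectively caps how long a subscriber can stay connected before it has to reconnect.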

@bobvandevijver
Author

bobvandevijver commented Nov 25, 2024

I am using the default Caddyfile with only some configuration for the server name, transport and keys.

SERVER_NAME=':3000'
MERCURE_PUBLISHER_JWT_KEY=something
MERCURE_SUBSCRIBER_JWT_KEY=something
MERCURE_EXTRA_DIRECTIVES='transport bolt {
  path mercure.db
  size 10000
}'

However, I am using an nginx proxy (based on https://mercure.rocks/docs/hub/nginx) to be able to host the hub on the same port and domain as the main Symfony application. Client IP detection doesn't currently work (but that is why I just responded to #546).

        location /.well-known/mercure {
                proxy_pass http://localhost:3000;
                proxy_read_timeout 24h;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
                proxy_pass_header Upgrade;

                # Remove untrusted headers
                proxy_set_header Forwarded "";
                proxy_set_header X-Real-IP "";
                proxy_set_header X-Forwarded-Scheme "";

                ## Be sure to set USE_FORWARDED_HEADERS=1 to allow the hub to use those headers ##
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Host $host;
                proxy_set_header X-Forwarded-Proto $scheme;
        }

Maybe that is relevant? I can consistently reproduce the issue with a 10s timeout window, where a freshly created connection isn't dropped (also visible in the browser console) when the restart is requested. Only after 10s is the onerror on the EventSource triggered.
