Shutting down Mercure server takes indefinitely #986

bobvandevijver · 2024-11-21T19:54:55Z

I was looking into a very slow shutdown of the Mercure instance I am running on a Ubuntu server, using a systemd unit.

Unit file

[Unit]
Description=Mercure server
After=network.target

[Service]
Type=simple
EnvironmentFile=/opt/application/mercure/env
ExecStart=/opt/application/mercure/mercure run
Restart=always
RestartSec=5
RuntimeMaxSec=7d
WorkingDirectory=/opt/application/mercure
User=www-data

[Install]
WantedBy=multi-user.target

Logs

Nov 21 20:15:40 linux645 mercure[1197]: {"level":"info","ts":1732216540.629172,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
Nov 21 20:15:40 linux645 mercure[1197]: {"level":"warn","ts":1732216540.6292357,"msg":"exiting; byeee!! �<9F><91><8B>","signal":"SIGTERM"}
Nov 21 20:15:40 linux645 mercure[1197]: {"level":"info","ts":1732216540.6292503,"logger":"http","msg":"servers shutting down with eternal grace period"}
Nov 21 20:15:40 linux645 systemd[1]: Stopping mercure.service - Mercure server...
Nov 21 20:16:19 linux645 mercure[1197]: {"level":"info","ts":1732216579.8925092,"logger":"http.handlers.mercure","msg":"Subscriber disconnected","subscriber":{"id":"urn:uuid:adce74f5-1f27-4912-91b6-83cd4442f0a>
Nov 21 20:16:19 linux645 mercure[1197]: {"level":"info","ts":1732216579.89257,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"55950","client_ip":"127.0>
Nov 21 20:16:31 linux645 mercure[1197]: {"level":"info","ts":1732216591.1713917,"logger":"http.handlers.mercure","msg":"Subscriber disconnected","subscriber":{"id":"urn:uuid:5db2a139-2a36-4f4d-a7e9-e4b1ed7d6fb>
Nov 21 20:16:31 linux645 mercure[1197]: {"level":"info","ts":1732216591.17145,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"37544","client_ip":"127.0>
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: State 'stop-sigterm' timed out. Killing.
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Killing process 1197 (mercure) with signal SIGKILL.
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Failed to kill control group /system.slice/mercure.service, ignoring: Invalid argument
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Main process exited, code=killed, status=9/KILL
Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Failed with result 'timeout'.

From the logs it became clear that the default shutdown action (SIGTERM), while received perfectly fine by caddy/mercure, didn't really do anything except for denying new connections. After a 90 seconds, systemd would have had enough, and use SIGKILL to get rid of the process.

One line caught my attention:

servers shutting down with eternal grace period

This is something that is described in the Caddy manual under the grace_period option and allows the clients to complete their request before the server is actually shutdown. However, with SSE and long running connections, this does not really make sense as those clients will keep the connection option.

I have now added an additional env variable to my configuration:

GLOBAL_OPTIONS=grace_period 1s

And immediately mercure shutdowns fast, although with some errors in the log:

Logs with 1s `grace_period`

Nov 21 20:41:43 linux645 mercure[46014]: {"level":"info","ts":1732218103.2779043,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
Nov 21 20:41:43 linux645 mercure[46014]: {"level":"warn","ts":1732218103.2779415,"msg":"exiting; byeee!! 👋","signal":"SIGTERM"}
Nov 21 20:41:43 linux645 mercure[46014]: {"level":"info","ts":1732218103.277979,"logger":"http","msg":"servers shutting down; grace period initiated","duration":1}
Nov 21 20:41:43 linux645 systemd[1]: Stopping nl-mercure.service - Mercure server...
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"error","ts":1732218104.278682,"logger":"http","msg":"server shutdown","error":"context deadline exceeded","addresses":[":3000"]}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"warn","ts":1732218104.2788424,"logger":"http.handlers.mercure","msg":"Failed to remove subscriber on shutdown","subscriber":{},"error":"hub: read/write on closed Transport"}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2788796,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"warn","ts":1732218104.2788115,"logger":"http.handlers.mercure","msg":"Failed to remove subscriber on shutdown","subscriber":{},"error":"hub: read/write on closed Transport"}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.27891,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.27891,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"46232","client_ip":"127.0.0.1","proto":"HTTP/1.1","method":"GET","host":"localhost:3000","uri":"/.well-known/mercure?lastEventID=urn%3Auuid%3A74003c94-cdc2-4fdc-8948-d4db509702ff&topic=%2A","headers":{REDACTED}},"bytes_read":0,"user_id":"","duration":38.964192544,"size":407,"status":200,"resp_headers":{REDACTED}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2789543,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"53440","client_ip":"127.0.0.1","proto":"HTTP/1.1","method":"GET","host":"localhost:3000","uri":"/.well-known/mercure?lastEventID=urn%3Auuid%3Add133490-ed99-4d32-968a-40ba36492738&topic=%2A","headers":{REDACTED},"bytes_read":0,"user_id":"","duration":41.440676106,"size":4,"status":200,"resp_headers":{REDACTED}}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.279367,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2793803,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0}
Nov 21 20:41:44 linux645 systemd[1]: mercure.service: Deactivated successfully.
Nov 21 20:41:44 linux645 systemd[1]: Stopped mercure.service - Mercure server.

As I did not find any note about this in the documentation, I am wondering what the best course of action is. Is this be something that should only be added to the documentation, or is this maybe even something that we can can configure by default in the Caddyfile that is bundled with the tar.gz releases?

The text was updated successfully, but these errors were encountered:

dunglas · 2024-11-21T21:53:09Z

Good catch. IMHO, we should detect that Caddy is shutting down (we may use the context for that) and cleanly close the SSE connection on our side. This should be done in the Caddy module.

In the meantime, your workaround looks good and we could at least update the default config.

As per dunglas#986, adds a default grace period of 1s. If the same global option has been set using the $GLOBAL_OPTIONS it wil override the default one.

dunglas · 2024-11-25T18:52:55Z

Thanks for the PR @bobvandevijver.

I dived into this issue, and it looks like the current behavior is "expected". When I try to reproduce locally, the server closes all connections and finally terminates:

{"level":"info","ts":1732559992.2463958,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
{"level":"warn","ts":1732559992.2464871,"msg":"exiting; byeee!! 👋","signal":"SIGTERM"}
{"level":"info","ts":1732559992.2465448,"logger":"http","msg":"servers shutting down with eternal grace period"}
{"level":"debug","ts":1732559992.247116,"logger":"http.handlers.mercure","msg":"Connection closed by the client","subscriber":{"id":"urn:uuid:e1d0b3df-f2a8-43c9-b19a-b82b59737c44","last_event_id":"","remote_addr":"127.0.0.1:51060","topic_selectors":["https://example.com/my-private-topic","{scheme}://{+host}/demo/books/{id}.jsonld","/.well-known/mercure/subscriptions{/topic}{/subscriber}"],"topics":["https://example.com/my-private-topic"]}}
{"level":"info","ts":1732559992.256612,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{"id":"urn:uuid:e1d0b3df-f2a8-43c9-b19a-b82b59737c44","last_event_id":"","remote_addr":"127.0.0.1:51060","topic_selectors":["https://example.com/my-private-topic","{scheme}://{+host}/demo/books/{id}.jsonld","/.well-known/mercure/subscriptions{/topic}{/subscriber}"],"topics":["https://example.com/my-private-topic"]}}
{"level":"info","ts":1732559992.256854,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"51060","client_ip":"127.0.0.1","proto":"HTTP/3.0","method":"GET","host":"localhost","uri":"/.well-known/mercure?topic=https%3A%2F%2Fexample.com%2Fmy-private-topic","headers":{"Sec-Gpc":["1"],"Priority":["u=0"],"Accept":["text/event-stream"],"Alt-Used":["localhost"],"Dnt":["1"],"Cookie":["REDACTED"],"Sec-Fetch-Site":["same-origin"],"Authorization":["REDACTED"],"Referer":["https://localhost/.well-known/mercure/ui/"],"Sec-Fetch-Dest":["empty"],"Sec-Fetch-Mode":["cors"],"Pragma":["no-cache"],"Cache-Control":["no-cache"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0"],"Accept-Language":["fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3"],"Accept-Encoding":["gzip, deflate, br, zstd"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h3","server_name":"localhost"}},"bytes_read":0,"user_id":"","duration":40.463308167,"size":4,"status":200,"resp_headers":{"Expire":["0"],"Server":["Caddy"],"Content-Type":["text/event-stream"],"Pragma":["no-cache"],"Content-Security-Policy":["default-src 'self' mercure.rocks cdn.jsdelivr.net"],"Connection":["keep-alive"],"Cache-Control":["private, no-cache, no-store, must-revalidate, max-age=0"],"X-Accel-Buffering":["no"],"Date":["Mon, 25 Nov 2024 18:39:11 GMT"],"X-Frame-Options":["DENY"],"X-Content-Type-Options":["nosniff"],"X-Xss-Protection":["1; mode=block"]}}
{"level":"info","ts":1732559992.25741,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
{"level":"info","ts":1732559992.257416,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0}

I think that the issue you're getting is because some of your connections with the clients are "stuck", so the underlying goroutine is stuck and https://github.com/dunglas/mercure/blob/main/subscribe.go#L125 is never called.

Could you show me your timeout configs (if any)? Setting Mercure's write_timeout may not be enough. Setting write here in addition (or in place) of the grace can help: https://caddyserver.com/docs/caddyfile/options#timeouts

bobvandevijver · 2024-11-25T20:43:30Z

I am using the default Caddy file with only some configuration for the server name, transport and keys.

SERVER_NAME=':3000'
MERCURE_PUBLISHER_JWT_KEY=something
MERCURE_SUBSCRIBER_JWT_KEY=something
MERCURE_EXTRA_DIRECTIVES='transport bolt {
  path mercure.db
  size 10000
}'

However, I am using a nginx proxy (based on https://mercure.rocks/docs/hub/nginx) to be able to host the hub on the same port and domain as the main Symfony application. Client IP detection doesn't work currently (but that is why I just responded to #546).

        location /.well-known/mercure {
                proxy_pass http://localhost:3000;
                proxy_read_timeout 24h;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
                proxy_pass_header Upgrade;

                # Remove untrusted headers
                proxy_set_header Forwarded "";
                proxy_set_header X-Real-IP "";
                proxy_set_header X-Forwarded-Scheme "";

                ## Be sure to set USE_FORWARDED_HEADERS=1 to allow the hub to use those headers ##
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Host $host;
                proxy_set_header X-Forwarded-Proto $scheme;
        }

Maybe that is relevant? I can consistently reproduce the issue with a 10s timeout window, where a freshly created connection isn't dropped (also visible in the browser console) when the restart is requested. Only after 10s the onerror on the EventSource is triggered.

dunglas · 2025-02-17T09:18:12Z

FIxed by #1005.

bobvandevijver added a commit to bobvandevijver/mercure that referenced this issue Nov 25, 2024

Add default shutdown grace period

b36720a

As per dunglas#986, adds a default grace period of 1s. If the same global option has been set using the $GLOBAL_OPTIONS it wil override the default one.

bobvandevijver mentioned this issue Nov 25, 2024

fix: Add default shutdown grace period #987

Open

erseco mentioned this issue Feb 6, 2025

docs: add proxy_buffering off to Nginx example for SSL proxy compatibility #1005

Merged

dunglas closed this as completed Feb 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shutting down Mercure server takes indefinitely #986

Shutting down Mercure server takes indefinitely #986

bobvandevijver commented Nov 21, 2024

dunglas commented Nov 21, 2024

dunglas commented Nov 25, 2024

bobvandevijver commented Nov 25, 2024 •

edited

Loading

dunglas commented Feb 17, 2025

Shutting down Mercure server takes indefinitely #986

Shutting down Mercure server takes indefinitely #986

Comments

bobvandevijver commented Nov 21, 2024

dunglas commented Nov 21, 2024

dunglas commented Nov 25, 2024

bobvandevijver commented Nov 25, 2024 • edited Loading

dunglas commented Feb 17, 2025

bobvandevijver commented Nov 25, 2024 •

edited

Loading