Consider increasing default federation timeout #8338
Comments
Hey, thanks for your detailed issue. Making the federation timeout configurable generally sounds like a good idea. This would allow admins to play around with the value and see what works best for them given their resource limitations. We are a bit wary about changing the default value itself, though. If it's configurable and people find that raising the timeout generally gives a better experience, then we can take a look at changing the defaults, or potentially work out a way to scale the timeout when needed.
Currently requests are indeed sent out in parallel as needed. There's been work in the past to queue batches of these requests, under the name "Federation Side-Bus", but it's been put on hiatus. This is a known problem though, and there are other solutions than just increasing the timeout, but it's a hard problem I'm afraid.
> federation-side bus.
Thank you for the detailed answer, that makes sense. I've indeed seen a few PRs/issues mentioning batching requests, and wondered if it was related. I think it could actually help quite a bit. Nit: a bus has a different meaning to me, as a high-bandwidth link that is connected to multiple receivers at once (think public transit). "Federation queues" might be more explicit, unless it's called that for internal reasons, of course, or I am mistaken about buses :) Edit: ah, yes, #5883 looks pretty stale.
Maybe we could solve this by adding a configuration option that raises the timeout per hostname, i.e. only for communication with some known problematic servers, while all other servers still use the default timeout.
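A per-destination override could be as simple as a mapping that is consulted before falling back to the default. This is only a sketch of the idea, not Synapse code; the names `FEDERATION_TIMEOUT_OVERRIDES` and `get_federation_timeout` are invented for illustration, and the 60-second default is an assumption to check against your version.

```python
# Sketch of a per-destination timeout override (illustrative names only,
# not part of Synapse).

DEFAULT_FEDERATION_TIMEOUT = 60  # seconds; assumed default, check your version

# Admin-supplied overrides for known problematic destinations.
FEDERATION_TIMEOUT_OVERRIDES = {
    "matrix.org": 120,
    "slow.example.org": 240,
}

def get_federation_timeout(destination: str) -> int:
    """Return the timeout (in seconds) to use for a request to `destination`."""
    return FEDERATION_TIMEOUT_OVERRIDES.get(destination, DEFAULT_FEDERATION_TIMEOUT)
```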
Following a recent discussion in #synapse:matrix.org this still seems to be an issue. I am also suspecting it of causing trouble with my homeserver (the logs match, and I am experiencing irregular downtimes of my Synapse instance without any other log indication). I would be very thankful if any experienced Synapse dev would implement the fix suggested above (exposing the variable in the config)... It doesn't seem to be that much work at all to me, I'm just not familiar with the codebase. Anyhow, the issue is from Sep 2020 and this is a bump/reminder :)
Same, I'm full of
I had to increase the timeout to 120 seconds so I could join the room.
As an additional datapoint, I was seeing similar issues trying to re-join the #fedora-devel channel on libera.chat after getting idle-booted. When I increased the timeout to 120s, Element started throwing new errors (unable to obtain room_version for the room, etc), and still wouldn't connect. Increased again to 240s, and was finally able to join the channel (after a couple of failed attempts). I suspect, based on the errors Element was throwing, that I'm bumping into multiple problematic timeouts, but this is the one that successfully solved the problem for me this time around.
3–4 minutes is not enough for me. Here are logs from an attempt to join:
The faster remote room joins project will likely resolve this problem in most cases.
@anoadragon453 Is there anything that Synapse admins can do in the meantime? My issue is not that the server I'm trying to federate with is being slow, it's that my server isn't being patient enough. The server I'm trying to federate with isn't even on Synapse 1.69, let alone running with this new feature enabled, so whilst I appreciate that it will fix this in the long term, what can we do now?
Alternatively, expose it in a configuration file.
This is the timeout in question:
synapse/synapse/http/matrixfederationclient.py, line 261 at commit 837293c
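Since the value isn't currently exposed in the config, the workarounds described in the comments above amount to raising this hard-coded default locally. Below is a minimal sketch of that, assuming the attribute is called `default_timeout` on `MatrixFederationHttpClient`; the exact name and line number move between releases, so verify against your checkout before using anything like this.

```python
# Local patch sketch: bump the federation client's default timeout.
# Assumes the attribute is named `default_timeout`; verify against your
# Synapse version before relying on this.

from synapse.http.matrixfederationclient import MatrixFederationHttpClient

_original_init = MatrixFederationHttpClient.__init__

def _patched_init(self, *args, **kwargs):
    _original_init(self, *args, **kwargs)
    self.default_timeout = 120  # seconds, up from the stock default

MatrixFederationHttpClient.__init__ = _patched_init
```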
Rationale
Related: #7113 #6699 #5445 #8118
Also related to #8080 and #5604, as the telltale sign is this log pattern:
Note: the timeout error can also be something like
That indicates, if I am not mistaken, that the transaction actually returned something after the timeout expired.
High level symptoms
Federation was working bi-directionally with some (smaller?) homeservers. It wasn't working with matrix.org, though (my homeserver could receive events -- PDU and EDU -- but matrix.org wouldn't receive events from my server, except when backfilling from others, of course).
This started happening after a ~1-day federation outage (synapse was down for postgresql maintenance & upgrade, and I hit another synapse bug on start-up).
Probable causes
This is likely due to my having a slow internet connection (ADSL), subject to bufferbloat. Trying to transmit a lot of data at once will dramatically increase latency, even causing DNS lookups to fail.
The origin is likely events piling up while the server wasn't federating, combined with the new async I/O architecture being too performant (meaning it tries to send everything at once on start-up). I successfully recovered from multi-week federation outages in the past.
I got around #8118 by increasing dnsmasq's max simultaneous connections to 300 from the default of 100, and its number of cache entries to 4096 up from 256, then restarting synapse twice. It is possible that changing the federation timeout could have helped with that as well.
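For reference, the equivalent dnsmasq.conf fragment would presumably look like the following; I'm assuming the options in question are `dns-forward-max` and `cache-size`, so double-check against the dnsmasq documentation for your version.

```
# /etc/dnsmasq.conf fragment (option names assumed, verify for your version)

# Maximum number of concurrent DNS queries forwarded upstream (raised from 100 to 300)
dns-forward-max=300

# Number of entries in dnsmasq's DNS cache (raised from 256 to 4096)
cache-size=4096
```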
Considerations
That might only happen in a number of scenarios:
As such, it might be better to progressively increase the timeout (though a large default timeout then makes little sense), or alternatively, to use a large timeout on startup and progressively reduce it; a sketch of the latter follows.
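Here is a rough sketch of that second variant: a generous timeout at start-up that decays towards the normal value as the backlog drains. The numbers and names are made up for illustration only.

```python
# Sketch only: start with a generous timeout and decay linearly towards the
# steady-state value over the first 30 minutes of uptime.

import time

STARTUP_TIMEOUT = 240.0   # seconds, while the send backlog is being drained
STEADY_TIMEOUT = 60.0     # assumed normal federation timeout
DECAY_PERIOD = 30 * 60.0  # seconds over which to shrink the timeout

_process_start = time.monotonic()

def current_federation_timeout() -> float:
    """Interpolate from STARTUP_TIMEOUT down to STEADY_TIMEOUT over DECAY_PERIOD."""
    uptime = time.monotonic() - _process_start
    fraction = min(uptime / DECAY_PERIOD, 1.0)
    return STARTUP_TIMEOUT + (STEADY_TIMEOUT - STARTUP_TIMEOUT) * fraction
```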
Rate-limiting requests (including DNS), especially on start-up, might also help quite a bit.
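A similarly rough sketch of that idea is to cap how many outbound sends run at once, so a slow uplink isn't saturated on start-up. Plain asyncio is used here for brevity (Synapse itself is built on Twisted), and all names are illustrative.

```python
# Sketch only: cap concurrent outbound federation sends with a semaphore.

import asyncio

MAX_CONCURRENT_SENDS = 10
_send_limiter = asyncio.Semaphore(MAX_CONCURRENT_SENDS)

async def send_transaction(destination: str) -> None:
    async with _send_limiter:
        # Stand-in for the real network send; only the concurrency cap matters here.
        await asyncio.sleep(0.1)

async def main() -> None:
    destinations = [f"hs{i}.example.org" for i in range(100)]
    await asyncio.gather(*(send_transaction(d) for d in destinations))

if __name__ == "__main__":
    asyncio.run(main())
```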