Consider increasing default federation timeout #8338
Comments
Hey, thanks for your detailed issue. Making the federation timeout configurable generally sounds like a good idea. This would allow admins to play around with the value and see what works best for them given their resource limitations. We are a bit wary about changing the default value itself, though. If it's configurable and people find that raising the timeout generally gives a better experience, then we can take a look at changing the defaults, or potentially work out a way to scale the timeout when needed.
Currently requests are indeed sent out in parallel as needed. There's been work in the past to queue batches of these requests, under the name "Federation Side-Bus", but it's been put on hiatus. This is a known problem though, and there are other solutions than just increasing the timeout, but it's a hard problem I'm afraid.
> federation-side bus.
Thank you for the detailed answer, that makes sense. I've indeed seen a few PRs/issues mentioning batching requests, and wondered if it was related. I think it could actually help quite a bit. Nit: a bus has a different meaning to me, as a high-bandwidth link that is connected to multiple receivers at once (think public transit). "Federation queues" might be more explicit, unless it's called that for internal reasons, of course, or I am mistaken about buses :) Edit: ah, yes, #5883 looks pretty stale.
Maybe we could solve this by adding a configuration option that raises the timeout per hostname, i.e. only for communication with some known problematic servers, while all other servers still use the default timeout.
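A per-destination override could be as simple as a mapping that is consulted before falling back to the default. This is only a sketch of the idea, not Synapse code; the names `FEDERATION_TIMEOUT_OVERRIDES` and `get_federation_timeout` are invented for illustration, and the 60-second default is an assumption to check against your version.

```python
# Sketch of a per-destination timeout override (illustrative names only,
# not part of Synapse).

DEFAULT_FEDERATION_TIMEOUT = 60  # seconds; assumed default, check your version

# Admin-supplied overrides for known problematic destinations.
FEDERATION_TIMEOUT_OVERRIDES = {
    "matrix.org": 120,
    "slow.example.org": 240,
}

def get_federation_timeout(destination: str) -> int:
    """Return the timeout (in seconds) to use for a request to `destination`."""
    return FEDERATION_TIMEOUT_OVERRIDES.get(destination, DEFAULT_FEDERATION_TIMEOUT)
```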
Following a recent discussion in #synapse:matrix.org this still seems to be an issue. I am also suspecting it of causing trouble with my homeserver (the logs match, and I am experiencing irregular downtimes of my Synapse instance without any other log indication). I would be very thankful if any experienced Synapse dev would implement the fix suggested above (exposing the variable in the config)... It doesn't seem to be that much work at all to me, I'm just not familiar with the codebase. Anyhow, the issue is from Sep 2020 and this is a bump/reminder :)
Same, I'm full of
I had to increase the timeout to 120 seconds so I could join the room.
As an additional datapoint, I was seeing similar issues trying to re-join the #fedora-devel channel on libera.chat after getting idle-booted. When I increased the timeout to 120s, Element started throwing new errors (unable to obtain room_version for the room, etc), and still wouldn't connect. Increased again to 240s, and was finally able to join the channel (after a couple of failed attempts). I suspect, based on the errors Element was throwing, that I'm bumping into multiple problematic timeouts, but this is the one that successfully solved the problem for me this time around.
3–4 minutes is not enough for me. Here are logs from an attempt to join:
The faster remote room joins project will likely resolve this problem in most cases.
@anoadragon453 Is there anything that Synapse admins can do in the meantime? My issue is not that the server I'm trying to federate with is being slow, it's that my server isn't being patient enough. The server I'm trying to federate with isn't even on Synapse 1.69, let alone running with this new feature enabled, so whilst I appreciate that it will fix this in the long term, what can we do now?
Alternatively, expose it in a configuration file.
This is the timeout in question:
synapse/synapse/http/matrixfederationclient.py, line 261 at commit 837293c
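Since the value isn't currently exposed in the config, the workarounds described in the comments above amount to raising this hard-coded default locally. Below is a minimal sketch of that, assuming the attribute is called `default_timeout` on `MatrixFederationHttpClient`; the exact name and line number move between releases, so verify against your checkout before using anything like this.

```python
# Local patch sketch: bump the federation client's default timeout.
# Assumes the attribute is named `default_timeout`; verify against your
# Synapse version before relying on this.

from synapse.http.matrixfederationclient import MatrixFederationHttpClient

_original_init = MatrixFederationHttpClient.__init__

def _patched_init(self, *args, **kwargs):
    _original_init(self, *args, **kwargs)
    self.default_timeout = 120  # seconds, up from the stock default

MatrixFederationHttpClient.__init__ = _patched_init
```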
Rationale
Related: #7113 #6699 #5445 #8118
Also related to #8080 and #5604, as the telltale sign is this log pattern:
Note: the timeout error can also be something like
That indicates, if I am not mistaken, that the transaction actually returned something after the timeout expired.
High level symptoms
Federation was working bi-directionally with some (smaller?) homeservers. It wasn't working with matrix.org, though (my homeserver could receive events -- PDU and EDU -- but matrix.org wouldn't receive events from my server, except when backfilling from others, of course).
This started happening after a ~1-day federation outage (synapse was down for postgresql maintenance & upgrade, and I hit another synapse bug on start-up).
Probable causes
This is likely due to my having a slow internet connection (ADSL), subject to bufferbloat. Trying to transmit a lot of data at once will dramatically increase latency, even causing DNS lookups to fail.
The origin is likely events piling up while the server wasn't federating, combined with the new async I/O architecture being too performant (meaning it tries to send everything at once on start-up). I successfully recovered from multi-week federation outages in the past.
I got around #8118 by increasing dnsmasq's max simultaneous connections to 300 from the default of 100, and its number of cache entries to 4096 up from 256, then restarting synapse twice. It is possible that changing the federation timeout could have helped with that as well.
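For reference, the equivalent dnsmasq.conf fragment would presumably look like the following; I'm assuming the options in question are `dns-forward-max` and `cache-size`, so double-check against the dnsmasq documentation for your version.

```
# /etc/dnsmasq.conf fragment (option names assumed, verify for your version)

# Maximum number of concurrent DNS queries forwarded upstream (raised from 100 to 300)
dns-forward-max=300

# Number of entries in dnsmasq's DNS cache (raised from 256 to 4096)
cache-size=4096
```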
Considerations
That might only happen in a number of scenarios:
As such, it might be better to progressively increase the timeout (though a large default timeout then makes little sense), or alternatively, to use a large timeout on startup and progressively reduce it; a sketch of the latter follows.
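Here is a rough sketch of that second variant: a generous timeout at start-up that decays towards the normal value as the backlog drains. The numbers and names are made up for illustration only.

```python
# Sketch only: start with a generous timeout and decay linearly towards the
# steady-state value over the first 30 minutes of uptime.

import time

STARTUP_TIMEOUT = 240.0   # seconds, while the send backlog is being drained
STEADY_TIMEOUT = 60.0     # assumed normal federation timeout
DECAY_PERIOD = 30 * 60.0  # seconds over which to shrink the timeout

_process_start = time.monotonic()

def current_federation_timeout() -> float:
    """Interpolate from STARTUP_TIMEOUT down to STEADY_TIMEOUT over DECAY_PERIOD."""
    uptime = time.monotonic() - _process_start
    fraction = min(uptime / DECAY_PERIOD, 1.0)
    return STARTUP_TIMEOUT + (STEADY_TIMEOUT - STARTUP_TIMEOUT) * fraction
```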
Rate-limiting requests (including DNS), especially on start-up, might also help quite a bit.
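A similarly rough sketch of that idea is to cap how many outbound sends run at once, so a slow uplink isn't saturated on start-up. Plain asyncio is used here for brevity (Synapse itself is built on Twisted), and all names are illustrative.

```python
# Sketch only: cap concurrent outbound federation sends with a semaphore.

import asyncio

MAX_CONCURRENT_SENDS = 10
_send_limiter = asyncio.Semaphore(MAX_CONCURRENT_SENDS)

async def send_transaction(destination: str) -> None:
    async with _send_limiter:
        # Stand-in for the real network send; only the concurrency cap matters here.
        await asyncio.sleep(0.1)

async def main() -> None:
    destinations = [f"hs{i}.example.org" for i in range(100)]
    await asyncio.gather(*(send_transaction(d) for d in destinations))

if __name__ == "__main__":
    asyncio.run(main())
```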