Outbound federation breaks to a particular server until restart #3842
Comments
One recent change that may be relevant is that we've moved to use treq. We also don't seem to time out reading the body from the response: synapse/synapse/http/matrixfederationclient.py Lines 380 to 398 in 9a5ea51
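A minimal sketch of the missing guard, assuming treq plus a Twisted reactor (the function name and the 60-second default are illustrative, not Synapse's code). The `timeout` passed to treq bounds the wait for the response itself, not the subsequent body read, so reading the body needs its own timeout:

```python
from twisted.internet import defer, reactor
import treq


@defer.inlineCallbacks
def get_body_with_timeout(url, timeout=60):
    # treq's `timeout` argument bounds the wait for the response *headers*.
    response = yield treq.get(url, timeout=timeout)

    # Reading the body is a separate deferred which otherwise has no timeout
    # at all: guard it with addTimeout, which cancels the read and converts
    # the resulting CancelledError into a TimeoutError.
    body_d = treq.content(response)
    body_d.addTimeout(timeout, reactor)
    body = yield body_d

    defer.returnValue(body)
```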
Right, so what probably isn't helping is that our retry schedule for "long retries" is insane: synapse/synapse/http/matrixfederationclient.py Lines 255 to 266 in 9a5ea51
Edit: which gives a potential maximum sleep duration of over 16 days.
In what will shock everyone, it turns out I can't read: the maximum sleep time is actually only 84 seconds.
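For the record, a rough reconstruction of the arithmetic behind those two numbers (the `4 ** attempt` schedule, the 60-second cap and the 1.4 jitter factor are inferred from the figures quoted above, not copied from the Synapse code):

```python
import random

MAX_LONG_RETRIES = 10  # assumed number of long retries


def long_retry_delay(attempt):
    delay = 4 ** attempt    # uncapped, 4 ** 10 * 1.4 seconds is roughly 17 days
    delay = min(delay, 60)  # the cap that the "over 16 days" reading missed
    return delay * random.uniform(0.8, 1.4)


# Worst case for a single sleep is therefore 60 * 1.4 = 84 seconds.
print(min(4 ** MAX_LONG_RETRIES, 60) * 1.4)
```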
After the logging change it's clear that things are getting wedged in the actual request deferred, which suggests that the timeouts are broken.
This is an attempt to mitigate #3842 by adding yet-another-timeout
The awful hack didn't work. But I found a stack trace:
so I hacked the hack to hack around that, and moved the cancellation until after resolving the new deferred.
The hack is: synapse/synapse/util/async_helpers.py Lines 443 to 492 in 9d13ff4
So the conclusion is that this is a bug in Twisted, where the timeout handling breaks if a deferred's canceller throws an exception.
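A self-contained illustration of that failure mode against plain Twisted deferreds (a generic reproduction, not code from this issue): if the canceller raises, `Deferred.cancel()` propagates the exception instead of errbacking the deferred, so anything waiting on it — including a timeout wrapper that cancels before resolving its own deferred — stays wedged:

```python
from twisted.internet import defer


def angry_canceller(d):
    # A canceller that blows up, e.g. because the thing it is trying to
    # tear down is in an unexpected state.
    raise RuntimeError("canceller blew up")


d = defer.Deferred(canceller=angry_canceller)

try:
    d.cancel()
except RuntimeError as e:
    print("cancel() raised instead of errbacking the deferred: %s" % e)

# The deferred never fires, so any code yielding on it hangs forever.
print("deferred has fired: %s" % d.called)  # False
```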
The existing deferred timeout helper function (and the one built into Twisted) suffers from a bug when a deferred's canceller throws an exception (#3842). The new helper function doesn't suffer from this problem.
We've updated all timeouts to use the new function.
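A minimal sketch of the shape of such a helper (the name `timeout_deferred` and the details below are illustrative, not Synapse's exact implementation). The key point, matching the hack above, is that the caller-facing deferred is resolved before the original is cancelled, and the cancel is wrapped so a throwing canceller can't wedge anything:

```python
from twisted.internet import defer
from twisted.python import failure


def timeout_deferred(deferred, timeout, reactor):
    """Return a new Deferred mirroring `deferred`, which errbacks with
    TimeoutError after `timeout` seconds even if cancelling `deferred` fails.
    """
    new_d = defer.Deferred()
    timed_out = [False]

    def time_it_out():
        timed_out[0] = True
        # Resolve the caller-facing deferred *before* cancelling, so a
        # canceller that throws cannot leave the caller wedged.
        if not new_d.called:
            new_d.errback(defer.TimeoutError("Timed out after %gs" % (timeout,)))
        try:
            deferred.cancel()
        except Exception:
            # A broken canceller must not break the timeout path.
            pass

    delayed_call = reactor.callLater(timeout, time_it_out)

    def convert(result):
        if not timed_out[0]:
            if delayed_call.active():
                delayed_call.cancel()
            if isinstance(result, failure.Failure):
                new_d.errback(result)
            else:
                new_d.callback(result)
        # If we already timed out, swallow the late result/failure here.

    deferred.addBoth(convert)
    return new_d
```

Callers then wait on the returned deferred rather than the original, e.g. `result = yield timeout_deferred(request_d, 60, reactor)`.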
No events were sent from jki.re to matrix.org for ~24h. Other requests were successfully sent.
The last log line in the federation transmission loop for matrix.org is:
suggesting that the deferred to actually send the transaction got wedged entirely.
Later we have an unhandled error, which may or may not be related.
Edit: Version: Synapse/0.33.4 (b=develop,9a5ea511b)