we should be more intelligent with backoff for federation requests #8917

richvdh · 2020-12-10T13:31:00Z

It seems inappropriate that a single failed request can cause all subsequent requests to a server to fail for the next 10 minutes.

(See also #8915, which asks why a single failure is enough to trigger this, rather than at least a few.)

For example, when we are handling a federation transaction, we can end up needing to make many requests to /v1/event. If any one of these hundreds of requests fails, all subsequent requests also fail. The upshot is that it's hard to make progress in populating complex rooms over federation: if we did a better job of persisting the events we did receive rather than aborting halfway through the operation, we might be able to make progress in the right direction so that subsequent federation transactions have a better chance of succeeding.
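The "keep what we did receive" idea could be sketched roughly as follows. This is only an illustration, not Synapse's actual code: `fetch_events_keeping_partial_results` and the `fetch_event` callable are hypothetical names standing in for whatever makes the /v1/event request.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_events_keeping_partial_results(
    event_ids: Iterable[str],
    fetch_event: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[str]]:
    """Fetch each event, keeping the successes even if some requests fail.

    `fetch_event` is a hypothetical callable that returns the event for an
    ID or raises on failure. Returns (fetched events, IDs that failed).
    """
    fetched: Dict[str, dict] = {}
    failed: List[str] = []
    for event_id in event_ids:
        try:
            fetched[event_id] = fetch_event(event_id)
        except Exception:
            # Record the failure and carry on, rather than aborting the
            # whole batch on the first error.
            failed.append(event_id)
    return fetched, failed
```

The caller can then persist whatever is in the first return value and retry only the failed IDs later, so each transaction leaves the room in a slightly better state than before.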

Essentially I think we should consider that there are different sorts of requests that need different "backoff" behaviour:

- stuff we "push" (i.e., /v1/send requests) vs stuff that we "pull".
- stuff we pull from a specific server, vs stuff we could get from any server in the room. This is actually a spectrum, ranging from "claim E2E keys" which cannot possibly go anywhere else, through "fetch an event" which should probably go back to the server originating a transaction, to "join a room" where almost any server is as good as any other.

Obviously repeated failures to /send should mean we back off from further /send attempts; it should maybe also mean that the target server is moved down the preference list for "join a room" requests. But should it affect key-claim requests or /v1/event requests?
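One way to picture this is a backoff tracker keyed by destination and request category, so that failures in one category do not block another. The sketch below is an assumption-laden illustration, not how Synapse is implemented: `RequestCategory`, `CategoryBackoff`, the failure threshold, and the base interval are all made up for the example; only the 10-minute cap echoes the figure mentioned above.

```python
import enum
from dataclasses import dataclass
from typing import Dict, Tuple


class RequestCategory(enum.Enum):
    # Hypothetical categories, following the distinctions drawn above.
    PUSH = "push"                    # e.g. /v1/send
    PULL_SPECIFIC = "pull_specific"  # e.g. claiming E2E keys
    PULL_ANY = "pull_any"            # e.g. joining a room


@dataclass
class BackoffState:
    consecutive_failures: int = 0
    retry_after: float = 0.0  # time before which requests are skipped


class CategoryBackoff:
    """Tracks backoff separately for each (destination, category) pair."""

    def __init__(
        self,
        failure_threshold: int = 3,   # assumed: failures tolerated before backing off
        base_interval: float = 60.0,  # assumed: first backoff interval, in seconds
        max_interval: float = 600.0,  # cap at 10 minutes
    ) -> None:
        self.failure_threshold = failure_threshold
        self.base_interval = base_interval
        self.max_interval = max_interval
        self._states: Dict[Tuple[str, RequestCategory], BackoffState] = {}

    def should_attempt(
        self, destination: str, category: RequestCategory, now: float
    ) -> bool:
        state = self._states.get((destination, category))
        return state is None or now >= state.retry_after

    def record_failure(
        self, destination: str, category: RequestCategory, now: float
    ) -> None:
        state = self._states.setdefault((destination, category), BackoffState())
        state.consecutive_failures += 1
        excess = state.consecutive_failures - self.failure_threshold
        if excess >= 0:
            # Back off only once the threshold is reached, then double the
            # interval for each further failure, up to the cap.
            interval = min(self.base_interval * 2 ** excess, self.max_interval)
            state.retry_after = now + interval

    def record_success(self, destination: str, category: RequestCategory) -> None:
        # A success clears the backoff for this category only.
        self._states.pop((destination, category), None)
```

With this shape, repeated /send failures to a server would stop further `PUSH` attempts for a while, but `should_attempt` would still return `True` for a `PULL_SPECIFIC` key-claim request to the same server. Whether the categories should really be that independent is exactly the open question.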

We have some provision for this sort of thing with the "long retry" schedule, and the "ignore backoff" flag, but I don't think we use it consistently, and tbh I don't really think the larger picture has been considered: it's just been thrown together as the need arises.

This was referenced Feb 8, 2021

Synapse should not back off federation for 10 minutes due to a single received 5xx error #9335

Closed

we should backoff on 40x errors #5442

Open

richvdh mentioned this issue Mar 18, 2021

retry device resync doesn't follow exponentially back off algorithm #9603

Open

reivilibre mentioned this issue Aug 27, 2021

Doesn't back off on backfill if remote server returns 404 #10689

Closed

This was referenced Nov 30, 2021

Joining a room where all servers are blacklisted fails with obscure error #7297

Open

Confusing error when Synapse fails to connect to a domain due to it being on the blacklist #10224

Open

richvdh added the T-Enhancement label Dec 23, 2021

richvdh mentioned this issue Dec 23, 2021

server keeps trying to handle device list update on a server that doesn't exists anymore (301: Moved Permanently) #8983

Open

richvdh mentioned this issue Jan 27, 2022

spoofed event breaks federation (SYN-739) #1574

Closed

DMRobertson mentioned this issue Aug 25, 2022

The default federation ratecontrol limits may be too low for busy servers, and break badly. #4971

Closed

DMRobertson added the A-Federation, S-Major, O-Uncommon, and A-Performance labels Aug 25, 2022

richvdh mentioned this issue Oct 10, 2022

Faster joins: smarter algorithm for picking a server to resync from #12999

Closed

MadLittleMods mentioned this issue Jun 5, 2023

Add context for when/why to use the long_retries option when sending Federation requests #15721

Merged

4 tasks

matrixbot mentioned this issue Dec 21, 2023

we should be more intelligent with backoff for federation requests element-hq/synapse#8917

Open
