Homeservers don't catch up with missed traffic until someone sends another event #2528
Comments
some keywords: federation outage, resync room, server offline, recover
This is not just about a temporary outage with a subsequent recovery. A partial netsplit can exist by design during a transition to another underlying networking protocol. Imagine nodes that exist only in the I2P network (or any other privacy-oriented network). These nodes cannot communicate directly with normal HTTPS-based nodes that are not I2P-aware, but messages can be relayed through nodes that are dual-stack. I think a partial netsplit should be considered part of normal functioning, not a temporary outage.
I'm guessing this is the cause of https://twitter.com/jomatv6/status/1213161573759541256. Fixing the fact that we don't retry transactions when a server reappears seems very worthwhile (especially in a p2p world).
To limit the number of events we send, we can limit ourselves to only sending the most recent N events for each room that has missed updates. Side note: sending old events down
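A minimal sketch of that per-room cap, assuming a simplified events table keyed by a global stream_ordering (the names here are illustrative only, not Synapse's actual schema; SQLite 3.25+ is needed for the window function):

```python
import sqlite3

# Illustrative schema only; Synapse's real events table is far richer.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        event_id TEXT PRIMARY KEY,
        room_id TEXT NOT NULL,
        stream_ordering INTEGER NOT NULL
    )
    """
)

def most_recent_per_room(conn, since_ordering, limit_per_room=50):
    """Return at most `limit_per_room` of the newest missed events in each room."""
    return conn.execute(
        """
        SELECT event_id, room_id, stream_ordering FROM (
            SELECT event_id, room_id, stream_ordering,
                   ROW_NUMBER() OVER (
                       PARTITION BY room_id ORDER BY stream_ordering DESC
                   ) AS rn
            FROM events
            WHERE stream_ordering > ?
        ) AS ranked
        WHERE rn <= ?
        ORDER BY room_id, stream_ordering
        """,
        (since_ordering, limit_per_room),
    ).fetchall()
```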
just missed a relatively important message from @benbz thanks to this >:(
Re-triaging this because (as per the examples in the comments here) it's pretty disastrous that messages can just get stuck forever because of a brief federation outage. I get bitten by this every few weeks.
I will put my thoughts down regarding this. I don't really know Synapse's internal architecture, but here is what I imagine: homeservers could keep a list of the other homeservers they are in contact with, together with the timestamp of the last successful connection (or the timestamp of the first error, if you want to update the table less often). For every event:

In parallel, exponential backoff is implemented at the homeserver level, not at the room level. Incoming connections can also be a source of truth: when receiving a connection from a server that was previously down, clear the recorded error timestamp.

A couple of issues/thoughts with that implementation:

This is even more important if we think about a p2p world, and is one of the glaring issues with Matrix at present. In parallel, it would be nice for servers to schedule a sync at startup.
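A rough sketch of the bookkeeping this comment describes, using hypothetical table and function names (not Synapse's actual schema):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE remote_servers (
        server_name TEXT PRIMARY KEY,
        last_successful_connection INTEGER,  -- unix timestamp, NULL if never reached
        first_error INTEGER                  -- unix timestamp of first failure, NULL if healthy
    )
    """
)

def record_success(server_name):
    """Mark a server as reachable and clear any recorded error."""
    conn.execute(
        """
        INSERT INTO remote_servers (server_name, last_successful_connection, first_error)
        VALUES (?, ?, NULL)
        ON CONFLICT (server_name) DO UPDATE SET
            last_successful_connection = excluded.last_successful_connection,
            first_error = NULL
        """,
        (server_name, int(time.time())),
    )

def record_failure(server_name):
    """Record the time of the first failure, leaving it untouched if already set."""
    conn.execute(
        """
        INSERT INTO remote_servers (server_name, first_error)
        VALUES (?, ?)
        ON CONFLICT (server_name) DO UPDATE SET
            first_error = COALESCE(first_error, excluded.first_error)
        """,
        (server_name, int(time.time())),
    )

# An incoming connection is treated as proof the server is back up:
# record_success("other.example.org")
```

Exponential backoff would then key off first_error per destination rather than per room, as the comment suggests.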
In case it is helpful, here is an implementation in Dendrite.
Next steps:
In #2527, you have said
In #2526, you have said 'And the fact we hear from a remote HS doesn't necessarily mean it's ready to receive unsent transactions. #2527 may be a better solution.'... oops. In any case, barring any new ideas, I take it we are left with #2526. I'm tempted to take this on, but I don't know whether I know enough about federation (it might be a worthwhile chance to learn!). I'm also tempted to believe that only the recovering homeserver is in the best position to know whether it is 'ready' to catch up. The idea of 'hearing that a homeserver is back online' seems a tad frail for some low-traffic cases; should a recovering HS go and prod others so that it gets noticed? I will study the Dendrite implementation.
I've been doing a bit of thinking about this over the weekend. First of all I think there are a couple of essential requirements for any solution here:
Now, I see two sides to this problem. The first is how we decide to resend data over to the other homeserver; the second is figuring out what that data should be.

The first part is relatively easy: as per the opening comment, options include having the other server pull, or resending when we get an incoming request from the remote server. Another option might just be an extended retry schedule. We essentially already have this mechanism today: when we successfully receive an incoming transaction, we will try sending any device list updates or to-device messages (see https://github.com/matrix-org/synapse/blob/release-v1.16.1/synapse/federation/sender/__init__.py#L494). To my mind, this is by far the most promising option as a hook for resending room events too, although as noted in #2526 (comment) and in @MayeulC's comment above, this solution is not without its downsides. Still, it'll be fine for a first pass.

(Incidentally, it looks to me as though matrix-org/dendrite#1077 is all about this first part of deciding when to send new data, rather than what to send. It also appears to be focussed somewhat on the P2P case, though maybe I'm missing something. We should check with the Dendrite folks - and indeed other HS developers - whether they have a more complete solution to this problem, though.)

So, the second part of the problem: once server A decides (by whatever mechanism) that server B is back in the game, how does it know what to send? An idea:

Currently, synapse's outgoing federation logic iterates through the event stream in

We also add to the current

So then, in

Outstanding problems:
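A minimal sketch of the 'incoming transaction as a catch-up hook' idea described in the comment above, assuming a hypothetical per-destination queue and a stored last-successfully-sent stream position (none of these names are Synapse's real internals):

```python
from dataclasses import dataclass

@dataclass
class CatchUpQueue:
    """Illustrative per-destination catch-up state; not Synapse's real class."""
    destination: str
    last_successful_stream_ordering: int = 0  # highest position the destination has acked
    catching_up: bool = False

    def start_catch_up(self, store, send_transaction):
        """Resend events newer than the last acked position, oldest first, in batches."""
        self.catching_up = True
        try:
            while True:
                # `store.get_events_after` is assumed to return up to `limit` events
                # destined for this server with stream_ordering > the given position,
                # in ascending order.
                events = store.get_events_after(
                    self.destination, self.last_successful_stream_ordering, limit=50
                )
                if not events:
                    break
                send_transaction(self.destination, events)  # may raise if it is down again
                self.last_successful_stream_ordering = events[-1].stream_ordering
        finally:
            self.catching_up = False


def on_incoming_transaction(origin, queues, store, send_transaction):
    """Hook: a transaction arriving from `origin` implies it is reachable again,
    so kick off catch-up for anything we previously failed to send it."""
    queue = queues.get(origin)
    if queue is not None and not queue.catching_up:
        queue.start_catch_up(store, send_transaction)
```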
That all broadly makes sense. I think the interesting thing to note is that when catching up, we're going to have to support doing so in batches (to stay under the event limit). That sort of implies that we need to keep ranges around in the DB describing the gaps we still need to send to a remote server. I think one way of doing that is to also have a table

This way the

On startup we could also go through all destinations and see if there are any entries in
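As an illustration of the kind of per-destination bookkeeping being discussed here (the table and column names below are invented for this sketch and are not Synapse's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- For each destination, the highest stream_ordering we know it has received.
    CREATE TABLE destination_progress (
        destination TEXT PRIMARY KEY,
        last_successful_stream_ordering INTEGER NOT NULL DEFAULT 0
    );

    -- For each (destination, room) pair, the latest event we still owe it.
    CREATE TABLE destination_pending_rooms (
        destination TEXT NOT NULL,
        room_id TEXT NOT NULL,
        stream_ordering INTEGER NOT NULL,
        PRIMARY KEY (destination, room_id)
    );
    """
)

def destinations_needing_catch_up(conn):
    """On startup: find destinations that still have rooms with unsent events."""
    return [
        row[0]
        for row in conn.execute(
            """
            SELECT DISTINCT p.destination
            FROM destination_pending_rooms AS p
            JOIN destination_progress AS d USING (destination)
            WHERE p.stream_ordering > d.last_successful_stream_ordering
            """
        )
    ]
```

Catch-up batches would then work through each destination's pending rooms until nothing remains above its last_successful_stream_ordering.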
As discussed in #synapse-dev: there's an alternative solution to this, which seems simpler and which I'd like to propose as an initial implementation. Essentially, we make sure to catch up with the least-recently-updated rooms first. In other words, on each transaction attempt, we do

A concern with this mechanism is that we might never be able to catch up with real time this way, since it's much slower for a server to have to request missed data with
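A sketch of the 'least-recently-updated rooms first' selection, reusing the illustrative tables from the previous sketch (again, not Synapse's actual query):

```python
def rooms_to_catch_up(conn, destination, batch_size=10):
    """Pick the rooms whose newest unsent event is oldest, so the most
    out-of-date rooms are caught up first on each transaction attempt."""
    return conn.execute(
        """
        SELECT p.room_id, p.stream_ordering
        FROM destination_pending_rooms AS p
        JOIN destination_progress AS d USING (destination)
        WHERE p.destination = ?
          AND p.stream_ordering > d.last_successful_stream_ordering
        ORDER BY p.stream_ordering ASC
        LIMIT ?
        """,
        (destination, batch_size),
    ).fetchall()
```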
This morning: more Sygnal#130 (HTTP proxy) rework; I feel it's straightening out a lot, so hopefully it'll be back in the queue soon. Today: more of that, plus catching up on #2528 and #7828, which Riiich has suggested solving first. (matrix-org/synapse#2528: Homeservers don't catch up with missed traffic until someone sends another event; matrix-org/synapse#7828: in-memory federation transaction transmission queues build up indefinitely for offline servers.)
One note that may be worth considering (but probably not for now)
There is a little issue with this, in that a homeserver may miss events for which notifications should be generated, e.g. mentions. Perhaps this sounds silly for a month-long outage, but if we take the example of an outage of a few hours in a moderately busy room, it wouldn't sound silly to me to say that there is a reliability problem here if I didn't get notified when someone mentioned me.
I think this is fixed by #8272 and previous PRs.
If a homeserver goes offline, it may miss events. There is then no reliable means for it to catch up. Currently we rely on someone else sending an event in the room, which will then make it do a backfill.
Potential solutions include: