Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is #9651

Closed
ShadowJonathan opened this issue Mar 19, 2021 · 4 comments

Comments

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Mar 19, 2021

While working on #9639 and thinking about how certain aspects of it actually work, I have a feeling that entries destination_rooms are not properly persisted for what it is supposed to do.

Currently, store_destination_rooms_entries is called in _send_pdu, which in effect is called in the process_event_queue_for_federation loop on every event that is being batched to be sent.

However, this store happens before the events are distributed to the individual per_destination transmission loops, which means it is possible for these loops to encounter that the remote is not available, and so destination_rooms will be updated with a stream_id from unsent events.

This would cause the catch-up transmission loop to miss some (or all) events that need to be sent to that destination if the destination is soon retried after the call to store_destination_rooms_entries, with events that were "marked" during that call now not being sent to the destination.

My suggestion would be to move the store_destination_rooms_entries task (or a simpler version of it) to _TransactionQueueManager.__aexit__, where set_destination_last_successful_stream_ordering is already called to set the destinations table (but not destination_rooms), and to do this in a way that it is called once per successful sent batch of events (for all rooms).

This would also maybe require the pending_pdus list to be sorted based on stream_id (and optionally maybe deduplicated, i.e. an "ordered set"), to ensure that only the latest stream_id per room is persisted to the database, and the new store_destination_rooms_entries task need to only update the stream_id, if event.stream_id > database_entry.stream_id.

cc @erikjohnston

@ShadowJonathan ShadowJonathan changed the title destination_rooms in correlation to catch-up mechanic would likely not properly work How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is Mar 19, 2021
@ShadowJonathan
Copy link
Contributor Author

...now i'm a bit confused, as I just looked at _get_catch_up_outstanding_destinations_txn, which does stream_ordering > last_successful_stream_ordering.

last_successful_stream_ordering is set during _TransactionQueueManager.__aexit__, when a transaction with another destination has finished, and stream_ordering is (currently) set during every process_event_queue_for_federation loop.

Did i misunderstand destination_rooms? Is it a "marker" for which (server, room) pair has had the last event sent to the room, per server? If so, then this issue is void, as then I have misunderstood how it worked. (if so, that doesn't affect #9639, as that PR is a simple optimisation to compress a call per many events per 1 room to 1 call per 1 room at the end of the event stream)

@erikjohnston
Copy link
Member

So destination_rooms is the last event in a room that should be sent to the server, and so we can get the set of rooms that have unsent events by doing stream_ordering > last_successful_stream_ordering. (There is a race where we do send an event but Synapse dies before persisting last_successful_stream_ordering, but that is OK)

@richvdh
Copy link
Member

richvdh commented Apr 6, 2021

Can someone summarise this issue please? I'm unclear on what's a real issue and what is speculation/confusion.

@ShadowJonathan
Copy link
Contributor Author

Yeah this issue is a bit convoluted, it was made in the middle of me trying to figure out federation speedup.

I'll close it, but TL;DR: It's me thinking about destination_rooms the wrong way, i thought that it'd do federation catchup incorrectly, and likely would've made some servers miss out on some events, but then i discovered destination_rooms worked differently.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants