You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
While working on #9639 and thinking about how certain aspects of it actually work, I have a feeling that entries destination_rooms are not properly persisted for what it is supposed to do.
However, this store happens before the events are distributed to the individual per_destination transmission loops, which means it is possible for these loops to encounter that the remote is not available, and so destination_rooms will be updated with a stream_id from unsent events.
This would cause the catch-up transmission loop to miss some (or all) events that need to be sent to that destination if the destination is soon retried after the call to store_destination_rooms_entries, with events that were "marked" during that call now not being sent to the destination.
My suggestion would be to move the store_destination_rooms_entries task (or a simpler version of it) to _TransactionQueueManager.__aexit__, where set_destination_last_successful_stream_ordering is already called to set the destinations table (but not destination_rooms), and to do this in a way that it is called once per successful sent batch of events (for all rooms).
This would also maybe require the pending_pdus list to be sorted based on stream_id (and optionally maybe deduplicated, i.e. an "ordered set"), to ensure that only the latest stream_id per room is persisted to the database, and the new store_destination_rooms_entries task need to only update the stream_id, if event.stream_id > database_entry.stream_id.
The text was updated successfully, but these errors were encountered:
ShadowJonathan
changed the title
destination_rooms in correlation to catch-up mechanic would likely not properly work
How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is
Mar 19, 2021
last_successful_stream_ordering is set during _TransactionQueueManager.__aexit__, when a transaction with another destination has finished, and stream_ordering is (currently) set during every process_event_queue_for_federation loop.
Did i misunderstand destination_rooms? Is it a "marker" for which (server, room) pair has had the last event sent to the room, per server? If so, then this issue is void, as then I have misunderstood how it worked. (if so, that doesn't affect #9639, as that PR is a simple optimisation to compress a call per many events per 1 room to 1 call per 1 room at the end of the event stream)
So destination_rooms is the last event in a room that should be sent to the server, and so we can get the set of rooms that have unsent events by doing stream_ordering > last_successful_stream_ordering. (There is a race where we do send an event but Synapse dies before persisting last_successful_stream_ordering, but that is OK)
Yeah this issue is a bit convoluted, it was made in the middle of me trying to figure out federation speedup.
I'll close it, but TL;DR: It's me thinking about destination_rooms the wrong way, i thought that it'd do federation catchup incorrectly, and likely would've made some servers miss out on some events, but then i discovered destination_rooms worked differently.
While working on #9639 and thinking about how certain aspects of it actually work, I have a feeling that entries
destination_rooms
are not properly persisted for what it is supposed to do.Currently,
store_destination_rooms_entries
is called in_send_pdu
, which in effect is called in theprocess_event_queue_for_federation
loop on every event that is being batched to be sent.However, this store happens before the events are distributed to the individual per_destination transmission loops, which means it is possible for these loops to encounter that the remote is not available, and so
destination_rooms
will be updated with a stream_id from unsent events.This would cause the catch-up transmission loop to miss some (or all) events that need to be sent to that destination if the destination is soon retried after the call to
store_destination_rooms_entries
, with events that were "marked" during that call now not being sent to the destination.My suggestion would be to move the
store_destination_rooms_entries
task (or a simpler version of it) to_TransactionQueueManager.__aexit__
, whereset_destination_last_successful_stream_ordering
is already called to set thedestinations
table (but notdestination_rooms
), and to do this in a way that it is called once per successful sent batch of events (for all rooms).This would also maybe require the
pending_pdus
list to be sorted based on stream_id (and optionally maybe deduplicated, i.e. an "ordered set"), to ensure that only the latest stream_id per room is persisted to the database, and the newstore_destination_rooms_entries
task need to only update thestream_id
, ifevent.stream_id > database_entry.stream_id
.cc @erikjohnston
The text was updated successfully, but these errors were encountered: