How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is #9651

ShadowJonathan · 2021-03-19T16:37:58Z

While working on #9639 and thinking about how certain aspects of it actually work, I have a feeling that entries destination_rooms are not properly persisted for what it is supposed to do.

Currently, store_destination_rooms_entries is called in _send_pdu, which in effect is called in the process_event_queue_for_federation loop on every event that is being batched to be sent.

However, this store happens before the events are distributed to the individual per_destination transmission loops, which means it is possible for these loops to encounter that the remote is not available, and so destination_rooms will be updated with a stream_id from unsent events.

This would cause the catch-up transmission loop to miss some (or all) events that need to be sent to that destination if the destination is soon retried after the call to store_destination_rooms_entries, with events that were "marked" during that call now not being sent to the destination.

My suggestion would be to move the store_destination_rooms_entries task (or a simpler version of it) to _TransactionQueueManager.__aexit__, where set_destination_last_successful_stream_ordering is already called to set the destinations table (but not destination_rooms), and to do this in a way that it is called once per successful sent batch of events (for all rooms).

This would also maybe require the pending_pdus list to be sorted based on stream_id (and optionally maybe deduplicated, i.e. an "ordered set"), to ensure that only the latest stream_id per room is persisted to the database, and the new store_destination_rooms_entries task need to only update the stream_id, if event.stream_id > database_entry.stream_id.

cc @erikjohnston

The text was updated successfully, but these errors were encountered:

ShadowJonathan · 2021-03-19T16:57:44Z

...now i'm a bit confused, as I just looked at _get_catch_up_outstanding_destinations_txn, which does stream_ordering > last_successful_stream_ordering.

last_successful_stream_ordering is set during _TransactionQueueManager.__aexit__, when a transaction with another destination has finished, and stream_ordering is (currently) set during every process_event_queue_for_federation loop.

Did i misunderstand destination_rooms? Is it a "marker" for which (server, room) pair has had the last event sent to the room, per server? If so, then this issue is void, as then I have misunderstood how it worked. (if so, that doesn't affect #9639, as that PR is a simple optimisation to compress a call per many events per 1 room to 1 call per 1 room at the end of the event stream)

erikjohnston · 2021-03-22T15:38:36Z

So destination_rooms is the last event in a room that should be sent to the server, and so we can get the set of rooms that have unsent events by doing stream_ordering > last_successful_stream_ordering. (There is a race where we do send an event but Synapse dies before persisting last_successful_stream_ordering, but that is OK)

richvdh · 2021-04-06T15:53:40Z

Can someone summarise this issue please? I'm unclear on what's a real issue and what is speculation/confusion.

ShadowJonathan · 2021-04-06T16:09:21Z

Yeah this issue is a bit convoluted, it was made in the middle of me trying to figure out federation speedup.

I'll close it, but TL;DR: It's me thinking about destination_rooms the wrong way, i thought that it'd do federation catchup incorrectly, and likely would've made some servers miss out on some events, but then i discovered destination_rooms worked differently.

ShadowJonathan changed the title ~~destination_rooms in correlation to catch-up mechanic would likely not properly work~~ How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is Mar 19, 2021

ShadowJonathan mentioned this issue Mar 26, 2021

Experimental Federation Speedup #9702

Merged

4 tasks

ShadowJonathan closed this as completed Apr 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is #9651

How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is #9651

ShadowJonathan commented Mar 19, 2021 •

edited

Loading

ShadowJonathan commented Mar 19, 2021

erikjohnston commented Mar 22, 2021

richvdh commented Apr 6, 2021

ShadowJonathan commented Apr 6, 2021

How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is #9651

How destination_rooms is currently persisted in correlation to the per-destination catch-up mechanic would likely not properly work as-is #9651

Comments

ShadowJonathan commented Mar 19, 2021 • edited Loading

ShadowJonathan commented Mar 19, 2021

erikjohnston commented Mar 22, 2021

richvdh commented Apr 6, 2021

ShadowJonathan commented Apr 6, 2021

ShadowJonathan commented Mar 19, 2021 •

edited

Loading