This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

processing send_join on a large room is extremely inefficient #3495

Open
richvdh opened this issue Jul 9, 2018 · 6 comments
Labels
A-Federated-Join joins over federation generally suck A-Performance Performance, both client-facing and admin-facing O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Member

richvdh commented Jul 9, 2018

We received a send_join request over federation:

2018-07-09 11:21:46,361 - synapse.access.http.8080 - 144 - INFO - PUT-16391990 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 648.081sec (43.848sec, 1.580sec) (0.059sec/247.002sec/249) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:09,886 - synapse.access.http.8080 - 144 - INFO - PUT-16396650 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 611.176sec (38.560sec, 1.028sec) (0.042sec/184.576sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:31,112 - synapse.access.http.8080 - 144 - INFO - PUT-16401662 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 571.488sec (36.464sec, 1.072sec) (0.046sec/125.630sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:53,320 - synapse.access.http.8080 - 144 - INFO - PUT-16406605 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 531.009sec (34.356sec, 0.712sec) (0.043sec/104.011sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]

While processing these requests, the synapse master stopped logging anything for almost 60 seconds (twice); slave replication stopped, and request processing time went through the roof. Metrics suggest that the CPU was saturated with calls to _get_event_from_row. The 4 × 239,401 (≈960,000) events shown in the logs are reflected in the number of calls to _get_event_from_row.

There are several problems here.

Firstly, could we not deduplicate these four requests at the transaction level? They all carry the same event ID, so the work only needs to be done once.

Secondly, why does each request lead to pulling 239401 events out of the database? The total room state in this room is only 120478 events: are we fetching the membership list twice, and if so, why isn't the event cache deduplicating them?

Thirdly, we are presumably requesting the same 239401 events for each of the four requests: can we not deduplicate these?

Finally, and related to the above, when fetching events, we first check the cache, then schedule a db fetch. By the time the db fetch happens, it may be entirely or partially redundant (if other threads have already fetched the relevant events), but we plough ahead anyway.
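The last point can be sketched as a cache re-check just before the scheduled fetch runs. This is a hypothetical asyncio illustration (not Synapse's actual fetch path); the `await asyncio.sleep(0)` stands in for the gap between scheduling the db fetch and it actually executing:

```python
import asyncio


async def get_events(event_ids, cache, db_fetch):
    """Hypothetical sketch: re-check the cache immediately before the
    database fetch, so events fetched by concurrent requests in the
    meantime are not fetched again."""
    # First pass: note which events the cache cannot serve.
    missing = [eid for eid in event_ids if eid not in cache]
    if missing:
        # Yield to the event loop, modelling the delay before the
        # scheduled db fetch actually runs.
        await asyncio.sleep(0)
        # Re-check: a concurrent request may have populated the cache.
        still_missing = [eid for eid in missing if eid not in cache]
        if still_missing:
            cache.update(await db_fetch(still_missing))
    return {eid: cache[eid] for eid in event_ids}
```

The difference from ploughing ahead is the `still_missing` filter: anything another request fetched during the wait is dropped from the query.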

@richvdh richvdh changed the title processing send_join on a large room is extremely inefficient processing send_join on a large room is extremely inefficient and wedges synapse master Jul 9, 2018
@richvdh
Member Author

richvdh commented Jul 9, 2018

Would it be better to maintain a queue of events we're trying to fetch, with a record of the deferreds which are waiting for them, and then just process them in lumps of a few hundred at a time? It would provide deduplication and would allow other things to carry on happening while the large fetch takes place.
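The queue described above can be sketched as follows. This is a hypothetical asyncio illustration (Synapse uses Twisted deferreds; `BatchedEventFetcher` and `db_fetch` are made-up names): callers register the IDs they need, a single worker drains the queue in lumps of `batch_size`, and every waiter for an ID is woken when its event arrives:

```python
import asyncio


class BatchedEventFetcher:
    """Hypothetical sketch: a shared fetch queue with per-event waiters,
    drained a batch at a time so other work can interleave."""

    def __init__(self, db_fetch, batch_size=200):
        self._db_fetch = db_fetch
        self._batch_size = batch_size
        self._waiters = {}  # event_id -> asyncio.Future
        self._queue = []    # event_ids awaiting fetch, in arrival order
        self._worker = None

    async def get_event(self, event_id):
        fut = self._waiters.get(event_id)
        if fut is None:
            # First request for this id: queue it and record the waiter.
            fut = asyncio.get_running_loop().create_future()
            self._waiters[event_id] = fut
            self._queue.append(event_id)
        if self._worker is None or self._worker.done():
            self._worker = asyncio.ensure_future(self._drain())
        return await fut

    async def _drain(self):
        while self._queue:
            # Take the next lump; awaiting the db between lumps lets
            # other tasks run while a large fetch is in progress.
            batch = self._queue[: self._batch_size]
            self._queue = self._queue[self._batch_size:]
            rows = await self._db_fetch(batch)
            for eid in batch:
                self._waiters.pop(eid).set_result(rows.get(eid))
```

Because an ID is only queued once, concurrent requests for overlapping event sets are deduplicated for free, which addresses the third point in the issue as well.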

@richvdh richvdh added the A-Performance Performance, both client-facing and admin-facing label Jul 9, 2018
@turt2live
Member

semi-related: #3013

@erikjohnston
Member

Secondly, why does each request lead to pulling 239401 events out of the database? The total room state in this room is only 120478 events: are we fetching the membership list twice, and if so, why isn't the event cache deduplicating them?

We also return the full auth chain, which increases the number of events.

Thirdly, we are presumably requesting the same 239401 events for each of the four requests: can we not deduplicate these?

The event cache size on matrix.org is 200K, so these requests would blow the caches
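The cache-blowing effect is worth spelling out: sequentially scanning a working set even slightly larger than an LRU cache is the worst case, yielding a zero hit rate. A scaled-down, hypothetical illustration (240 events through a 200-entry cache, standing in for ~240K events through the 200K event cache):

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache with hit/miss counters (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=200)          # stand-in for the 200K event cache
ids = [f"$ev{i}" for i in range(240)]   # working set larger than the cache

# Two passes over the same events, as two send_join requests would do.
for _request in range(2):
    for eid in ids:
        if cache.get(eid) is None:
            cache.put(eid, object())
```

After both passes, every single lookup has missed: by the time the second pass reaches an event, it has already been evicted to make room for later ones. This is why repeated send_join requests over the same large room get no help from the cache.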

@richvdh
Member Author

richvdh commented Oct 20, 2020

this is probably better now? at least it's not wedging the main process?

@richvdh richvdh closed this as completed Jan 14, 2021
@richvdh richvdh reopened this Jan 14, 2021
@richvdh richvdh changed the title processing send_join on a large room is extremely inefficient and wedges synapse master processing send_join on a large room is extremely inefficient Jan 14, 2021
@richvdh
Member Author

richvdh commented Jan 14, 2021

@erikjohnston erikjohnston added S-Major Major functionality / product severely impaired, no satisfactory workaround. O-Uncommon Most users are unlikely to come across this or unexpected workflow A-Federated-Join joins over federation generally suck and removed z-p1 z-major (Deprecated Label) labels Sep 7, 2022
@DMRobertson DMRobertson added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Sep 7, 2022
@richvdh
Member Author

richvdh commented Oct 4, 2022

This will be mitigated by MSC3706, for servers that choose to use it.
