processing send_join on a large room is extremely inefficient #3495

matrixbot · 2023-12-16T12:37:05Z

This issue has been migrated from #3495.

We received a send_join request over federation:

2018-07-09 11:21:46,361 - synapse.access.http.8080 - 144 - INFO - PUT-16391990 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 648.081sec (43.848sec, 1.580sec) (0.059sec/247.002sec/249) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:09,886 - synapse.access.http.8080 - 144 - INFO - PUT-16396650 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 611.176sec (38.560sec, 1.028sec) (0.042sec/184.576sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:31,112 - synapse.access.http.8080 - 144 - INFO - PUT-16401662 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 571.488sec (36.464sec, 1.072sec) (0.046sec/125.630sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:53,320 - synapse.access.http.8080 - 144 - INFO - PUT-16406605 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 531.009sec (34.356sec, 0.712sec) (0.043sec/104.011sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]

While processing these requests, the synapse master stopped logging anything for almost 60 seconds (twice); slave replication stopped, and request processing time went through the roof. Metrics suggest that the CPU was saturated with calls to _get_event_from_row. The 4x239401 (~=950000) events shown in the logs are reflected in the number of calls to _get_event_from_row.

There are several problems here.

Firstly, could we not deduplicate these four requests at the transaction level? They all have the same event id, so we could save ourselves a bunch of effort by deduplicating.

Secondly, why does each request lead to pulling 239401 events out of the database? The total room state in this room is only 120478 events: are we fetching the membership list twice, and if so, why isn't the event cache deduplicating them?

Thirdly, we are presumably requesting the same 239401 events for each of the four requests: can we not deduplicate these?

Finally, and related to the above, when fetching events, we first check the cache, then schedule a db fetch. By the time the db fetch happens, it may be entirely or partially redundant (if other threads have already fetched the relevant events), but we plough ahead anyway.

The text was updated successfully, but these errors were encountered:

matrixbot closed this as completed Dec 16, 2023

matrixbot changed the title ~~Dummy issue~~ processing send_join on a large room is extremely inefficient Dec 21, 2023

matrixbot added A-Federated-Join A-Performance S-Major O-Uncommon T-Defect labels Dec 21, 2023

matrixbot reopened this Dec 21, 2023

matrixbot mentioned this issue Dec 21, 2023

joining a large room completely hosed a homeserver #3527

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

processing send_join on a large room is extremely inefficient #3495

processing send_join on a large room is extremely inefficient #3495

matrixbot commented Dec 16, 2023 •

edited

Loading

processing send_join on a large room is extremely inefficient #3495

processing send_join on a large room is extremely inefficient #3495

Comments

matrixbot commented Dec 16, 2023 • edited Loading

matrixbot commented Dec 16, 2023 •

edited

Loading