This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

processing send_join on a large room is extremely inefficient #3495

Open
richvdh opened this issue Jul 9, 2018 · 6 comments
Labels
A-Federated-Join joins over federation generally suck A-Performance Performance, both client-facing and admin-facing O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Member

richvdh commented Jul 9, 2018

We received a send_join request over federation:

2018-07-09 11:21:46,361 - synapse.access.http.8080 - 144 - INFO - PUT-16391990 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 648.081sec (43.848sec, 1.580sec) (0.059sec/247.002sec/249) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:09,886 - synapse.access.http.8080 - 144 - INFO - PUT-16396650 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 611.176sec (38.560sec, 1.028sec) (0.042sec/184.576sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:31,112 - synapse.access.http.8080 - 144 - INFO - PUT-16401662 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 571.488sec (36.464sec, 1.072sec) (0.046sec/125.630sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]
2018-07-09 11:22:53,320 - synapse.access.http.8080 - 144 - INFO - PUT-16406605 - xxx.xxx.xxx.xxx - 8080 - {xxx} Processed request: 531.009sec (34.356sec, 0.712sec) (0.043sec/104.011sec/244) 0B 200 "PUT /_matrix/federation/v1/send_join/%21jOGhaTPuvWFnrWPrzv%3Amatrix.org/%241531134658158ydvgN%3Axxx HTTP/1.1" "Synapse/0.32.0" [239401 dbevts]

While processing these requests, the synapse master stopped logging anything for almost 60 seconds (twice); slave replication stopped, and request processing time went through the roof. Metrics suggest that the CPU was saturated with calls to _get_event_from_row. The 4 × 239,401 (≈960,000) events shown in the logs are reflected in the number of calls to _get_event_from_row.

There are several problems here.

Firstly, could we not deduplicate these four requests at the transaction level? They all carry the same event ID, so the work only needs to be done once.

Secondly, why does each request lead to pulling 239401 events out of the database? The total room state in this room is only 120478 events: are we fetching the membership list twice, and if so, why isn't the event cache deduplicating them?

Thirdly, we are presumably requesting the same 239401 events for each of the four requests: can we not deduplicate these?

Finally, and related to the above, when fetching events, we first check the cache, then schedule a db fetch. By the time the db fetch happens, it may be entirely or partially redundant (if other threads have already fetched the relevant events), but we plough ahead anyway.
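The last point can be sketched as a cache re-check just before the scheduled fetch runs. This is a hypothetical asyncio illustration (not Synapse's actual fetch path); the `await asyncio.sleep(0)` stands in for the gap between scheduling the db fetch and it actually executing:

```python
import asyncio


async def get_events(event_ids, cache, db_fetch):
    """Hypothetical sketch: re-check the cache immediately before the
    database fetch, so events fetched by concurrent requests in the
    meantime are not fetched again."""
    # First pass: note which events the cache cannot serve.
    missing = [eid for eid in event_ids if eid not in cache]
    if missing:
        # Yield to the event loop, modelling the delay before the
        # scheduled db fetch actually runs.
        await asyncio.sleep(0)
        # Re-check: a concurrent request may have populated the cache.
        still_missing = [eid for eid in missing if eid not in cache]
        if still_missing:
            cache.update(await db_fetch(still_missing))
    return {eid: cache[eid] for eid in event_ids}
```

The difference from ploughing ahead is the `still_missing` filter: anything another request fetched during the wait is dropped from the query.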

@richvdh richvdh changed the title processing send_join on a large room is extremely inefficient processing send_join on a large room is extremely inefficient and wedges synapse master Jul 9, 2018
@richvdh
Member Author

richvdh commented Jul 9, 2018

Would it be better to maintain a queue of events we're trying to fetch, with a record of the deferreds which are waiting for them, and then just process them in lumps of a few hundred at a time? It would provide deduplication and would allow other things to carry on happening while the large fetch takes place.
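The queue described above can be sketched as follows. This is a hypothetical asyncio illustration (Synapse uses Twisted deferreds; `BatchedEventFetcher` and `db_fetch` are made-up names): callers register the IDs they need, a single worker drains the queue in lumps of `batch_size`, and every waiter for an ID is woken when its event arrives:

```python
import asyncio


class BatchedEventFetcher:
    """Hypothetical sketch: a shared fetch queue with per-event waiters,
    drained a batch at a time so other work can interleave."""

    def __init__(self, db_fetch, batch_size=200):
        self._db_fetch = db_fetch
        self._batch_size = batch_size
        self._waiters = {}  # event_id -> asyncio.Future
        self._queue = []    # event_ids awaiting fetch, in arrival order
        self._worker = None

    async def get_event(self, event_id):
        fut = self._waiters.get(event_id)
        if fut is None:
            # First request for this id: queue it and record the waiter.
            fut = asyncio.get_running_loop().create_future()
            self._waiters[event_id] = fut
            self._queue.append(event_id)
        if self._worker is None or self._worker.done():
            self._worker = asyncio.ensure_future(self._drain())
        return await fut

    async def _drain(self):
        while self._queue:
            # Take the next lump; awaiting the db between lumps lets
            # other tasks run while a large fetch is in progress.
            batch = self._queue[: self._batch_size]
            self._queue = self._queue[self._batch_size:]
            rows = await self._db_fetch(batch)
            for eid in batch:
                self._waiters.pop(eid).set_result(rows.get(eid))
```

Because an ID is only queued once, concurrent requests for overlapping event sets are deduplicated for free, which addresses the third point in the issue as well.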

@richvdh richvdh added the A-Performance Performance, both client-facing and admin-facing label Jul 9, 2018
@turt2live
Member

semi-related: #3013

@erikjohnston
Member

Secondly, why does each request lead to pulling 239401 events out of the database? The total room state in this room is only 120478 events: are we fetching the membership list twice, and if so, why isn't the event cache deduplicating them?

We also return the full auth chain, which increases the number of events.

Thirdly, we are presumably requesting the same 239401 events for each of the four requests: can we not deduplicate these?

The event cache size on matrix.org is 200K, so these requests would blow the caches
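The cache-blowing effect is worth spelling out: sequentially scanning a working set even slightly larger than an LRU cache is the worst case, yielding a zero hit rate. A scaled-down, hypothetical illustration (240 events through a 200-entry cache, standing in for ~240K events through the 200K event cache):

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache with hit/miss counters (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=200)          # stand-in for the 200K event cache
ids = [f"$ev{i}" for i in range(240)]   # working set larger than the cache

# Two passes over the same events, as two send_join requests would do.
for _request in range(2):
    for eid in ids:
        if cache.get(eid) is None:
            cache.put(eid, object())
```

After both passes, every single lookup has missed: by the time the second pass reaches an event, it has already been evicted to make room for later ones. This is why repeated send_join requests over the same large room get no help from the cache.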

@richvdh
Member Author

richvdh commented Oct 20, 2020

this is probably better now? at least it's not wedging the main process?

@richvdh richvdh closed this as completed Jan 14, 2021
@richvdh richvdh reopened this Jan 14, 2021
@richvdh richvdh changed the title processing send_join on a large room is extremely inefficient and wedges synapse master processing send_join on a large room is extremely inefficient Jan 14, 2021
@richvdh
Member Author

richvdh commented Jan 14, 2021

@erikjohnston erikjohnston added S-Major Major functionality / product severely impaired, no satisfactory workaround. O-Uncommon Most users are unlikely to come across this or unexpected workflow A-Federated-Join joins over federation generally suck and removed z-p1 z-major (Deprecated Label) labels Sep 7, 2022
@DMRobertson DMRobertson added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Sep 7, 2022
@richvdh
Member Author

richvdh commented Oct 4, 2022

This will be mitigated by MSC3706, for servers that choose to use it.
