Increased cpu/gc usage in event persistence workers in 1.91.0rc1 #16190
Comments
@jameskitt616 this started specifically on 1.91.0rc1, your issue is likely unrelated (might be #16101 considering the network traffic, but you should use Synapse's Prometheus metrics and Grafana dashboard to confirm)
Any particular jump in the Looking at matrix.org I'm seeing decreased CPU usage on event persistors starting from the update.
Python version has been 3.11 for a while now, so there shouldn't be any changes there. Nothing jumps out in per-block metrics; there was a spike soon after the upgrade, but it was fairly short. GC from the same period (restarted the two workers when I made this issue, the times went back down and started climbing again):
@MatMaul I have presence disabled here and see the issue.
Tricky to advise without being able to reproduce it on our end. If anyone is able to, it would be super useful to bisect through Synapse versions.
I saw the misbehavior right after upgrading from 1.90.0 to version 1.91.0rc1. I think as soon as matrix.org upgrades too, the same behavior will be observed there.
Sorry, I misspoke: I should have said "bisect through Synapse commits" instead of "versions". Matrix.org is running v1.91.0rc1, except that its version string has not been updated to reflect this, and per #16190 (comment) we do not see the same behaviour.
I also saw increasing CPU usage and disk IO, but already since 1.90 stable. I didn't upgrade to 1.91rc. After I added this to homeserver.yml I instantly saw a positive impact. I don't know if this is similar to the issue on 1.91rc.
Other avenues of investigation:
We're possibly getting hit by this in EMS as well for a lot of hosts, and have proceeded with a rollback out of caution. Some examples with internal host IDs:
GC for the same host: Host GC has the same trend: It does not look like all hosts have suffered in the same way. Some hosts with an event persister have their usual metrics pattern in event send times, CPU and GC.
I'm also seeing this pattern on some hosts without an event persister, for example Event send time also regressed but CPU did not: This Synapse deployment has no workers, and on EMS we've seen quite a few deployments without workers alert on event send times regressing.
This wasn't the whole truth 😞; we have a working hypothesis as to the cause. Current thinking is that #16220 will fix this. If we can verify this I think we'll likely put out a bugfix release early next week.
Cherry-picked that onto maunium.net and it seems to be fixed, GC frequency of the event persisters has stayed relatively flat at around 0.2Hz for 2 hours now
Thanks for cherry-picking and confirming @tulir. Let's consider this closed by that change. I will prepare a 1.91.1 patch release now.
Description
After updating to 1.91.0rc1, the CPU usage of my event persisters started growing steadily. I can't find any obvious reason for it, but the GC charts are also way up
I just restarted the workers and the CPU usage went back down to normal. Will see tomorrow if it starts climbing again
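For anyone who wants a rough picture of what "GC frequency" means here without the Grafana dashboard, a minimal sketch using Python's standard `gc.callbacks` hook follows. This is not Synapse's own metrics code; the `GcFrequencyTracker` class and its method names are hypothetical, made up for illustration only.

```python
import gc
import time
from collections import defaultdict


class GcFrequencyTracker:
    """Hypothetical helper: counts completed GC runs per generation."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.started_at = time.monotonic()

    def _on_gc(self, phase, info):
        # The interpreter calls this with phase "start" or "stop";
        # only completed collections are counted.
        if phase == "stop":
            self.counts[info["generation"]] += 1

    def install(self):
        gc.callbacks.append(self._on_gc)

    def uninstall(self):
        gc.callbacks.remove(self._on_gc)

    def frequency_hz(self, generation):
        # Average collections per second since the tracker was created.
        elapsed = time.monotonic() - self.started_at
        return self.counts[generation] / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    tracker = GcFrequencyTracker()
    tracker.install()
    gc.collect()  # force one full collection so there is something to count
    tracker.uninstall()
    print(dict(tracker.counts), tracker.frequency_hz(2))
```

A steadily rising value from something like `frequency_hz` on a long-running worker would correspond to the climbing GC charts described above.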
Steps to reproduce
Homeserver
maunium.net
Synapse Version
1.91.0rc1
Database
PostgreSQL
Workers
Multiple workers
Platform
Custom docker image