memory leak since 1.53.0 #12160
Hi, thanks for your report! Out of interest, where is this graph from? The units are in K but I assume that's meant to be G. Do you happen to have metrics set up? The graphs there would give us some indication as to where the memory is being used. You could try reducing the cache factor, but if you've only needed 3 GB so far then I'm not sure this would help, as it sounds like there's something wrong. Are you able to downgrade and try a few versions to see which one introduces the problem?
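To spell out the "cache factor" suggestion for anyone else landing here: it is controlled by the global_factor setting under caches in homeserver.yaml (or the SYNAPSE_CACHE_FACTOR environment variable). A minimal sketch, with a deliberately illustrative value rather than a recommendation:

    # homeserver.yaml — sketch only; pick a value below your current factor
    caches:
      global_factor: 0.25    # scales the size of every LRU cache

If a single cache turns out to be the culprit, per_cache_factors allows overriding just that one instead of shrinking everything.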
This is shinken + graphite; the unit should be G, not K, yes :) We don't have Grafana, so metrics are not enabled. No, I can't do it via apt because older versions are not offered:
I just downgraded to 1.50.2; let's see if the memory leak is still there.
Memory increases very regularly, even during the night with no or little activity. Could it be a cleanup process?
I'm not entirely sure if this is related, but the
I added the expiry_time flag for testing purposes these last days, but I didn't have it before I noticed the problem.
Hi there, thanks for getting back to us. One thing:
And just to be crystal clear, the
Retried with 1.54.0: same problem. We don't really use our bots, so I can stay like this, but this could be problematic on some installations (or with nasty clients) and could be used for a DoS. I can see an error in the logs, but I don't think it is related, since this error already happened on 1.52.0.
Louis
Are these
Yes, maubot is doing a lot of sync queries.
@lchanouha I'm guessing the spiky part of the graph from 2022-03-02 to 2022-03-08 is Synapse 1.53/1.54?
Are all of matrix.org's caches configured with time-based expiry?
I think so?
There appear to be three caches that are not subject to time-based expiry in Synapse:
In the case of matrix.org,
The only change to these caches that I can see in 1.53 is that we reduced the frequency at which
@lchanouha Would you be willing to try setting the
@squahtx that's done, let's see the behaviour over the next days.
Here is the updated graph (the service has been running since Sat 2022-03-12 18:43:33 CET; 3 weeks 5 days ago, v1.54.0) and uses 4.1G. We may have a ceiling here, but 5G of cache would be very high on our little installation.
We have an experimental option to track the memory usage of caches, though it does add load to the server, so I'm not sure it's feasible for you to leave it running. If you want to try it then add the following to your config:

    caches:
      track_memory_usage: true
      # .. rest of the caches config

You'll need to install the
This will add the following Prometheus metric
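Putting the pieces of this suggestion together, here is a sketch of what the change could look like on the apt/virtualenv layout shown later in this issue (the pympler dependency and the venv path are assumptions on my part, not confirmed above):

    # homeserver.yaml — sketch, experimental option
    caches:
      track_memory_usage: true   # per-cache memory accounting; adds CPU overhead
      # .. rest of the caches config

After installing the extra dependency (something like /opt/venvs/matrix-synapse/bin/pip install pympler on this kind of install) and restarting Synapse, the per-cache memory figures should appear alongside the other cache metrics on the Prometheus endpoint.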
Is this still an issue/has this been resolved?
We are still seeing this. We upgraded to
@byanes what version are you on now though? The current version is 1.60.0.
@aaronraimist We haven't upgraded off of 1.54.0 yet. Is there anything specific that you think might have fixed this? We can try upgrading later today. If it helps, below is what we see day-to-day. We don't have very many active users, but they are generally active between 7am and 4pm ET every weekday. If we don't bounce the container, it will consume all 8G of memory in 7-8 days.
@tomsisk I'm not the developer, so I don't know if specific changes fixed it, but saying "We are still seeing this" while 6 versions behind (technically 11 counting patch releases) doesn't seem like a very helpful comment. I would recommend you at least see if the current version improves it and, if not, enable the option in #12160 (comment) and provide more information so they can fix it. (I'm assuming you're operating the same server as @byanes?) Just glancing at the changelog, there are several changes that specifically call out reducing memory usage.
Our last upgrade unleashed a pretty large memory leak, so we're not exactly excited to live on the bleeding edge when there isn't a lot of time to commit to this at the moment. You have a point that we haven't done anything, but I think @byanes's comment was just so this didn't get blindly closed out as fixed when the actual issue was never diagnosed in the first place. We'll try upgrading and see where that gets us.
Looks like this is still an issue in 1.60.0. I'll try to turn on
After digging deeper, it turns out there's a bug in how Synapse counts pushers for the metric. Whenever the Element mobile apps are launched, they update their pushers on the server using
Thanks for bringing the bug to the attention of the Synapse team. It's being tracked as #13295.
The fixes for #13282 and #13286 will land in the 1.64.0 Docker image (rc0 on 2022-07-26 and release on 2022-08-02).
The 1.64.0 Docker image has now been released, which includes fixes for the two issues above. @tomsisk @felixdoerre Could you upgrade if possible and report back in a few days on whether things have improved?
We will upgrade later today and let you know.
It's too early to tell for us whether this is fixed, mainly because 1.64.0 introduced something new: when presence is enabled, we are seeing mass memory consumption in a very short period. FWIW, we upgraded from 1.60.0. The first increase this morning was when users started to come online. The highlighted area is when we disabled presence. I re-enabled just to ensure this was the issue and it consumed over 5G in 10 minutes. I want to note also that we're using a pretty old version of the JavaScript Client SDK. I don't know if that could be contributing to the extremes we're seeing, but it definitely feels like there was a regression in a recent version.
@tomsisk Oh that's annoying. We did change a few things with presence in 1.64.0 but I've not seen it blow up like that. Can you share the cache size metrics? And is there anything consuming a large chunk of CPU (e.g. requests, background processes, per-block metrics, etc)?
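For completeness, the cache-size metrics being asked for come from Synapse's Prometheus endpoint, which has to be switched on explicitly. A sketch of the relevant homeserver.yaml bits, assuming the standard metrics listener (the port and bind address here are arbitrary, and the entry goes into the existing listeners list):

    # homeserver.yaml — sketch; expose metrics on localhost only
    enable_metrics: true
    listeners:
      - port: 9000
        type: metrics
        bind_addresses: ['127.0.0.1']

A Prometheus scrape job (or even a one-off curl) against that port then gives the cache size and eviction counters, plus the per-block CPU metrics mentioned above.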
@erikjohnston Sorry for the delay; I was trying to get more data and also figure out whether this is just something weird we're doing specifically that's causing our problems (still ongoing and very possible). It looks like memory consumption is better in 1.64.0, though it doesn't appear to ever release any memory. It leveled out around 25% usage, but is still increasing:
That said, we still can't turn presence on without Synapse eventually just dying. It's difficult to get data on this because it will usually be fine until weekday mornings when users come online. It looks like
Let me know if there's anything else I can give to assist.
I just did a backup restore (because the instance isn't in real production), and version 1.63.1 seems to work for me:
Memory stays at about 1 GB and the instance is running just fine. The mentioned package (frozendict) is at version 2.3.0.
If there is anything I can do to help solve this issue, please let me know!
@jnthn-b Ugh. Do you have presence enabled? If so, can you see if you can reproduce it with presence disabled?
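For reference, disabling presence for this kind of test is a small change in homeserver.yaml; a sketch (use_presence is the older spelling of the same switch, if I remember the config docs correctly):

    # homeserver.yaml — sketch for testing only
    presence:
      enabled: false

Re-enabling is just flipping it back to true and restarting Synapse.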
@erikjohnston I will check that later today (hopefully) :)
@erikjohnston Presence is enabled; I just upgraded the instance to 1.66.0. The bug seems to be solved for me. I see no more memory leaks.
It appears that the affected cases here have been resolved. If people notice similar issues with the latest version of Synapse (presence enabled or not), please file a new issue. Thanks!
From the scrollback it doesn't seem like things were resolved for @tomsisk's deployment(s) and we are seeing presence-related memory problems e.g. in #13901. I notice that there is a new release of frozendict available which claims to fix more memory leaks. @tomsisk: would you be willing to build and test a docker image which includes that release if I make the changes on a branch?
@DMRobertson Sure
Thanks, let's take this to #13955.
Description
Since upgrading to matrix-synapse-py3==1.53.0+focal1 from 1.49.2+bionic1, I have observed a memory leak on my instance.
The upgrade coincided with an OS upgrade from Ubuntu bionic => focal / Python 3.6 to 3.8.
We didn't change homeserver.yaml during the upgrade.
Our machine had 3 GB of memory for 2 years and now 10 GB isn't enough.
Steps to reproduce
root@srv-matrix1:~# systemctl status matrix-synapse.service
● matrix-synapse.service - Synapse Matrix homeserver
Loaded: loaded (/lib/systemd/system/matrix-synapse.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2022-03-04 10:21:05 CET; 4h 45min ago
Process: 171067 ExecStartPre=/opt/venvs/matrix-synapse/bin/python -m synapse.app.homeserver --config-path=/etc/matrix-synapse/homese>
Main PID: 171075 (python)
Tasks: 30 (limit: 11811)
Memory: 6.1G
CGroup: /system.slice/matrix-synapse.service
└─171075 /opt/venvs/matrix-synapse/bin/python -m synapse.app.homeserver --config-path=/etc/matrix-synapse/homeserver.yaml ->
I tried to change this config, without success:
expiry_time: 30m
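For anyone reading later, a sketch of where this option is meant to live in homeserver.yaml (assuming the standard caches block; newer Synapse releases spell the same setting expire_caches / cache_entry_ttl, if I'm reading the current config docs correctly):

    # homeserver.yaml — sketch, not a confirmed fix
    caches:
      expiry_time: 30m          # evict cache entries not accessed for 30 minutes
      # On newer Synapse versions the equivalent is:
      # expire_caches: true
      # cache_entry_ttl: 30m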
syslog says the OOM killer killed Synapse:
Mar 4 10:20:54 XXXX kernel: [174841.111273] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/matrix-synapse.service,task=python,pid=143210,uid=114
Mar 4 10:20:54 srv-matrix1 kernel: [174841.111339] Out of memory: Killed process 143210 (python) total-vm:12564520kB, anon-rss:9073668kB, file-rss:0kB, shmem-rss:0kB, UID:114 pgtables:21244kB oom_score_adj:0
No further useful information in homeserver.log.
Version information
$ curl http://localhost:8008/_synapse/admin/v1/server_version
{"server_version":"1.53.0","python_version":"3.8.10"}
Version: 1.53.0
Install method:
Ubuntu apt repo
Platform:
VMWare
I'd be happy to help by getting a Python stack trace to debug this, if someone can give me a lead on how to do so.
(Sorry for my English.)