This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse hangs for multiple seconds every couple of minutes #12778

Closed
Erethon opened this issue May 18, 2022 · 6 comments
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@Erethon
Contributor

Erethon commented May 18, 2022

Description

Hello, I'm running into a weird issue with Synapse that I'm failing to debug.

Since two days ago (I was running 1.58.1 at the time), my Synapse homeserver has seemingly been hanging completely for 30-60 seconds every 2-3 minutes. Upgrading to 1.59.0 didn't fix the issue.

The symptom I'm seeing is that log processing stops entirely and Synapse is unresponsive to HTTP requests. Here is a snippet of the logs when this happens. Synapse CPU usage has hovered close to 100% since this started, but I don't see anything else unusual in the metrics (e.g. GCs taking too long). In any case, here's a grafana dashboard.

The one thing I notice whenever this happens is that the logs have an entry for LRU caches, such as 2022-05-18 00:53:47,150 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-38- Dropped 0 items from caches or 2022-05-18 00:59:47,603 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-1- Dropped 0 items from caches. Taking a py-spy record when this happens, I can see that a lot of time is spent in cache_set. Attaching the py-spy output as well.
[py-spy recording attachment: synapse4]

Any ideas on how to debug this further, or is this a known issue?

Version information

  • Homeserver:
    erethon.com, synapse is hosted under matrix.erethon.com
  • Version:
    1.58.1 initially, upgraded to 1.59.0 yesterday. Single process with no workers.

  • Install method:
    Debian packages

  • Platform:
Debian 10, in a VM on a dedicated host I control. The Postgres database is on the same physical host, just in a different VM.

@Erethon
Contributor Author

Erethon commented May 18, 2022

After finding out about #12730 and #12547, I can confirm that there are no bots on my HS (registration is closed/token-only, and I double-checked the database to make sure). I can also confirm that the largest room on my HS is #python:matrix.org | !iuyQXswfjgxQMZGrfQ:matrix.org with 13202487 state_groups, so it doesn't look as bad as the numbers reported in those tickets.

@prurigro

prurigro commented May 18, 2022

This is a bit disconcerting; I'd hoped the bot issue would explain this one completely... Just thinking out loud here, but is your server old enough to have history it's deleting itself (based on max_lifetime and allowed_lifetime_max in your homeserver.yaml)?

The reason I ask is that my server had reached that point, and I'd previously noticed that a script that deleted unused rooms would cause timeouts.

@babolivier
Contributor

@Erethon It looks to me like you have a custom value for event_cache_size in your Synapse configuration file (set to "5K" if I'm reading the graphs correctly, as opposed to the default value of "10K"). Looking at the graph, I think what's happening is that Synapse is trying to pull events to perform state resolution but is struggling to keep them in cache. Can you try increasing this value and see if it helps?
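
For reference, a minimal homeserver.yaml sketch of that kind of change would be something like the following (the exact value is purely illustrative, not a recommendation):

# Illustrative only: raise the event cache above the 10K sample-config default.
event_cache_size: 20K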

@babolivier babolivier added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label May 19, 2022
@babolivier babolivier added the X-Needs-Info This issue is blocked awaiting information from the reporter label May 19, 2022
@Erethon
Contributor Author

Erethon commented May 20, 2022

This is a bit disconcerting; I'd hoped the bot issue would explain this one completely... Just thinking out loud here, but is your server old enough to have history it's deleting itself (based on max_lifetime and allowed_lifetime_max in your homeserver.yaml)?

The reason I ask is that my server had reached that point, and I'd previously noticed that a script that deleted unused rooms would cause timeouts.

The HS was spun up in early 2019, so it is fairly old, but no automatic deletion is taking place. I do, however, have a daily cron job that purges room history via the admin API for specific (federated) rooms. I've read #11521 and #10792, which look somewhat related (purging causing the host to hang), but I don't think that's what is going on here.

@Erethon
Contributor Author

Erethon commented May 20, 2022

@Erethon It looks to me like you have a custom value for event_cache_size in your Synapse configuration file (set to "5K" if I'm reading the graphs correctly, as opposed to the default value of "10K"). Looking at the graph, I think what's happening is that Synapse is trying to pull events to perform state resolution but is struggling to keep them in cache. Can you try increasing this value and see if it helps?

I haven't changed event_cache_size in my Synapse config; in fact, this is the only cache-related config I have:

caches:
  per_cache_factors:
      get_users_in_room: 4.0
      _get_joined_profile_from_event_id: 8.0

I think I've run into either a documentation bug or Synapse not following what the docs say. The Synapse sample config states:

# The number of events to cache in memory. Not affected by
# caches.global_factor.
#event_cache_size: 10K

However, since I hadn't set global_factor, the default of 0.5 was applied, which halved all the caches, including event_cache_size/getEvent. I verified this by setting global_factor to 2.0 and event_cache_size to 20000 and getting back synapse_util_caches_cache_max_size{name="*getEvent*"} 40000.0 via the metrics endpoint.
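
For clarity, a minimal sketch of the config used for that check (reconstructed from the values above, not copied verbatim from my homeserver.yaml) looks like:

event_cache_size: 20000

caches:
  global_factor: 2.0   # when this is left unset, the 0.5 default is what halves the caches
  per_cache_factors:
      get_users_in_room: 4.0
      _get_joined_profile_from_event_id: 8.0

# With the above, the metrics endpoint reports 40000.0 for synapse_util_caches_cache_max_size{name="*getEvent*"}.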

I believe this was also caught by @MurzNN in #8811 in this comment, as a PS. To me this seems like either a bug in the documentation or Synapse not following the documentation [1]. Shall I create a new bug report for this?

To get back to the original report, I've now set global_factor to 1.0 and getEvent is at 10k, so I'll report back if this happens again.

[1]: Or me having misunderstood and getEvent not being the same as event_cache_size.
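
For completeness, a minimal sketch of the resulting config (again reconstructed from the values mentioned above) is:

caches:
  global_factor: 1.0   # set explicitly so the unset-default of 0.5 no longer halves the caches
  per_cache_factors:
      get_users_in_room: 4.0
      _get_joined_profile_from_event_id: 8.0

# event_cache_size is left at its 10K default, which the metrics now report for getEvent.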

@babolivier
Contributor

However, since I hadn't set global_factor, the default of 0.5 was applied, which halved all the caches, including event_cache_size/getEvent

Aha, that explains it. A bit of a confusing default, but 🤷

To me this seems like either a bug in the documentation or Synapse not following the documentation [1]. Shall I create a new bug report for this?

Yes please.

[1]: Or me having misunderstood and getEvent not being the same as event_cache_size.

event_cache_size is the setting that controls the size of the getEvent cache, yes.

To get back to the original report, I've now set global_factor to 1.0 and getEvent is at 10k, so I'll report back if this happens again.

Sweet. In that case I'll close this issue, but feel free to shout if it happens again.
