This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse hangs for multiple seconds every couple of minutes #12778

Closed
Erethon opened this issue May 18, 2022 · 6 comments
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@Erethon
Contributor

Erethon commented May 18, 2022

Description

Hello, I'm running into a weird issue with Synapse that I'm failing to debug.

Since two days ago (I was running 1.58.1 at the time), my Synapse homeserver has seemingly been hanging completely for 30-60 seconds every 2-3 minutes. Upgrading to 1.59.0 didn't fix the issue.

The symptom I'm seeing is that log processing stops entirely and Synapse is unresponsive to HTTP requests. Here is a snippet of the logs when this happens. Synapse CPU usage has hovered close to 100% since this started, but I don't see anything else unusual in the metrics (e.g. GCs taking too long). In any case, here's a grafana dashboard.

The one thing I notice whenever this happens is that the logs have an entry for LRU caches, such as 2022-05-18 00:53:47,150 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-38- Dropped 0 items from caches or 2022-05-18 00:59:47,603 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-1- Dropped 0 items from caches. Taking a py-spy record when this happens, I can see that a lot of time is spent in cache_set. Attaching the py-spy output as well.
[py-spy recording attachment: synapse4]

Any ideas on how to debug this further, or is this a known issue?

Version information

  • Homeserver:
    erethon.com, synapse is hosted under matrix.erethon.com
  • Version:
    1.58.1 initially, upgraded to 1.59.0 yesterday. Single process with no workers.

  • Install method:
    Debian packages

  • Platform:
Debian 10, in a VM on a dedicated host I control. The Postgres database is on the same physical host, just in a different VM.

@Erethon
Contributor Author

Erethon commented May 18, 2022

After finding out about #12730 and #12547, I can confirm that there are no bots on my HS (registration is closed/token-only, and I double-checked the database to make sure). I can also confirm that the largest room on my HS is #python:matrix.org | !iuyQXswfjgxQMZGrfQ:matrix.org with 13202487 state_groups, so it doesn't look as bad as the numbers reported in those tickets.

@prurigro

prurigro commented May 18, 2022

This is a bit disconcerting; I'd hoped the bot issue would explain this one completely... Just thinking out loud here, but is your server old enough to have history it's deleting itself (based on max_lifetime and allowed_lifetime_max in your homeserver.yaml)?

The reason I ask is that my server had reached that point, and I'd previously noticed that a script that deleted unused rooms would cause timeouts.

@babolivier
Contributor

@Erethon It looks to me like you have a custom value for event_cache_size in your Synapse configuration file (set to "5K" if I'm reading the graphs correctly, as opposed to the default value of "10K"). Looking at the graph, I think what's happening is that Synapse is trying to pull events to perform state resolution but is struggling to keep them in cache. Can you try increasing this value and see if it helps?
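
For reference, a minimal homeserver.yaml sketch of that kind of change would be something like the following (the exact value is purely illustrative, not a recommendation):

# Illustrative only: raise the event cache above the 10K sample-config default.
event_cache_size: 20K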

@babolivier babolivier added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label May 19, 2022
@babolivier babolivier added the X-Needs-Info This issue is blocked awaiting information from the reporter label May 19, 2022
@Erethon
Contributor Author

Erethon commented May 20, 2022

This is a bit disconcerting; I'd hoped the bot issue would explain this one completely... Just thinking out loud here, but is your server old enough to have history it's deleting itself (based on max_lifetime and allowed_lifetime_max in your homeserver.yaml)?

The reason I ask is that my server had reached that point, and I'd previously noticed that a script that deleted unused rooms would cause timeouts.

The HS was spun up in early 2019, so it is fairly old, but no automatic deletion is taking place. I do, however, have a daily cron job that purges room history via the admin API for specific (federated) rooms. I've read #11521 and #10792, which look somewhat related (purging causing the host to hang), but I don't think that's what is going on here.

@Erethon
Contributor Author

Erethon commented May 20, 2022

@Erethon It looks to me like you have a custom value for event_cache_size in your Synapse configuration file (set to "5K" if I'm reading the graphs correctly, as opposed to the default value of "10K"). Looking at the graph, I think what's happening is that Synapse is trying to pull events to perform state resolution but is struggling to keep them in cache. Can you try increasing this value and see if it helps?

I haven't changed event_cache_size in my Synapse config; in fact, this is the only cache-related config I have:

caches:
  per_cache_factors:
      get_users_in_room: 4.0
      _get_joined_profile_from_event_id: 8.0

I think I've run into either a documentation bug or Synapse not following what the docs say. The Synapse sample config states:

# The number of events to cache in memory. Not affected by
# caches.global_factor.
#event_cache_size: 10K

However, since I hadn't set global_factor, the default of 0.5 was applied, which halved all the caches, including event_cache_size/getEvent. I verified this by setting global_factor to 2.0 and event_cache_size to 20000 and getting back synapse_util_caches_cache_max_size{name="*getEvent*"} 40000.0 via the metrics endpoint.
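
For clarity, a minimal sketch of the config used for that check (reconstructed from the values above, not copied verbatim from my homeserver.yaml) looks like:

event_cache_size: 20000

caches:
  global_factor: 2.0   # when this is left unset, the 0.5 default is what halves the caches
  per_cache_factors:
      get_users_in_room: 4.0
      _get_joined_profile_from_event_id: 8.0

# With the above, the metrics endpoint reports 40000.0 for synapse_util_caches_cache_max_size{name="*getEvent*"}.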

I believe this was also caught by @MurzNN in #8811 in this comment, as a PS. To me this seems like either a bug in the documentation or Synapse not following the documentation [1]. Shall I create a new bug report for this?

To get back to the original report, I've now set global_factor to 1.0 and getEvent is at 10k, so I'll report back if this happens again.

[1]: Or me having misunderstood and getEvent not being the same as event_cache_size.
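
For completeness, a minimal sketch of the resulting config (again reconstructed from the values mentioned above) is:

caches:
  global_factor: 1.0   # set explicitly so the unset-default of 0.5 no longer halves the caches
  per_cache_factors:
      get_users_in_room: 4.0
      _get_joined_profile_from_event_id: 8.0

# event_cache_size is left at its 10K default, which the metrics now report for getEvent.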

@babolivier
Contributor

However, since I hadn't set global_factor, the default of 0.5 was applied, which halved all the caches, including event_cache_size/getEvent

Aha, that explains it. A bit of a confusing default, but 🤷

To me this seems like either a bug in the documentation or Synapse not following the documentation [1]. Shall I create a new bug report for this?

Yes please.

[1]: Or me having misunderstood and getEvent not being the same as event_cache_size.

event_cache_size is the setting that controls the size of the getEvent cache, yes.

To get back to the original report, I've now set global_factor to 1.0 and getEvent is at 10k, so I'll report back if this happens again.

Sweet. In that case I'll close this issue, but feel free to shout if it happens again.
