Room retention *increases* size of state tables. #9406
Hello, very possibly the report came from a discussion I initiated in some Matrix admin rooms, since we were facing such an issue and trying to get to the bottom of it before opening a thread on GitHub. We are however unable to pinpoint the exact source of the defect, so here are as many details as I can collect, seeking your help.

We are running a dedicated HS for bridges. Most rooms are therefore initiated by appservices, have many (1-20k) local users from appservices, and a handful of remote users from (multiple) consuming HSes (1-20). We recently decided to enable a retention policy on the server, deleting history, including local history, older than 6 months. Our rationale was guided by our ToS, which clearly states we are not entitled to store bridged or cached data for more than 6 months, plus a need to purge old events and associated media, which had stacked up terabytes of data from bridging high-activity Telegram rooms for instance. We are aware this might interfere with history purges on other homeservers: a seemingly non-destructive operation of purging remote history elsewhere might become destructive, because when paginating, the remote HS will get empty history from us. We are okay with this from a user experience perspective.

**The failure**

We woke up to a full drive, which filled overnight, so we were unable to respond in time and postgres was now read-only with no room for vacuuming. Our bridges database had grown to over 400 GB. The culprit was clearly state management.

**The facts**

After failing to identify anything obvious in the logs or data, and not finding any open issue on the matter, we pulled the database to a high-performance machine for analytics and gathered metrics from rooms and room state in the hope of uncovering a pattern. Room count was stable.
Storing state group entries was indeed involved; here is the total row count. We have no precise metric for the previous value, but we believe it was around 50M, so this is a 10x increase overnight.
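The numbers themselves come from simple counts over the state table; a minimal sketch of the kind of queries involved, assuming the stock Synapse schema (table names may differ across schema versions):

```sql
-- Total number of state group entries (the table that exploded overnight).
SELECT count(*) AS state_group_entries
FROM state_groups_state;

-- Rough on-disk footprint of the same table, for comparison with the 400 GB figure.
SELECT pg_size_pretty(pg_total_relation_size('state_groups_state')) AS total_size;
```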
After creating a couple of views for the investigation, we checked for the frequent issue of orphaned state groups.
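A hedged sketch of that check, assuming the standard Synapse tables (`state_groups`, `state_group_edges`, `event_to_state_groups`); this is an illustration, not the exact view we used:

```sql
-- State groups that no event points at and that no other state group
-- uses as its delta base: candidates for being "orphaned".
SELECT count(*) AS orphaned_groups
FROM state_groups AS sg
WHERE NOT EXISTS (
    SELECT 1 FROM event_to_state_groups AS esg
    WHERE esg.state_group = sg.id
)
AND NOT EXISTS (
    SELECT 1 FROM state_group_edges AS edge
    WHERE edge.prev_state_group = sg.id
);
```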
They do indeed make up for half the rooms represented in state groups, yet only 5% of the rows, so we discarded it as a culprit and decided to come back and clean this later. We sorted rooms by number of state group entries (not state groups). We frequently monitor those when trying to slim down our database, so we were able to identify that top 10 rooms from the previous week were now barely making top 50. Top 3 rooms were showing 50M, 50M and 40M state group entries respectively, so we decided to focus on top 1 and see if any pattern applied to the others. This was no new room, and latest event was a week old already:
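For reference, the kind of queries behind those two observations look roughly like this (a sketch assuming the stock Synapse schema; the room ID is a placeholder):

```sql
-- Rooms ranked by the number of rows they contribute to state_groups_state.
SELECT room_id, count(*) AS entries
FROM state_groups_state
GROUP BY room_id
ORDER BY entries DESC
LIMIT 10;

-- Age of the most recent event in the top room.
SELECT to_timestamp(max(origin_server_ts) / 1000) AS latest_event
FROM events
WHERE room_id = '!top-room:example.org';
```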
Most events were from initially joining the room; then not much happened in terms of state, and presumably, if messages were sent, they were sent prior to the retention period and are now deleted. None of the actual users in the room were from our own HS, so we were unable to properly confirm that last hypothesis.
Also, close to every state event is an invite+join, which backs our understanding.
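The membership breakdown can be checked with something along these lines (a sketch assuming the stock Synapse schema; the room ID is a placeholder):

```sql
-- Breakdown of membership events in the room: on a bridged room we expect
-- invites and joins to dominate, with very few leaves.
SELECT m.membership, count(*) AS events
FROM room_memberships AS m
JOIN events AS e USING (event_id)
WHERE e.room_id = '!top-room:example.org'
GROUP BY m.membership
ORDER BY events DESC;
```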
The aha moment was realizing that 20k invites+joins means 10k joins, thus 10k members if no one leaves, while the sum of the first 10k natural integers is very close to 50M (approximating n²/2 − n/2 by n²/2 is reasonable at this scale). Presumably there was now one state group per join event (confirmed), and each state group had one entry per member, amounting to a terrible 50M rows. On a bridge server, many rooms show this pattern of state events mostly comprised of invites+joins, with very few people leaving, either en masse when the room is first bridged or over time. We tried to spot the pattern in other rooms. Here we naively count events and state group entries. If a room is mostly comprised of join and invite events, with no or very little history, and our hypothesis holds, then it will have on the order of n²/2 state group entries, with n half the number of events (half of 1 invite + 1 join), i.e. count(events)²/8. This is a crude approximation, merely designed to check whether the pattern held at all in the top rooms.
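A hedged sketch of that naive check, assuming the stock Synapse schema; the ratio column should come out close to 1 for rooms exhibiting the quadratic pattern:

```sql
-- For each of the biggest rooms, compare the actual number of state group
-- entries with the quadratic prediction count(events)^2 / 8, derived from
-- n(n-1)/2 ~ n^2/2 entries for n ~ count(events)/2 members.
WITH entries AS (
    SELECT room_id, count(*) AS state_entries
    FROM state_groups_state
    GROUP BY room_id
), evts AS (
    SELECT room_id, count(*) AS event_count
    FROM events
    GROUP BY room_id
)
SELECT en.room_id,
       en.state_entries,
       ev.event_count,
       round(en.state_entries / (ev.event_count::numeric * ev.event_count / 8), 2) AS ratio
FROM entries AS en
JOIN evts AS ev USING (room_id)
ORDER BY en.state_entries DESC
LIMIT 20;
```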
We were not expecting every room to exhibit the pattern (ratio close to 1), but we were not expecting so many to either. This was our smoking gun. Somehow, somewhere, something was causing state groups to expand into a fully decompressed form.

**Our interpretation**

Down the rabbit hole we went. Synapse code very rarely creates state group entries. In fact it does so in only two different places: when a state event is stored, trying to de-duplicate reasonably by creating state group edges between state groups (the very structure that state compressors try to optimize further), and right here, when purging unreferenced state groups:
And purging unreferenced state groups is exactly what purging events does after computing a list of said groups:
It is in fact the only place where state groups are deleted and recreated, aside from normal event processing. And it is the very routine called by retention jobs. We think that in some rooms at least, depending on factors we were not able to pinpoint, after purging history, the purge routine determines that some events are unreferenced and need purging, and that those are the very events holding up the state group hierarchy, leading to the entire hierarchy being de-deduplicated and enormous amounts of state group entries being created.

**Our current status**

This is as far as we can go from our understanding of synapse internals. We have not had time yet to try and reproduce this on a small HS with small rooms. We gambled on the hypothesis that this would only happen when the retention job first cleans a room's history, and that we should be safe compressing the state table and be on our way. It did compress very well, to about 5% of its initial size, by manually running the state compressor with default settings on every room. If you remember the orphaned state groups from the initial findings, these amounted to pretty much everything that was left. It did fine for a couple of days. Until the retention jobs kicked in, and everything is now de-deduplicated again. We are thinking about running that database on a separate host until we find a more stable fix, and we are not trying to compress that table anymore. We do need help to go further, even pointers to places in the codebase we might have missed, or ideas for reproducing this in a test case.
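For anyone trying to reproduce, fully de-delta-ed groups are easy to spot in the database: instead of a handful of delta rows, each group carries roughly one row per member. A hedged sketch, stock schema assumed, placeholder room ID:

```sql
-- Rows per state group in the affected room: healthy delta'd groups hold a
-- few rows each, de-delta-ed groups hold on the order of the member count.
SELECT state_group, count(*) AS rows_in_group
FROM state_groups_state
WHERE room_id = '!top-room:example.org'
GROUP BY state_group
ORDER BY rows_in_group DESC
LIMIT 10;
```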
Oh, the original report is pretty old already; no idea why I did not find it in the first place, but I surely was not the original reporter, since we experienced this just weeks ago.
Sorry about spamming this issue. After a first pessimistic report, it appears that the rooms that were "de-deduplicated" this time are different from the first time. My take is that the retention job did not finish last time, since the database crashed before it could. We were lucky that the remaining rooms did not nuke our server once more. Symptoms are similar and the same pattern emerges, this time with even more entries: one room shows 180M state group entries. I believe we are now out of most trouble, yet I will keep a copy of the database in case I can help pinpoint the root cause.
A couple of months later, this happened again on our HS. We were lucky to avoid major downtime, so I went back down the rabbit hole of the history purging code in synapse. Here is a summary of what I found, and how I suggest we fix some of it.

**The process of locating state groups to be deleted is complex and looks flawed**

First, history purging happens here:
This is supposed to purge old events and their relationships, then return every state group to be considered first for deletion. It does this by returning every state group related to any event, including state events, that matches the purge criteria, except for outliers. It also ends by removing the reference to state groups from state events matching the criteria and turning them into outliers. This means those same state events will never be considered in later purges (which explicitly exclude outliers). I am unsure whether this is an actual issue, since the associated state groups should be deleted and we might not care about these later, but I would feel more confident including more state groups as candidates than leaving some of them around potentially forever. Maybe this is also related to accumulating unreferenced state groups. Later on, this initial list is expanded and filtered here:
This basically climbs up the state group tree and stops when it finds state groups that are still referenced and thus should not be deleted. That is fine, except for how it climbs up the tree.
From a quick overview, this does not return previous state groups but instead filters the given state group list, keeping those that have children. This in turn breaks the calling code. Here is a reference to the separate issue about unreferenced state groups, in case my analysis can help there: matrix-org/synapse#12821

I suggest:
**State groups are stored indefinitely for rejected client events**

When creating client events, synapse computes and stores their state groups very early, before the event is accepted. For instance, a client event for joining an invite-only room will generate a state group even if it is rejected. This leads to unreferenced entries in the state group tree, which also reference unpersisted events. I am unsure how to proceed on this. On paper, state groups are merely a structured cache, so it should do no harm except for taking up space that is never reclaimed, while changing this behavior sounds like a colossal task.

**Application services weaponize this**

At least, Mautrix does. For every bridged user, instead of checking the room permissions, Mautrix simply tries to join, then upon failure invites the user and joins again. It works fine. Except synapse stores an unreferenced state group in the tree for every bridged user, due to the denied initial join. A bridged room's state group tree is pretty straightforward, except that every user appears three times: one invite, one join, and one unreferenced branch. Being unreferenced, that branch is not considered for deletion when purging the room. However, when the main branch is removed, all those state groups are de-delta-ed.
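These leftover branches can be spotted directly in the database. A hedged sketch, assuming the stock Synapse schema and a placeholder room ID, of a query listing state groups that no event references but that still hang off a parent group, i.e. exactly the groups that get de-delta-ed when their parent is purged:

```sql
-- Unreferenced leaf state groups that still have a delta edge to a parent:
-- nothing points at them, but purging their parent forces them to be
-- rewritten as full (de-delta-ed) copies of the room state.
SELECT DISTINCT sg.id AS state_group
FROM state_groups AS sg
JOIN state_group_edges AS edge ON edge.state_group = sg.id
WHERE sg.room_id = '!bridged-room:example.org'
  AND NOT EXISTS (
      SELECT 1 FROM event_to_state_groups AS esg
      WHERE esg.state_group = sg.id
  )
  AND NOT EXISTS (
      SELECT 1 FROM state_group_edges AS child
      WHERE child.prev_state_group = sg.id
  );
```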
So after a history purge, a bridged room has one unreferenced, de-delta-ed state group for every room member. This grows quadratically, as shown above, and large rooms may create dozens or hundreds of millions of state group rows in a couple of minutes when purging history because of this, effectively taking the database down.

I suggest:
Reporting back on this issue. This kept happening to us and spamming our production database, so we disabled all purging logic on our main homeserver during the summer, until we could take the time to devise a reasonable fix. To fix the issue I went the easy way: we are facing an issue with cached state groups that are unreferenced by any event (because none is persisted), and which do reference a previous state group, leading to them being de-delta-ed when the parent state group is deleted. I figured this only happens in the room purging code, so fixing the issue right there, deleting these obviously stale state groups not at the source but before they cause any damage, seemed sound. It prevents the issue and provides a decent mechanism to regularly spot these nasty groups and remove some rotting data without scanning the entire database. Most of all, it does not require me to play around with the event persistence code, which looked daunting. We will be testing this on our homeserver for the next couple of days, then I'll report back again and provide a PR. Until then, you'll find our work at: https://forge.tedomum.net/tedomum/synapse/-/compare/b066b3aa04248260bd0cfb2ac3569a6e4a3236ea...fix-exploding-state-groups?from_project_id=97&straight=false
Reporting back one more time after a week of running the "fix". It seems my understanding was incomplete, or my code has issues, because the bug manifested again last night and took down our homeserver for a couple of hours after filling the disk with state group rows. I am reviewing the purging code again; it is still missing some comments and I am unsure about some of the logic. I am however building an intuition that cleaning state groups just isn't worth it.
This is however a drastic, no-return decision, and I am unable to weigh things properly. @erikjohnston sorry for mentioning you; from my lurking on synapse dev channels and GitHub issues, it sounds like you are the most knowledgeable on the matter. Do you think we should keep fixing the purging code and properly delete state groups while avoiding the edge cases that potentially insert millions of new rows? Or should we just simplify things and stop removing state groups when we purge history? Either way, I will keep working on this over the next couple of weeks, and will provide a PR as soon as I have tested a working solution.
Hey, this is actually something we've been meaning to deal with lately, though we haven't been able to carve out the time just yet. Let me brain-dump my working understanding of the issue. There are some quite complex interactions between deleting history, state groups, and event persistence. The major blocking issue with fixing the current code is that there is actually a nasty race when figuring out whether a state group is unreferenced or not: when we persist new state groups they start off as unreferenced, and only then do we persist the events that reference them. Changing the state group deletion logic to also delete unreferenced state groups runs a much greater risk of accidentally deleting state groups that are "in-flight" (i.e. currently unreferenced but about to be referenced). I have some thoughts on how we could delete state groups safely, but it's non-trivial. Once we can safely delete state groups, then I think checking whether a state group that is about to be de-deltaed is actually referenced should fix this issue. One thing that we have seen repeatedly is that using https://github.com/erikjohnston/synapse-find-unreferenced-state-groups and then deleting the matching state groups reclaims a lot of disk space. (I wouldn't necessarily recommend doing so, though.)
It very much depends on the size and complexity of the room. The state compressor helps a lot, but state can still be one of the major uses of disk space. Certainly we have found that deleting state groups in large rooms does lead to significant disk reduction. The reason for this is that while the state events themselves aren't deleted, we can delete the mapping of event ID to state at that event (which is what state groups actually are).
This is basically the same as matrix-org/synapse#3364 (comment).
Thanks to both of you for the replies.
This is indeed what's preventing us from properly deleting these stale state groups: we can never be certain they are in fact unreferenced.
We are indeed running this very specifically for the worst rooms as a last resort.
I have been thinking about this a lot lately, and I want to suggest another way of fixing things. From my understanding, the issue is twofold:

1. there are unreferenced state groups in the database, which we can never properly delete, and which sometimes form such a structure that they get massively de-delta-ed (like here, where an isolated unreferenced state group is generated for every new user joining a Mautrix-bridged room);
2. the purge code, and pretty much any code trying to maintain and delete state groups, is very complex, making things hard to reason about and patch (the naive patch I posted above took me hours to devise and does not work, which cost me some more hours to understand why).

We definitely need to fix the race between persisting state groups and the events that generated them, possibly by changing the core logic of event persistence, or by using database transactions for instance. Once this (admittedly huge) work is done, nothing stops us from switching to a GC for deleting state groups. Switching to garbage collection would allow us to keep the state group deletion code simpler (just delete unreferenced state groups instead of extrapolating from the list of purged events, for instance) and closer to the rest of the state group related code. We could remove anything related to state groups from the room purging code, and maybe explicitly trigger the GC after a room or its history is purged (see the sketch below). The GC could even embed some of the state compression logic. In my mental model, state groups are a persistent cache of state history, and I think that managing a cache data structure with a GC does make sense. Would you agree with my analysis? And in that case, would you agree we work on the race first, then switch to a GC model? If so, I will focus my effort on helping with the race first, making sure we only persist referenced state groups.
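To make the GC idea concrete, here is a minimal sketch of what a single collection pass could look like, assuming the stock Synapse schema. This is an illustration of the proposal, not Synapse's actual deletion code, and it deliberately ignores the in-flight race discussed above (a real implementation would have to exclude recently created groups or coordinate with event persistence):

```sql
-- One GC pass: delete state groups that are not referenced by any event and
-- are not the delta base of any other state group. Repeating the pass lets
-- the collection walk up the tree one generation at a time.
WITH dead AS (
    SELECT sg.id
    FROM state_groups AS sg
    WHERE NOT EXISTS (
        SELECT 1 FROM event_to_state_groups AS esg
        WHERE esg.state_group = sg.id
    )
    AND NOT EXISTS (
        SELECT 1 FROM state_group_edges AS edge
        WHERE edge.prev_state_group = sg.id
    )
),
deleted_edges AS (
    DELETE FROM state_group_edges WHERE state_group IN (SELECT id FROM dead)
),
deleted_state AS (
    DELETE FROM state_groups_state WHERE state_group IN (SELECT id FROM dead)
)
DELETE FROM state_groups WHERE id IN (SELECT id FROM dead);
```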
This issue has been migrated from #9406.
We have reports that enabling room retention in a very large room (with frequent joins and leaves) actually causes the size of the state tables to increase, which is surprising.
Note that it appears that postgres is vacuuming the tables.
Also note that this is an experimental feature.