Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Updated Synapse versions should fail to start if required background updates have not run yet. #16047

Open
bradtgmurray opened this issue Aug 1, 2023 · 5 comments · May be fixed by #16397
Open
Labels
O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@bradtgmurray
Copy link
Contributor

bradtgmurray commented Aug 1, 2023

We (Beeper) just had a fun adventure upgrading to 1.88.

Looks like back in 1.74 a background update was added populate_user_directory_process_users and unbeknownst to us we've been grinding away at that background update ever since we upgraded back in January, mostly because we have 28 million users in our users table thanks to all the appservice users, as well as https://github.com/matrix-org/synapse/pull/15435/files#r1281053946 causing us to miss an index and only being able to process a single user every few seconds. This also means that any more recently added background updates haven't run, most importantly, profiles_full_user_id_key_idx and room_membership_user_room_index. We had updated to 1.85 in mid June, but the background update has still been chugging along and we were still missing the indexes. When we upgraded to 1.88 earlier today it ended up blowing up our DB since queries like SELECT displayname FROM profiles WHERE full_user_id = '@brad:beeper.com' would just scan the 28 million row table. We resolved the issue by just adding the index by hand which only took a couple minutes to complete, after which our Synapse instance was usable again.

At the time of the update, our background_updates table looked like this:
image

After some discussion in #synapse-dev:matrix.org, it's clear that having these indexes be added by a background update is a good thing, as they can be added without a long blocking migration and done in a release before they're required. However, this leaves open a gap where if for whatever reason the previously assumed-to-be-there background updates don't complete before the update happens that depends on them you end up with a extremely poorly performing or worse broken Synapse installation.

In an ideal world, Synapse at startup could check that the background updates that it needs were completed and explode gracefully with a nice error message. Ideally ideally, this would happen prior to migrations being applied, as to make it easier to rollback the upgrade. This would require completed background updates remaining in the table with something like a completed_at column, but that's probably nice for auditability after the fact.

@clokep
Copy link
Member

clokep commented Aug 2, 2023

I wonder if running the schema update as part of the start-up is a bad idea. We can run them separately via the update_synapse_database, perhaps that should be the only way to update and Synapse should refuse to start until that's been separately run?

It somewhat doesn't solve the issue with background updates, although we might not have to use them as often if we know we won't be blocking the main process from starting.

I'm not sure how this would impact our forwards-/backwards-compatibility, but I don't think it would be that different. (Note that weirdly any background updates added would not run properly until the main process was restarted.)

@bradtgmurray
Copy link
Contributor Author

Interesting thought, maybe another way to think about it is to keep both paths available, but drop the requirement that migrations need to be fast on huge databases and strongly encourage large installs to run the migrations offline.

Is your database huge? You should probably run the script or you'll get slow migrations.
Is your database reasonably sized? You're fine to keep you installation simple and just let it run as part of the reboot, they should be quick anyway.

@erikjohnston erikjohnston added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Aug 14, 2023
@erikjohnston
Copy link
Member

I think its probably worth noting that there are two classes of background updates:

  1. Stuff that doesn't matter too much if they're not completed for ages (like rebuilding user directory); and
  2. Stuff you probably want to happen in a timely fashion, like building indexes.

Right now, I think we basically consider everything in the first category

@clokep
Copy link
Member

clokep commented Sep 8, 2023

Stuff that doesn't matter too much if they're not completed for ages (like rebuilding user directory); and

I wonder if these should now be done via the task scheduler instead of background updates?

@clokep clokep added O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. labels Sep 8, 2023
@clokep clokep linked a pull request Sep 28, 2023 that will close this issue
3 tasks
@clokep
Copy link
Member

clokep commented Sep 28, 2023

I put up a simple solution for this at #16397. I'd be curious if this would have fixed Beeper's issue @bradtgmurray.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants