EPIC: Handling big migrations for self-hosted users #7054
Comments
Had a chat with @guidoiaquinti about this. We concluded that there are multiple types of "big migrations" that are not necessarily data related, such as upgrading Postgres by a major version. Thus, to make meaningful progress, my initial focus here will be to tackle data migrations specifically: things like moving data from one table to another (e.g. the events migration), or syncing a table from Postgres to CH. Some of the other types maybe shouldn't even be handled automatically, as they can be very bespoke. The best approach to some of them (e.g. version upgrades) might just be docs.
Given what we learned about changing ClickHouse tables, do we really need to think about week-long migrations? What length of migration do we care about now?
Depending on the answer above (if we only care about migrations lasting a couple of hours), an alternative to consider:
I propose we always keep the range minimal. If it requires multiple code paths, then the more different paths we have in play, the harder it will be to fix bugs etc. So I propose we limit it to exactly 1 release (you must be on 1.33 to run the migration, and it's required for 1.34). People can run multiple updates, and that shouldn't be a problem.
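To make the single-release window concrete, here is a minimal sketch of a version range check. The name `is_migration_in_range` appears in the commit log in this thread, but the signature, the `parse_version` helper, and the choice to compare only major.minor are assumptions made for illustration, not the actual implementation.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as '1.33.0'; the patch part is ignored."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def is_migration_in_range(min_version: str, max_version: str, current_version: str) -> bool:
    """True if current_version lies in [min_version, max_version], compared by major.minor."""
    return parse_version(min_version) <= parse_version(current_version) <= parse_version(max_version)


# Single-release window from the proposal: the migration is runnable only on 1.33.
print(is_migration_in_range("1.33.0", "1.33.0", "1.33.2"))  # True
print(is_migration_in_range("1.33.0", "1.33.0", "1.34.0"))  # False
```

With the minimum and maximum set to the same release, an instance on 1.32 or 1.34 is rejected, which matches the "must be on 1.33 to run the migration" rule.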
Why UI? I think a CLI makes more sense, as one needs to manage their self-hosted install in the CLI anyway (for now at least, while we don't have parameters & upgrades in DO and we're not even on the GCP marketplace).
We do care about migrations that can last weeks. Events, for example, is a PK change, which requires a SELECT * and an insert into the new table. That took us a week.
Think about a week-long migration. It would be nice to get updates on its progress and status, and to be able to easily stop, roll back, and debug it if necessary. We don't need to start with this, but it does make management easier given these things would be running "in the background".
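As a rough illustration of what progress, status, resuming, and rollback could look like, the sketch below models a migration as an ordered list of operations, each paired with its own undo step. Everything here (`Operation`, `MigrationState`, `run_migration`, `rollback_migration`, `progress`, and the status strings) is hypothetical and not taken from the PostHog codebase; a real system would persist the state in a database rather than in memory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Operation:
    """One step of a long migration, paired with the step that undoes it."""
    name: str
    run: Callable[[], None]
    rollback: Callable[[], None]


@dataclass
class MigrationState:
    """State that would be persisted so that a restart can resume (hypothetical)."""
    status: str = "not_started"  # not_started | running | completed | errored | rolled_back
    current_op: int = 0          # index of the next operation to run


def progress(state: MigrationState, ops: List[Operation]) -> int:
    """Percentage of operations that have completed."""
    return int(100 * state.current_op / len(ops)) if ops else 100


def rollback_migration(ops: List[Operation], state: MigrationState) -> None:
    """Undo completed operations in reverse order.

    Assumption: a failed operation leaves nothing behind, so only completed
    operations are rolled back.
    """
    for op in reversed(ops[: state.current_op]):
        op.rollback()
    state.current_op = 0
    state.status = "rolled_back"


def run_migration(ops: List[Operation], state: MigrationState, auto_rollback: bool = True) -> MigrationState:
    """Run operations starting at state.current_op, so an interrupted run can resume."""
    state.status = "running"
    while state.current_op < len(ops):
        try:
            ops[state.current_op].run()
        except Exception:
            if auto_rollback:
                rollback_migration(ops, state)
            else:
                state.status = "errored"  # keep current_op so the run can be resumed
            return state
        state.current_op += 1
    state.status = "completed"
    return state


log: List[str] = []
ops = [
    Operation("create_new_table", lambda: log.append("create"), lambda: log.append("drop")),
    Operation("copy_rows", lambda: log.append("copy"), lambda: log.append("truncate")),
]
state = run_migration(ops, MigrationState())
print(state.status, progress(state, ops))  # completed 100
```

Because `current_op` is only advanced after an operation succeeds, calling `run_migration` again with the same state picks up at the operation that failed instead of starting over, which matters when a single step can take days.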
Work has started in #7364, #7365, and #7425. I've now decided to split this across multiple PRs and will use this issue to keep track of things:
Ideas/notes:
* add special migration definition and example * types * special migrations runner * fix tests * fix tests 2 * add clickhouse runner * add temp fix for tests * wip * add special migrations api (#7448) * wip new structure * update example sourcing * Update .gitignore * yet another wip structure * code quality * cypress * test docker image build * implement resumable ops * code quality * add comments * add warning * add conditional requirements for migration * add comment on is_required * add dependency map * wip dependencies and run migration on startup * code quality * fix bugs * fix more bugs * format * types * remove api from this branch * types * types * update clickhouse script * add is_migration_in_range util * fix type * fix runner * add AUTO_START_SPECIAL_MIGRATIONS env var * reset migration on start * cleanup * wip per op rollback * prevent accidental status rollback * add utils and definition test * update example with rollback per op * wip test special migration * add first runner tests * add runner tests * add util for code paths * fix test * fix types * fix types again * cleanup * cleanup * add periodic healthcheck task tests * remove unused imports * safer row updates * fix coalescing none checks * code quality * add docstrings * fix * fix deploys issue * update scripts * add delay * address reviews * address review comments * address review comments * address final comments * fix import error * fix tests * remove unused imports * fix tests * fix task test * remove unused return value * remove unused special migrations code from migrate_clickhouse * tweaks to support fresh deployments
* add special migration definition and example * types * special migrations runner * fix tests * fix tests 2 * add clickhouse runner * add temp fix for tests * wip * add special migrations api (#7448) * wip new structure * update example sourcing * Update .gitignore * yet another wip structure * code quality * cypress * test docker image build * implement resumable ops * code quality * add comments * add warning * add conditional requirements for migration * add comment on is_required * add dependency map * wip dependencies and run migration on startup * code quality * fix bugs * fix more bugs * format * types * remove api from this branch * types * types * update clickhouse script * add is_migration_in_range util * fix type * Add special migrations API * fix api * update api with new columns * fix runner * add AUTO_START_SPECIAL_MIGRATIONS env var * reset migration on start * Special migrations UI (#7054 pt. 6) (#7493) * update UI with new cols * fix UI * new UI statuses * cleanup * wip per op rollback * add refresh button * wip tests * prevent accidental status rollback * finish api tests * Update bin/tests * add utils and definition test * update example with rollback per op * wip test special migration * add first runner tests * add runner tests * add util for code paths * fix test * fix types * fix types again * cleanup * cleanup * add periodic healthcheck task tests * remove unused imports * safer row updates * fix coalescing none checks * code quality * add handling for non-staff users * add docstrings * fix * fix deploys issue * update scripts * add delay * address reviews * address review comments * address review comments * address final comments * fix import error * fix tests * remove unused imports * fix tests * fix task test * remove unused return value * remove unused special migrations code from migrate_clickhouse * tweaks to support fresh deployments * make instance first user staff
* add special migration definition and example * types * special migrations runner * fix tests * fix tests 2 * add clickhouse runner * add temp fix for tests * wip * add special migrations api (#7448) * wip new structure * update example sourcing * Update .gitignore * yet another wip structure * code quality * cypress * test docker image build * implement resumable ops * code quality * add comments * add warning * add conditional requirements for migration * add comment on is_required * add dependency map * wip dependencies and run migration on startup * code quality * fix bugs * fix more bugs * format * types * remove api from this branch * types * types * update clickhouse script * add is_migration_in_range util * fix type * Add special migrations API * fix api * update api with new columns * fix runner * add AUTO_START_SPECIAL_MIGRATIONS env var * reset migration on start * Special migrations UI (#7054 pt. 6) (#7493) * update UI with new cols * fix UI * new UI statuses * cleanup * wip per op rollback * add refresh button * wip tests * prevent accidental status rollback * finish api tests * Update bin/tests * add utils and definition test * update example with rollback per op * wip test special migration * add first runner tests * add runner tests * add util for code paths * fix test * fix types * fix types again * cleanup * cleanup * add periodic healthcheck task tests * remove unused imports * safer row updates * fix coalescing none checks * code quality * add handling for non-staff users * Add special migrations to instance status * add docstrings * fix * fix deploys issue * update scripts * add delay * address reviews * address review comments * address review comments * address final comments * fix import error * fix tests * remove unused imports * fix tests * fix task test * remove unused return value * remove unused special migrations code from migrate_clickhouse * tweaks to support fresh deployments * make instance first user staff * fix import
More to-dos:
Discussed during the platform meeting on Wednesday. Prior to releasing to everyone:
Thought about the status part a bit & proposal:
Btw, while we're user testing, we should probably default to turning off auto rollbacks initially.
@tiina303 both of your points are very specific to events ingestion. I'd rather keep this thread to aspects relating to the system itself.
Cloud-specific async migrations
Run
Manually update progress on refresh
on complete hook, run is_required, rethink deps
As an overall task, this is done. We can open smaller tickets for issues that come up.
Problem
Currently, we have no good story for how self-hosted users should handle big migrations (e.g. changing the events table PK to unblock a CH upgrade requires creating and loading a new table, which took us a whole week in Cloud).
Big migrations are likely to time out and leave users in inconsistent states. Plus, extending the timeout is not the way to go. We need to be prepared for migrations that might last weeks.
Proposed solution
We still need to spec this out, but the current line of thinking is to break out of the model of sequential parent-dependent migrations that we currently use and introduce "special migrations". Most migrations will remain the same, but bigger migrations will fall into this special category.
Some characteristics of these "special migrations" should be:
Ultimately we would want this management to happen in the UI, but an MVP might be a CLI tool/management command.
Further considerations