Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EPIC: Handling big migrations for self-hosted users #7054

Closed
yakkomajuri opened this issue Nov 11, 2021 · 14 comments
Closed

EPIC: Handling big migrations for self-hosted users #7054

yakkomajuri opened this issue Nov 11, 2021 · 14 comments
Assignees
Labels
enhancement New feature or request epic P0 Critical, breaking issue (page crash, missing functionality) team/infra Everything related to deploying PostHog

Comments

@yakkomajuri
Copy link
Contributor

Problem

Currently, we have no good story for how self-hosted users should handle big migrations (e.g. changing the events table PK to unblock a CH upgrade requires creating and loading a new table, which took us a whole week in Cloud).

Big migrations are likely to timeout and leave users in inconsistent states. Plus, extending the timeout is not the way to go. We need to be prepared for migrations that might last e.g. weeks.

Proposed solution

We need to still spec this out, but the current line of thinking is to break out of the model of sequential parent-dependent migrations that we currently use and introduce "special migrations". Most migrations will remain the same, but bigger migrations will fall in this special category.

Some characteristics of these "special migrations" should be:

  • Non-sequential: Instead of having to run before X and after Y, they would need to be executed within a range of PostHog versions (e.g. you can't run it before 1.32.0 and must run it before 1.34.0)
  • Logical branching: Within the version range for a migration, we should support multiple code paths, such that we can appropriately handle things before and after the migration runs
  • Monitoring: Users should have a good way of monitoring the migration, being aware of its status, and having a rough sense of progress
  • Fail safe: We should be able to handle a migration failing safely
  • Pre-run checklist: We should be able to perform some necessary checks (e.g. remaining storage) before running the migration and surfacing issues to the users before they run it (e.g. increase storage for CH)

Ultimately we would want this management to happen in the UI, but an MVP might be a CLI tool/management command.

Further considerations

  • We may need to break out of Infi for this?
  • ...
@yakkomajuri yakkomajuri added enhancement New feature or request epic team/infra Everything related to deploying PostHog team-platform P0 Critical, breaking issue (page crash, missing functionality) labels Nov 11, 2021
@yakkomajuri
Copy link
Contributor Author

Had a chat with @guidoiaquinti about this.

We concluded that there are multiple types of "big migrations" that are not necessarily data related. Think upgrading Postgres a major version for example.

Thus, to make meaningful progress, my initial focus here will be to tackle data migrations specifically. Things like moving data from a table to another (e.g. events migration), or syncing a table from Postgres to CH.

Some of the other types also maybe shouldn't even handle automatically, as they can be very bespoke. The best approach to some of them (e.g. version upgrades) might just be docs.

@yakkomajuri yakkomajuri self-assigned this Nov 15, 2021
@tiina303
Copy link
Contributor

We need to be prepared for migrations that might last e.g. weeks.

Regarding to what we learned about ClickHouse changing tables do we really need to think about week long migrations or what length of migrations we care about now?

Non-sequential: Instead of having to run before X and after Y, they would need to be executed within a range of PostHog versions (e.g. you can't run it before 1.32.0 and must run it before 1.34.0)

Depending on the answer above (if we care about a couple of hours migrations only) an alternative to consider, do :

  1. only events needs to be able to handle multiple versions (which I don't think they ever need to do anything about, not sure)
  2. For the upgrade we
    1. kill web (if needed) and plugins (if needed - could be in migration config)
    2. run the data migration
    3. upgrade web/plugins code (if that was needed) (e.g. via helm upgrade)

Logical branching: Within the version range for a migration, we should support multiple code paths, such that we can appropriately handle things before and after the migration runs

I propose we always make the range minimal if it requires multiple code paths the more different paths we have in play the harder it will be to fix bugs etc. So I propose we limit it to 1 release only exactly (you must be on 1.33 to run the migration & it's required for 1.34). People can run multiple updates and that shouldn't be a problem.

Ultimately we would want this management to happen in the UI, but an MVP might be a CLI tool/management command.

Why UI? I think cli makes more sense as one needs to manage their self-hosted instal in the cli anyway (for now at least when we don't have parameters & upgrades in DO & we're not even on gcp marketplace).

@yakkomajuri
Copy link
Contributor Author

Regarding to what we learned about ClickHouse changing tables do we really need to think about week long migrations or what length of migrations we care about now?

We do care about migrations that can last weeks. Events for example is a PK change, which requires a SELECT * and insert into the new table. That took us a week.

@yakkomajuri
Copy link
Contributor Author

Why UI?

Think about a week long migration. It would be nice to get updates on its progress, status, and be able to easily stop, rollback, debug if necessary.

We don't need to start with this, but it does make management easier given these things would be running "in the background".

@yakkomajuri
Copy link
Contributor Author

yakkomajuri commented Nov 30, 2021

Work has started in #7364, #7365, and #7425. I've now decided to split this across multiple PRs and will use this issue to keep track of things:

@yakkomajuri
Copy link
Contributor Author

yakkomajuri commented Nov 30, 2021

Ideas/notes:

  • Migration checker
  • Consider not running init on apps
  • In memory codepaths support
  • When to setup special migrations? Add env var?
  • Full migration logs

yakkomajuri added a commit that referenced this issue Nov 30, 2021
* add names to celery nodes

* use python instead

* hostname naming
yakkomajuri added a commit that referenced this issue Nov 30, 2021
* SpecialMigration model

* Update latest_migrations.manifest

* update model

* add ph version range to model

* format

* progress -> PositiveSmallIntegerField
yakkomajuri added a commit that referenced this issue Nov 30, 2021
yakkomajuri added a commit that referenced this issue Nov 30, 2021
yakkomajuri added a commit that referenced this issue Nov 30, 2021
yakkomajuri added a commit that referenced this issue Nov 30, 2021
#7450)

* Revert "Revert "Add names to celery nodes (#7054 pt. 2) (#7433)" (#7449)"

This reverts commit ecf0679.

* Update bin/docker-worker-beat

* Update bin/docker-worker-celery

* Update bin/start-worker

* Update bin/docker-worker-beat

* Update bin/docker-worker-beat
yakkomajuri added a commit that referenced this issue Dec 1, 2021
* add special migration definition and example

* types

* update definition

* move example to examples dir
yakkomajuri added a commit that referenced this issue Dec 3, 2021
* update UI with new cols

* fix UI
yakkomajuri added a commit that referenced this issue Dec 13, 2021
* add special migration definition and example

* types

* special migrations runner

* fix tests

* fix tests 2

* add clickhouse runner

* add temp fix for tests

* wip

* add special migrations api (#7448)

* wip new structure

* update example sourcing

* Update .gitignore

* yet another wip structure

* code quality

* cypress

* test docker image build

* implement resumable ops

* code quality

* add comments

* add warning

* add conditional requirements for migration

* add comment on is_required

* add dependency map

* wip dependencies and run migration on startup

* code quality

* fix bugs

* fix more bugs

* format

* types

* remove api from this branch

* types

* types

* update clickhouse script

* add is_migration_in_range util

* fix type

* fix runner

* add AUTO_START_SPECIAL_MIGRATIONS env var

* reset migration on start

* cleanup

* wip per op rollback

* prevent accidental status rollback

* add utils and definition test

* update example with rollback per op

* wip test special migration

* add first runner tests

* add runner tests

* add util for code paths

* fix test

* fix types

* fix types again

* cleanup

* cleanup

* add periodic healthcheck task tests

* remove unused imports

* safer row updates

* fix coalescing none checks

* code quality

* add docstrings

* fix

* fix deploys issue

* update scripts

* add delay

* address reviews

* address review comments

* address review comments

* address final comments

* fix import error

* fix tests

* remove unused imports

* fix tests

* fix task test

* remove unused return value

* remove unused special migrations code from migrate_clickhouse

* tweaks to support fresh deployments
yakkomajuri added a commit that referenced this issue Dec 13, 2021
* add special migration definition and example

* types

* special migrations runner

* fix tests

* fix tests 2

* add clickhouse runner

* add temp fix for tests

* wip

* add special migrations api (#7448)

* wip new structure

* update example sourcing

* Update .gitignore

* yet another wip structure

* code quality

* cypress

* test docker image build

* implement resumable ops

* code quality

* add comments

* add warning

* add conditional requirements for migration

* add comment on is_required

* add dependency map

* wip dependencies and run migration on startup

* code quality

* fix bugs

* fix more bugs

* format

* types

* remove api from this branch

* types

* types

* update clickhouse script

* add is_migration_in_range util

* fix type

* Add special migrations API

* fix api

* update api with new columns

* fix runner

* add AUTO_START_SPECIAL_MIGRATIONS env var

* reset migration on start

* Special migrations UI (#7054 pt. 6) (#7493)

* update UI with new cols

* fix UI

* new UI statuses

* cleanup

* wip per op rollback

* add refresh button

* wip tests

* prevent accidental status rollback

* finish api tests

* Update bin/tests

* add utils and definition test

* update example with rollback per op

* wip test special migration

* add first runner tests

* add runner tests

* add util for code paths

* fix test

* fix types

* fix types again

* cleanup

* cleanup

* add periodic healthcheck task tests

* remove unused imports

* safer row updates

* fix coalescing none checks

* code quality

* add handling for non-staff users

* add docstrings

* fix

* fix deploys issue

* update scripts

* add delay

* address reviews

* address review comments

* address review comments

* address final comments

* fix import error

* fix tests

* remove unused imports

* fix tests

* fix task test

* remove unused return value

* remove unused special migrations code from migrate_clickhouse

* tweaks to support fresh deployments

* make instance first user staff
yakkomajuri added a commit that referenced this issue Dec 13, 2021
* add special migration definition and example

* types

* special migrations runner

* fix tests

* fix tests 2

* add clickhouse runner

* add temp fix for tests

* wip

* add special migrations api (#7448)

* wip new structure

* update example sourcing

* Update .gitignore

* yet another wip structure

* code quality

* cypress

* test docker image build

* implement resumable ops

* code quality

* add comments

* add warning

* add conditional requirements for migration

* add comment on is_required

* add dependency map

* wip dependencies and run migration on startup

* code quality

* fix bugs

* fix more bugs

* format

* types

* remove api from this branch

* types

* types

* update clickhouse script

* add is_migration_in_range util

* fix type

* Add special migrations API

* fix api

* update api with new columns

* fix runner

* add AUTO_START_SPECIAL_MIGRATIONS env var

* reset migration on start

* Special migrations UI (#7054 pt. 6) (#7493)

* update UI with new cols

* fix UI

* new UI statuses

* cleanup

* wip per op rollback

* add refresh button

* wip tests

* prevent accidental status rollback

* finish api tests

* Update bin/tests

* add utils and definition test

* update example with rollback per op

* wip test special migration

* add first runner tests

* add runner tests

* add util for code paths

* fix test

* fix types

* fix types again

* cleanup

* cleanup

* add periodic healthcheck task tests

* remove unused imports

* safer row updates

* fix coalescing none checks

* code quality

* add handling for non-staff users

* Add special migrations to instance status

* add docstrings

* fix

* fix deploys issue

* update scripts

* add delay

* address reviews

* address review comments

* address review comments

* address final comments

* fix import error

* fix tests

* remove unused imports

* fix tests

* fix task test

* remove unused return value

* remove unused special migrations code from migrate_clickhouse

* tweaks to support fresh deployments

* make instance first user staff

* fix import
paolodamico pushed a commit that referenced this issue Dec 15, 2021
* add special migration definition and example

* types

* special migrations runner

* fix tests

* fix tests 2

* add clickhouse runner

* add temp fix for tests

* wip

* add special migrations api (#7448)

* wip new structure

* update example sourcing

* Update .gitignore

* yet another wip structure

* code quality

* cypress

* test docker image build

* implement resumable ops

* code quality

* add comments

* add warning

* add conditional requirements for migration

* add comment on is_required

* add dependency map

* wip dependencies and run migration on startup

* code quality

* fix bugs

* fix more bugs

* format

* types

* remove api from this branch

* types

* types

* update clickhouse script

* add is_migration_in_range util

* fix type

* fix runner

* add AUTO_START_SPECIAL_MIGRATIONS env var

* reset migration on start

* cleanup

* wip per op rollback

* prevent accidental status rollback

* add utils and definition test

* update example with rollback per op

* wip test special migration

* add first runner tests

* add runner tests

* add util for code paths

* fix test

* fix types

* fix types again

* cleanup

* cleanup

* add periodic healthcheck task tests

* remove unused imports

* safer row updates

* fix coalescing none checks

* code quality

* add docstrings

* fix

* fix deploys issue

* update scripts

* add delay

* address reviews

* address review comments

* address review comments

* address final comments

* fix import error

* fix tests

* remove unused imports

* fix tests

* fix task test

* remove unused return value

* remove unused special migrations code from migrate_clickhouse

* tweaks to support fresh deployments
paolodamico pushed a commit that referenced this issue Dec 15, 2021
* add special migration definition and example

* types

* special migrations runner

* fix tests

* fix tests 2

* add clickhouse runner

* add temp fix for tests

* wip

* add special migrations api (#7448)

* wip new structure

* update example sourcing

* Update .gitignore

* yet another wip structure

* code quality

* cypress

* test docker image build

* implement resumable ops

* code quality

* add comments

* add warning

* add conditional requirements for migration

* add comment on is_required

* add dependency map

* wip dependencies and run migration on startup

* code quality

* fix bugs

* fix more bugs

* format

* types

* remove api from this branch

* types

* types

* update clickhouse script

* add is_migration_in_range util

* fix type

* Add special migrations API

* fix api

* update api with new columns

* fix runner

* add AUTO_START_SPECIAL_MIGRATIONS env var

* reset migration on start

* Special migrations UI (#7054 pt. 6) (#7493)

* update UI with new cols

* fix UI

* new UI statuses

* cleanup

* wip per op rollback

* add refresh button

* wip tests

* prevent accidental status rollback

* finish api tests

* Update bin/tests

* add utils and definition test

* update example with rollback per op

* wip test special migration

* add first runner tests

* add runner tests

* add util for code paths

* fix test

* fix types

* fix types again

* cleanup

* cleanup

* add periodic healthcheck task tests

* remove unused imports

* safer row updates

* fix coalescing none checks

* code quality

* add handling for non-staff users

* add docstrings

* fix

* fix deploys issue

* update scripts

* add delay

* address reviews

* address review comments

* address review comments

* address final comments

* fix import error

* fix tests

* remove unused imports

* fix tests

* fix task test

* remove unused return value

* remove unused special migrations code from migrate_clickhouse

* tweaks to support fresh deployments

* make instance first user staff
paolodamico pushed a commit that referenced this issue Dec 15, 2021
* add special migration definition and example

* types

* special migrations runner

* fix tests

* fix tests 2

* add clickhouse runner

* add temp fix for tests

* wip

* add special migrations api (#7448)

* wip new structure

* update example sourcing

* Update .gitignore

* yet another wip structure

* code quality

* cypress

* test docker image build

* implement resumable ops

* code quality

* add comments

* add warning

* add conditional requirements for migration

* add comment on is_required

* add dependency map

* wip dependencies and run migration on startup

* code quality

* fix bugs

* fix more bugs

* format

* types

* remove api from this branch

* types

* types

* update clickhouse script

* add is_migration_in_range util

* fix type

* Add special migrations API

* fix api

* update api with new columns

* fix runner

* add AUTO_START_SPECIAL_MIGRATIONS env var

* reset migration on start

* Special migrations UI (#7054 pt. 6) (#7493)

* update UI with new cols

* fix UI

* new UI statuses

* cleanup

* wip per op rollback

* add refresh button

* wip tests

* prevent accidental status rollback

* finish api tests

* Update bin/tests

* add utils and definition test

* update example with rollback per op

* wip test special migration

* add first runner tests

* add runner tests

* add util for code paths

* fix test

* fix types

* fix types again

* cleanup

* cleanup

* add periodic healthcheck task tests

* remove unused imports

* safer row updates

* fix coalescing none checks

* code quality

* add handling for non-staff users

* Add special migrations to instance status

* add docstrings

* fix

* fix deploys issue

* update scripts

* add delay

* address reviews

* address review comments

* address review comments

* address final comments

* fix import error

* fix tests

* remove unused imports

* fix tests

* fix task test

* remove unused return value

* remove unused special migrations code from migrate_clickhouse

* tweaks to support fresh deployments

* make instance first user staff

* fix import
@yakkomajuri
Copy link
Contributor Author

More to-dos:

  • CLI (still not done)
  • Better UX
  • Post-check: a check for after the migration runs to really ensure it did what it should (e.g. old_table_events_count == new_table_events_count) causing a rollback if it fails

@tiina303
Copy link
Contributor

Discussed during the platform meeting on wednesday: Prior to releasing to everyone:

  • precheck to verify kafka is set up reasonably (have 3 days of retention)
  • monitoring on ingestion not working

Thought about the status part a bit & proposal:

  • if a migration is in an errored state we show the red exclamation point, similar to plugin server down, (potentially maybe we should make this be a banner on top for self hosted folks for all of the ! ones to increase the visibility)?
  • if we have migrations that could be ran we'd show the yellow one similar to newer posthog version being available
  • while a migration is running that should be shown to the user somehow too as during the migration things could be a bit off, e.g. events migration means we don't have newest events

Btw while we're user testing we should probably default to turning off auto rollbacks initially

@yakkomajuri
Copy link
Contributor Author

yakkomajuri commented Dec 16, 2021

@tiina303 both of your points are very specific to the events ingestion. I'd rather keep this thread to aspects relating to the system itself.

@yakkomajuri
Copy link
Contributor Author

Cloud-specific async migrations

@yakkomajuri
Copy link
Contributor Author

Run is_required earlier, hide from UI

@yakkomajuri
Copy link
Contributor Author

Manually update progress on refresh

@yakkomajuri
Copy link
Contributor Author

on complete hook, run is_required, rethink deps

@yakkomajuri
Copy link
Contributor Author

As an overall task, this is done. We can open smaller tickets for issues that come about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request epic P0 Critical, breaking issue (page crash, missing functionality) team/infra Everything related to deploying PostHog
Projects
None yet
Development

No branches or pull requests

2 participants