Lazy VMs / Improve reloads #59

mariusandra · 2020-12-10T09:10:10Z

Currently when reloading, we:

shut down celery
wait 2 seconds
shut down all workers
restart all workers (they will reload all plugins)
restart celery

Downtime is at least 2 seconds, but in practice more like 3-4 seconds, and even more if you have hundreds of plugins.

Ideally we should:

only restart plugins that changed
restart them live (keep the old one running while the new one restarts) and abort reload if there's an init error
rolling restart between threads, not all at once
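
A minimal sketch of the last two points, assuming VM init is an async function that rejects on an init error. The names `Vm`, `PluginSlot` and `rollingReload` are hypothetical and not taken from the plugin server code; each slot stands in for one worker thread's copy of a plugin.

```typescript
// Hypothetical stand-in for a plugin VM; not the plugin server's real type.
interface Vm {
    version: string
    processEvent(event: string): string
}

// Holds the VM that is currently serving events for one plugin in one thread.
class PluginSlot {
    private current: Vm

    constructor(initial: Vm) {
        this.current = initial
    }

    active(): Vm {
        return this.current
    }

    // Initialise the replacement while the old VM keeps serving.
    // Swap only if init succeeds; on an init error the old VM stays in place.
    async reload(init: () => Promise<Vm>): Promise<boolean> {
        try {
            const next = await init()
            this.current = next
            return true
        } catch {
            return false
        }
    }
}

// Reload one slot (thread) at a time and stop at the first failure,
// so the remaining slots keep running the old VM.
async function rollingReload(slots: PluginSlot[], init: () => Promise<Vm>): Promise<boolean> {
    for (const slot of slots) {
        const ok = await slot.reload(init)
        if (!ok) {
            return false
        }
    }
    return true
}
```

Because the swap is a single assignment after `init()` resolves, events that arrive during the reload are still handled by the old VM.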

mariusandra · 2020-12-10T12:17:54Z

Also: add a test for reloads. When running in the ingestion-save branch, reloads killed the server for me. Not so in master.

mariusandra · 2021-02-26T13:25:36Z

Speccing this out a bit.

Why?

When the plugin server reloads, event ingestion stops for about half a minute while we basically restart the entire server. These reloads happen whenever anyone changes anything for any plugin in any team (installing, enabling, disabling, etc).

In Grafana it looks like this: (screenshot not included)

This will become a problem if we allow anyone on cloud to use plugins, as any change in any team will bring down the server for everyone for up to a minute.

What?

When one team updates their plugins, event ingestion for all other teams should not be affected. Moreover, only the plugins that changed (added/deleted/updated) should be reloaded, instead of destroying and then restarting the entire worker pool.
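
To make "only the plugins that changed" concrete, here is a rough sketch of the comparison. It assumes each plugin config carries some change marker; the `updatedAt` field, the `PluginConfig` shape and the function name are all hypothetical, not the actual schema.

```typescript
// Hypothetical shape; `updatedAt` is an assumed change marker, not a confirmed column.
interface PluginConfig {
    id: number
    teamId: number
    updatedAt: string
}

interface PluginConfigDiff {
    added: number[]
    changed: number[]
    removed: number[]
}

// Compare what a worker has loaded with what the database says right now.
function diffPluginConfigs(
    loaded: Map<number, PluginConfig>,
    fromDb: Map<number, PluginConfig>
): PluginConfigDiff {
    const diff: PluginConfigDiff = { added: [], changed: [], removed: [] }
    fromDb.forEach((config, id) => {
        const current = loaded.get(id)
        if (!current) {
            diff.added.push(id) // new in the database: create a VM
        } else if (current.updatedAt !== config.updatedAt) {
            diff.changed.push(id) // modified: recreate the VM
        }
    })
    loaded.forEach((_config, id) => {
        if (!fromDb.has(id)) {
            diff.removed.push(id) // gone from the database: destroy the VM
        }
    })
    return diff
}
```

Configs that appear in none of the three lists are left alone, so teams that changed nothing keep their running VMs.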

How?

We need to make the following changes:

Required: System to "broadcast" tasks to all piscina worker threads. Currently, if you call piscina.runTask, it picks a single worker thread to run the task on. We need to tell all worker threads that they should reload the relevant VMs. This probably requires submitting a PR to the piscina repository. Alternatively, we could add a Redis pubsub subscription (+1 connection per thread) inside each worker that listens for the reload event.
Required: System of "Lazy VMs". Right now we wait for all VMs to finish their setup before we start ingesting events. With this "Lazy VM" system, we would just do something like vm.methods.processEvent = (event) => vm.launch().then(launchedVm => launchedVm.methods.processEvent(event)). If the VM has already been launched, it'll just run processEvent as normal. If it hasn't been launched, processing the event will just take a few moments longer. Work on this has already started in Lazy VMs #220.
Required: System to diff active/loaded pluginConfigs with what's now in the database... and then to add, recreate or destroy the VMs that need changes (just delete the existing promise, so that vm.launch() recreates it). We could send metadata of the changed plugin (or just the team) with the reload event, but it's more accurate to read everything from the db and make up our own mind as to what changed (eventual consistency --> e.g. what if we somehow miss one reload event?)
Required: We need a system to reload the schedule in the main thread when a reloaded plugin changes its scheduling capabilities (e.g. changing runEveryDay to runEveryHour).
Useful to have: System to record plugin capabilities. Without fully loading the plugin, we don't know if it has a runEveryMinute function or a processEvent function. Thus we don't know if we should add it to the scheduler or the processing pipeline without fully initializing it. We can still init all plugins on server startup to get this information, and then just have them be "lazy" when reloading. However a better option is to capture the plugin's capabilities when it's installed. Then we can already add it to the schedule and e.g. init only once per day when it needs to work. Eventually we might even split scheduled plugins into their own piscina worker pool.
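
A rough sketch of the "Lazy VM" idea from the list above, including the "delete the existing promise" trick for reloads. `LazyVM`, `VmMethods`, `PluginEvent` and `reset` are hypothetical names chosen for illustration, and the `setup` callback stands in for whatever actually builds the VM.

```typescript
// Hypothetical shapes for illustration only.
interface PluginEvent {
    event: string
    properties: Record<string, unknown>
}

interface VmMethods {
    processEvent(event: PluginEvent): Promise<PluginEvent | null>
}

class LazyVM {
    private setup: () => Promise<VmMethods>
    private launched: Promise<VmMethods> | null = null

    constructor(setup: () => Promise<VmMethods>) {
        this.setup = setup
    }

    // Start setup on first use and hand out the same promise afterwards.
    launch(): Promise<VmMethods> {
        if (this.launched === null) {
            this.launched = this.setup()
        }
        return this.launched
    }

    // Drop the cached promise so that the next launch() builds a fresh VM.
    reset(): void {
        this.launched = null
    }

    // Events only wait for setup if the VM has not finished launching yet.
    processEvent(event: PluginEvent): Promise<PluginEvent | null> {
        return this.launch().then((methods) => methods.processEvent(event))
    }
}
```

Note that in this sketch a rejected setup promise stays cached until `reset()` is called, so error handling around a failed launch would still need to be decided.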

mariusandra · 2021-03-01T14:49:26Z

I created a PR to address the first point here: piscinajs/piscina#113

mariusandra mentioned this issue Jan 15, 2021: Release 1.21.0 – 1 February 2021 PostHog/posthog#2948 (Closed)

mariusandra mentioned this issue Jan 28, 2021: Release 1.22.0 – 15 February 2021 PostHog/posthog#2985 (Closed)

mariusandra mentioned this issue Feb 11, 2021: Release 1.22.0 bis – 1 March 2021 PostHog/posthog#3287 (Closed)

mariusandra self-assigned this Feb 25, 2021

mariusandra changed the title from "Improve reloads" to "Lazy VMs / Improve reloads" Feb 25, 2021

mariusandra mentioned this issue Feb 26, 2021: Release 1.23.0 – 15 March 2021 PostHog/posthog#3423 (Closed)

mariusandra mentioned this issue Mar 1, 2021: Broadcast to workers piscinajs/piscina#113 (Closed)

macobo mentioned this issue Mar 9, 2021: Speed up boot times #234 (Merged)

macobo assigned macobo and unassigned mariusandra Mar 10, 2021

Twixes closed this as completed in #234 Mar 16, 2021
