This repository has been archived by the owner on Nov 4, 2021. It is now read-only.

Lazy VMs / Improve reloads #59

Closed
mariusandra opened this issue Dec 10, 2020 · 3 comments · Fixed by #234

@mariusandra
Collaborator

Currently when reloading, we:

  • shut down celery
  • wait 2 seconds
  • shut down all workers
  • restart all workers (they will reload all plugins)
  • restart celery

Downtime is at least 2 seconds, but in practice more like 3-4 seconds, and even more if you have hundreds of plugins.

Ideally we should:

  • only restart plugins that changed
  • restart them live (keep the old one running while the new one restarts) and abort reload if there's an init error
  • rolling restart between threads, not all at once
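
To illustrate the "restart them live" idea, here is a minimal sketch, not the plugin server's actual code. `PluginSlot`, `RunningPlugin` and `demoReload` are hypothetical names made up for this example. The replacement is started first, and the old instance is only torn down once the new one has initialized without error.

```typescript
// Hypothetical minimal plugin instance; the real VM object looks different.
interface RunningPlugin {
    version: string
    teardown: () => Promise<void>
}

class PluginSlot {
    private current: RunningPlugin | null = null

    active(): RunningPlugin | null {
        return this.current
    }

    // Start the replacement first and swap only if its init succeeded.
    async reload(start: () => Promise<RunningPlugin>): Promise<boolean> {
        let next: RunningPlugin
        try {
            next = await start()
        } catch {
            // Init error: abort the reload and keep the old instance running.
            return false
        }
        const previous = this.current
        this.current = next
        if (previous) {
            await previous.teardown()
        }
        return true
    }
}

async function demoReload() {
    const tornDown: string[] = []
    const make = (version: string): RunningPlugin => ({
        version,
        teardown: async () => {
            tornDown.push(version)
        },
    })
    const slot = new PluginSlot()
    await slot.reload(async () => make('v1'))
    // A broken update must not take down the running v1.
    const failedReloadApplied = await slot.reload(async () => {
        throw new Error('init failed')
    })
    const afterFailure = slot.active()?.version ?? null
    await slot.reload(async () => make('v2'))
    const afterSuccess = slot.active()?.version ?? null
    return { failedReloadApplied, afterFailure, afterSuccess, tornDown }
}
```

In this sketch events keep flowing to the old instance while `start()` runs, so the only cost of a bad update is a failed reload.
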
@mariusandra
Collaborator Author

Also: add a test for reloads. When running on the ingestion-save branch, reloads killed the server for me. This does not happen on master.

@mariusandra
Collaborator Author

Speccing this out a bit.

Why?

When the plugin server reloads, event ingestion stops for about half a minute while we basically restart the entire server. These reloads happen whenever anyone changes anything for any plugin in any team (installing, enabling, disabling, etc).

In Grafana it looks like this:

[image: Grafana screenshot]

This will become a problem if we allow anyone on cloud to use plugins, as any change in any team will bring down the server for everyone for up to a minute.

What?

When one team updates their plugins, event ingestion for all other teams should not be affected. Moreover, only the plugins that changed (added/deleted/updated) should be reloaded, instead of destroying and then restarting the entire worker pool.
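
As a rough sketch of what "only the plugins that changed" could mean in code: compare the configs a worker has loaded with what the database returns now. `PluginConfigRow`, its `updatedAt` field and `diffPluginConfigs` are assumptions made for this example, not the actual schema or API.

```typescript
// Hypothetical shape: the real pluginConfig rows carry more fields.
interface PluginConfigRow {
    id: number
    teamId: number
    updatedAt: string // assumed to change whenever the plugin or its config is edited
}

interface ConfigDiff {
    added: number[]
    removed: number[]
    updated: number[]
}

// Compare what this worker has loaded against what the database says now.
function diffPluginConfigs(
    loaded: Map<number, PluginConfigRow>,
    fromDb: Map<number, PluginConfigRow>
): ConfigDiff {
    const diff: ConfigDiff = { added: [], removed: [], updated: [] }
    fromDb.forEach((row, id) => {
        const current = loaded.get(id)
        if (!current) {
            diff.added.push(id)
        } else if (current.updatedAt !== row.updatedAt) {
            diff.updated.push(id)
        }
    })
    loaded.forEach((_row, id) => {
        if (!fromDb.has(id)) {
            diff.removed.push(id)
        }
    })
    return diff
}

const loaded = new Map<number, PluginConfigRow>([
    [1, { id: 1, teamId: 1, updatedAt: '2020-12-01' }],
    [2, { id: 2, teamId: 1, updatedAt: '2020-12-01' }],
    [3, { id: 3, teamId: 2, updatedAt: '2020-12-01' }],
])
const fromDb = new Map<number, PluginConfigRow>([
    [1, { id: 1, teamId: 1, updatedAt: '2020-12-01' }],
    [2, { id: 2, teamId: 1, updatedAt: '2020-12-10' }], // edited
    [4, { id: 4, teamId: 3, updatedAt: '2020-12-10' }], // newly installed
])

const diff = diffPluginConfigs(loaded, fromDb)
console.log(diff) // { added: [ 4 ], removed: [ 3 ], updated: [ 2 ] }
```

Plugin config 1 is untouched here, so team 1's unchanged plugin would keep running through the reload.
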

How?

We need to make the following changes:

  1. Required: System to "broadcast" tasks to all piscina worker threads. Currently, if you call piscina.runTask, it chooses one worker thread to run the task on. We need to tell all worker threads that they should reload the relevant VMs. This probably requires submitting a PR to the piscina repository. Alternatively, we could have an additional Redis pubsub subscription (+1 connection per thread) inside each worker that listens for the reload event.

  2. Required: System of "Lazy VMs". Right now we wait for all VMs to finish their setup before we start ingesting events. With this "Lazy VM" system, we would just do something like vm.methods.processEvent = (event) => vm.launch().then(launchedVm => launchedVm.methods.processEvent(event)). If the VM has been launched, it'll just run processEvent as normal. If it hasn't been launched, processing the event will just take a few moments longer. Work on this has already started in Lazy VMs #220.

  3. Required: System to diff the active/loaded pluginConfigs with what's now in the database, and then to add, recreate or destroy the VMs that need changes (just delete the existing promise, so that vm.launch() recreates it). We could send metadata of the changed plugin (or just the team) with the reload event, but it's more accurate to read everything from the db and make up our own mind as to what changed. This gives us eventual consistency: e.g. what if we somehow miss one reload event?

  4. Required: We need a system to reload the schedule in the main thread when a reloaded plugin changes its scheduling capabilities (e.g. changing runEveryDay to runEveryHour).

  5. Useful to have: System to record plugin capabilities. Without fully loading the plugin, we don't know if it has a runEveryMinute function or a processEvent function. Thus we don't know whether to add it to the scheduler or the processing pipeline without fully initializing it. We can still init all plugins on server startup to get this information, and then just have them be "lazy" when reloading. However, a better option is to capture the plugin's capabilities when it's installed. Then we can already add it to the schedule and, for example, init it only once per day when it needs to work. Eventually we might even split scheduled plugins into their own piscina worker pool.
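
The "Lazy VM" idea from point 2, together with the "delete the existing promise" trick from point 3, could look roughly like the sketch below. It is not the code from #220. `LazyVm`, `LaunchedVm`, `PluginEvent`, `reset` and `demoLazyVm` are hypothetical names for this example.

```typescript
// Hypothetical types: the real plugin VM exposes more than processEvent.
type PluginEvent = { event: string; properties: Record<string, unknown> }

interface LaunchedVm {
    methods: { processEvent: (event: PluginEvent) => Promise<PluginEvent> }
}

class LazyVm {
    private launched: Promise<LaunchedVm> | null = null
    private createVm: () => Promise<LaunchedVm>

    constructor(createVm: () => Promise<LaunchedVm>) {
        this.createVm = createVm
    }

    // Launch at most once; concurrent callers share the same promise.
    launch(): Promise<LaunchedVm> {
        if (!this.launched) {
            this.launched = this.createVm()
        }
        return this.launched
    }

    // Drop the cached promise so that the next launch() recreates the VM.
    reset(): void {
        this.launched = null
    }

    // The first event pays for the setup; later events reuse the launched VM.
    async processEvent(event: PluginEvent): Promise<PluginEvent> {
        const vm = await this.launch()
        return vm.methods.processEvent(event)
    }
}

async function demoLazyVm() {
    let launches = 0
    const lazyVm = new LazyVm(async () => {
        launches += 1
        return { methods: { processEvent: async (event: PluginEvent) => event } }
    })
    await lazyVm.processEvent({ event: 'pageview', properties: {} })
    await lazyVm.processEvent({ event: 'click', properties: {} })
    const launchesAfterTwoEvents = launches
    lazyVm.reset() // e.g. after the diff says this plugin changed
    await lazyVm.processEvent({ event: 'pageview', properties: {} })
    return { launchesAfterTwoEvents, launchesAfterReset: launches }
}
```

Note that a failed launch stays cached in this sketch. A real version would probably want to clear the promise on error so that the next event retries the setup.
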

@mariusandra
Collaborator Author

I created a PR to address the first point here: piscinajs/piscina#113

@macobo macobo mentioned this issue Mar 9, 2021
2 tasks
@macobo macobo assigned macobo and unassigned mariusandra Mar 10, 2021