
Levels of Isolation #6888

Closed
mariusandra opened this issue Mar 2, 2021 · 3 comments

mariusandra commented Mar 2, 2021

If we want to enable full plugin support (including writing your own plugins), we should discuss levels of isolation in the plugin server.

Right now, for EE, the plugin ingestion pipeline (simplified) looks like this:

[Image: simplified diagram of the plugin ingestion pipeline]

EDIT 2022-02-11: the middle orange await is now gone --> events are ingested directly after they are processed.

Each task to run plugins (green box) or ingest events (pink box) is spread across all the workers (CPU cores) and parallelizes well. What's not so great is that, due to how Kafka batch consumption works, we must fully ingest one batch of events before starting another. This means a slow or broken plugin delays the entire ingestion pipeline.

Since plugins currently have a 30 sec timeout (could be shortened), and there can be many such 30-second plugins (green boxes) running one after another on an event, the entire batch could be delayed a lot. Worse, if the entire batch is not ingested within a minute (the Kafka sessionTimeout), it's considered lost and some other worker will retry it... so everything will probably come crashing down 😅

On cloud, it will be easy for someone who is just playing around to accidentally create a plugin that delays the entire ingestion by many minutes – for example, by fetch-ing some slow resource on every event and chaining a few such plugins one after another.

Immediate actions from this:

  • Make sure all the plugins on an event together finish within $timeout (= 30 sec), not just each individual plugin (see the sketch after this list)
  • Possibly increase the Kafka batch session timeout beyond 60 sec
  • Possibly decrease the plugin/ingestion timeout below 30 sec
  • Make a plugin coding & testing area so you don't run half-finished plugins on real data
  • Run ingestion right after plugins instead of waiting for all plugins (remove the first orange box)
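
On the first point, here is a minimal TypeScript sketch of the idea (not the actual plugin-server code; the names and the shared 30-second budget are illustrative): run the whole plugin chain against one shared deadline, so each plugin only gets whatever time is left rather than its own 30 seconds.

```typescript
// Sketch only: a shared deadline for the whole plugin chain of one event.
type PluginEvent = Record<string, any>
type PluginFn = (event: PluginEvent) => Promise<PluginEvent | null>

const CHAIN_TIMEOUT_MS = 30_000 // the $timeout discussed above

async function runPluginChain(event: PluginEvent, plugins: PluginFn[]): Promise<PluginEvent | null> {
    const deadline = Date.now() + CHAIN_TIMEOUT_MS
    let current: PluginEvent | null = event
    for (const plugin of plugins) {
        if (!current) {
            break // a previous plugin dropped the event
        }
        const remainingMs = deadline - Date.now()
        if (remainingMs <= 0) {
            throw new Error('Plugin chain exceeded its shared timeout')
        }
        // Race each plugin against whatever is left of the shared budget.
        current = await Promise.race([
            plugin(current),
            new Promise<never>((_, reject) =>
                setTimeout(() => reject(new Error('Plugin chain exceeded its shared timeout')), remainingMs)
            ),
        ])
    }
    return current
}
```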

However, even with these changes in place, it would still be wise, on cloud, to separate the plugins running for different teams from one another. There are some things we can do, each with different tradeoffs:

  1. Separate Kafka topics per team or some other built-in grouping mechanism (Ensure that a team's events always end up in the same Kafka partition #3303). I don't know how complicated this can get, but at the end of the day, if it's in the same Kafka topic, it's in the same batch, unless we implement some buffering (see under the line in this comment). This could be something like 1) a separate topic for EE Cloud clients, maybe even one per client, 2) another topic for regular paying cloud users, 3) another topic for free users.

  2. VM level isolation. This is what we do now. The plugins can't read each other's memory, but they are all loaded and running at the same time in memory... and if one crashes fatally, it can take down all the others.

  3. Thread level isolation. We have separate worker pools for plugins of different teams. Basically we divide the CPUs of all running plugin servers with some clever routing mechanism that distributes tasks of different teams onto different cores. It's complicated even to describe this, much less implement it.

  4. Process level isolation. We run multiple plugin servers, each with a different KAFKA_CONSUMPTION_TOPIC env var set, and each does its own job. Perhaps another env var limits the teams whose plugins are started (see the sketch below).
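
A minimal sketch of option 4, assuming the existing KAFKA_CONSUMPTION_TOPIC env var mentioned above and a hypothetical PLUGIN_SERVER_TEAM_IDS env var for the team limit:

```typescript
// Sketch only: each plugin-server process reads its topic and an optional team
// allowlist from the environment. PLUGIN_SERVER_TEAM_IDS is hypothetical.
const consumptionTopic = process.env.KAFKA_CONSUMPTION_TOPIC ?? 'events_plugin_ingestion' // fallback name for illustration
const allowedTeamIds = (process.env.PLUGIN_SERVER_TEAM_IDS ?? '')
    .split(',')
    .filter(Boolean)
    .map(Number)

// Should this process load and run a given team's plugins?
function shouldRunPluginsForTeam(teamId: number): boolean {
    // An empty allowlist means "all teams" in this sketch.
    return allowedTeamIds.length === 0 || allowedTeamIds.includes(teamId)
}

console.info(`plugin-server consuming from ${consumptionTopic}`, { allowedTeamIds })
```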


The easiest way forward today is to create three different Kafka event ingestion topics (e.g. priority 1, 2, 3) and divide events between them in capture.py... and then run separate plugin servers.
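
To illustrate the split (the real change would live in capture.py, in Python; the topic names and the tier field here are made up):

```typescript
// Illustration only: route each event to a Kafka topic based on the team's tier.
type Tier = 'ee' | 'paid' | 'free'

const TOPIC_BY_TIER: Record<Tier, string> = {
    ee: 'events_plugin_ingestion_priority_1',
    paid: 'events_plugin_ingestion_priority_2',
    free: 'events_plugin_ingestion_priority_3',
}

function ingestionTopicForTeam(team: { id: number; tier: Tier }): string {
    return TOPIC_BY_TIER[team.tier]
}
```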

There's nothing to protect against some EE customer creating a plugin that brings down other EE customers whose events go through the same topic... 🤔


Twixes commented Mar 3, 2021

As for isolation solutions:

  1. I'd love Kafka-partition-per-team, meaning a team's events always end up in a specific partition, but we'd have to do some serious load balancing across plugin server instances somehow to avoid hammering specific instances too much. Though even then, a S*****ily-like customer will throw the balance off a lot.
  2. VM2 is doing a good job already, but I'm afraid it's not isolated enough – as said, there's a collateral damage risk. V8 Isolates seem much more solid in this regard; that's the stuff used to let Algolia Crawler or Cloudflare Workers run arbitrary JS at scale (good read by CF: https://blog.cloudflare.com/cloud-computing-without-containers). This library seems like a solid and common interface for Isolates: https://github.com/laverdet/isolated-vm (see the sketch after this list). Mind you, this is definitely a lower-level solution than VM2.
  3. Doesn't seem all that beneficial with something like Isolates.
  4. I'm all for at least two topics: free and paid customers. It's the easiest thing to implement and it reduces the blast radius right away.
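
To make point 2 concrete, here is a minimal sketch of what running plugin code via isolated-vm could look like; the memory limit and timeout are illustrative, not a real implementation:

```typescript
import ivm from 'isolated-vm' // npm install isolated-vm

// Sketch only: run untrusted plugin code in its own V8 isolate with a memory cap
// and a hard wall-clock timeout, so a runaway plugin can't stall the whole batch.
async function runInIsolate(pluginCode: string): Promise<unknown> {
    const isolate = new ivm.Isolate({ memoryLimit: 128 }) // MB, illustrative
    try {
        const context = await isolate.createContext()
        const script = await isolate.compileScript(pluginCode)
        // `timeout` aborts execution after 1000 ms of wall time.
        return await script.run(context, { timeout: 1000 })
    } finally {
        isolate.dispose()
    }
}

// Example: the script's completion value comes back as long as it's transferable.
runInIsolate('let sum = 0; for (let i = 0; i < 1e6; i++) { sum += i }; sum')
    .then((result) => console.log('result from isolate:', result))
    .catch((error) => console.error('isolate failed or timed out:', error))
```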

@mariusandra

I originally developed this server with isolated-vm, but bailed because I realised it's really, really complicated to support calling functions outside of the isolate and passing data across the boundary. We could revisit it some day, but at the very least we'd need to build some kind of proxying mechanism for functions in shared libraries like cloud.google.bigquery or even posthog.capture. VM2 took care of so many of the things I would have had to develop myself. Might still be worth revisiting at some point though.
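
For context, this is roughly the kind of boundary-crossing plumbing that would be needed: a rough sketch using ivm.Reference with a stand-in for posthog.capture (all names here are illustrative, not an actual proposal):

```typescript
import ivm from 'isolated-vm'

// Sketch only: exposing a host-side function (a stand-in for posthog.capture) to
// code running inside an isolate. Every call and every argument must be marshalled
// across the boundary explicitly – the kind of work VM2 did for us out of the box.
async function runWithCaptureProxy(pluginCode: string): Promise<void> {
    const isolate = new ivm.Isolate({ memoryLimit: 128 })
    const context = await isolate.createContext()

    // Host-side implementation; only transferable values (strings, numbers, ...) can cross.
    const capture = (eventName: string) => console.log('captured:', eventName)
    await context.global.set('__capture', new ivm.Reference(capture))

    const script = await isolate.compileScript(pluginCode)
    await script.run(context, { timeout: 1000 })
    isolate.dispose()
}

// Inside the isolate the plugin only sees a Reference, so calls look like this
// (or we'd have to generate a friendlier shim around it):
runWithCaptureProxy(`__capture.applySync(undefined, ['event from inside the isolate'])`).catch(console.error)
```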

@tiina303 transferred this issue from PostHog/plugin-server Nov 3, 2021
@posthog-bot

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.
