
Levels of Isolation #6888

Closed
mariusandra opened this issue Mar 2, 2021 · 3 comments

mariusandra commented Mar 2, 2021

If we want to enable full plugin support (including writing your own plugins), we should discuss levels of isolation in the plugin server.

Right now, for EE, the plugin ingestion pipeline (simplified) looks like this:

[Image: simplified diagram of the plugin ingestion pipeline]

EDIT 2022-02-11: the middle orange await is now gone --> events are ingested directly after they are processed.

Each task to run plugins (green box) or ingest events (pink box) is spread across all the workers (CPU cores) and parallelizes well. What's not so great is that, due to how Kafka batch consumption works, we must fully ingest one batch of events before starting another. This means a slow or broken plugin delays the entire ingestion pipeline.

Since plugins currently have a 30 sec timeout (could be shortened), and there can be many such 30-second plugins (green boxes) running one after another on an event, the entire batch could be delayed a lot. Worse, if the entire batch is not ingested within a minute (the Kafka sessionTimeout), it's considered lost and some other worker will retry it... so everything will probably come crashing down 😅

On cloud, it will be easy for someone who is just playing around to accidentally create a plugin that delays the entire ingestion by many minutes – for example, by fetch-ing some slow resource on every event and chaining a few such plugins one after another.

Immediate actions from this:

  • Make sure all the plugins on an event together finish within $timeout (= 30 sec), not just each individual plugin (see the sketch after this list)
  • Possibly increase the Kafka batch session timeout beyond 60 sec
  • Possibly decrease the plugin/ingestion timeout below 30 sec
  • Make a plugin coding & testing area so you don't run half-finished plugins on real data
  • Run ingestion right after plugins instead of waiting for all plugins (remove the first orange box)
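
On the first point, here is a minimal TypeScript sketch of the idea (not the actual plugin-server code; the names and the shared 30-second budget are illustrative): run the whole plugin chain against one shared deadline, so each plugin only gets whatever time is left rather than its own 30 seconds.

```typescript
// Sketch only: a shared deadline for the whole plugin chain of one event.
type PluginEvent = Record<string, any>
type PluginFn = (event: PluginEvent) => Promise<PluginEvent | null>

const CHAIN_TIMEOUT_MS = 30_000 // the $timeout discussed above

async function runPluginChain(event: PluginEvent, plugins: PluginFn[]): Promise<PluginEvent | null> {
    const deadline = Date.now() + CHAIN_TIMEOUT_MS
    let current: PluginEvent | null = event
    for (const plugin of plugins) {
        if (!current) {
            break // a previous plugin dropped the event
        }
        const remainingMs = deadline - Date.now()
        if (remainingMs <= 0) {
            throw new Error('Plugin chain exceeded its shared timeout')
        }
        // Race each plugin against whatever is left of the shared budget.
        current = await Promise.race([
            plugin(current),
            new Promise<never>((_, reject) =>
                setTimeout(() => reject(new Error('Plugin chain exceeded its shared timeout')), remainingMs)
            ),
        ])
    }
    return current
}
```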

However, even with these changes in place, it would still be wise, on cloud, to separate the plugins running for different teams from one another. There are some things we can do, each with different tradeoffs:

  1. Separate Kafka topics per team or some other built-in grouping mechanism (Ensure that a team's events always end up in the same Kafka partition #3303). I don't know how complicated this can get, but at the end of the day, if it's in the same Kafka topic, it's in the same batch, unless we implement some buffering (see under the line in this comment). This could be something like 1) a separate topic for EE Cloud clients, maybe even one per client, 2) another topic for regular paying cloud users, 3) another topic for free users.

  2. VM level isolation. This is what we do now. The plugins can't read each other's memory, but they are all loaded and running at the same time in memory... and if one crashes fatally, it can take down all the others.

  3. Thread level isolation. We have separate worker pools for plugins of different teams. Basically we divide the CPUs of all running plugin servers with some clever routing mechanism that distributes tasks of different teams onto different cores. It's complicated even to describe this, much less implement it.

  4. Process level isolation. We run multiple plugin servers, each with a different KAFKA_CONSUMPTION_TOPIC env var set, and each does its own job. Perhaps another env var limits the teams whose plugins are started (see the sketch below).
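
A minimal sketch of option 4, assuming the existing KAFKA_CONSUMPTION_TOPIC env var mentioned above and a hypothetical PLUGIN_SERVER_TEAM_IDS env var for the team limit:

```typescript
// Sketch only: each plugin-server process reads its topic and an optional team
// allowlist from the environment. PLUGIN_SERVER_TEAM_IDS is hypothetical.
const consumptionTopic = process.env.KAFKA_CONSUMPTION_TOPIC ?? 'events_plugin_ingestion' // fallback name for illustration
const allowedTeamIds = (process.env.PLUGIN_SERVER_TEAM_IDS ?? '')
    .split(',')
    .filter(Boolean)
    .map(Number)

// Should this process load and run a given team's plugins?
function shouldRunPluginsForTeam(teamId: number): boolean {
    // An empty allowlist means "all teams" in this sketch.
    return allowedTeamIds.length === 0 || allowedTeamIds.includes(teamId)
}

console.info(`plugin-server consuming from ${consumptionTopic}`, { allowedTeamIds })
```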


The easiest way forward today is to create three different Kafka event ingestion topics (e.g. priority 1, 2, 3) and divide events between them in capture.py... and then run separate plugin servers.
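
To illustrate the split (the real change would live in capture.py, in Python; the topic names and the tier field here are made up):

```typescript
// Illustration only: route each event to a Kafka topic based on the team's tier.
type Tier = 'ee' | 'paid' | 'free'

const TOPIC_BY_TIER: Record<Tier, string> = {
    ee: 'events_plugin_ingestion_priority_1',
    paid: 'events_plugin_ingestion_priority_2',
    free: 'events_plugin_ingestion_priority_3',
}

function ingestionTopicForTeam(team: { id: number; tier: Tier }): string {
    return TOPIC_BY_TIER[team.tier]
}
```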

There's nothing to protect against some EE customer creating a plugin that brings down other EE customers whose events go through the same topic... 🤔


Twixes commented Mar 3, 2021

As for isolation solutions:

  1. I'd love Kafka-partition-per-team, meaning a team's events always end up in a specific partition, but we'd have to do some serious load balancing across plugin server instances somehow to avoid hammering specific instances too much. Though even then, a S*****ily-like customer will throw the balance off a lot.
  2. VM2 is doing a good job already, but I'm afraid it's not isolated enough – as said, there's a collateral damage risk. V8 Isolates seem much more solid in this regard; that's the stuff used to let Algolia Crawler or Cloudflare Workers run arbitrary JS at scale (good read by CF: https://blog.cloudflare.com/cloud-computing-without-containers). This library seems like a solid and common interface for Isolates: https://github.com/laverdet/isolated-vm (see the sketch after this list). Mind you, this is definitely a lower-level solution than VM2.
  3. Doesn't seem all that beneficial with something like Isolates.
  4. I'm all for at least two topics: free and paid customers. It's the easiest thing to implement and it reduces the blast radius right away.
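
To make point 2 concrete, here is a minimal sketch of what running plugin code via isolated-vm could look like; the memory limit and timeout are illustrative, not a real implementation:

```typescript
import ivm from 'isolated-vm' // npm install isolated-vm

// Sketch only: run untrusted plugin code in its own V8 isolate with a memory cap
// and a hard wall-clock timeout, so a runaway plugin can't stall the whole batch.
async function runInIsolate(pluginCode: string): Promise<unknown> {
    const isolate = new ivm.Isolate({ memoryLimit: 128 }) // MB, illustrative
    try {
        const context = await isolate.createContext()
        const script = await isolate.compileScript(pluginCode)
        // `timeout` aborts execution after 1000 ms of wall time.
        return await script.run(context, { timeout: 1000 })
    } finally {
        isolate.dispose()
    }
}

// Example: the script's completion value comes back as long as it's transferable.
runInIsolate('let sum = 0; for (let i = 0; i < 1e6; i++) { sum += i }; sum')
    .then((result) => console.log('result from isolate:', result))
    .catch((error) => console.error('isolate failed or timed out:', error))
```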

@mariusandra

I originally developed this server with isolated-vm, but bailed because I realised it's really, really complicated to support calling functions outside of the isolate and passing data across the boundary. We could revisit it some day, but at the very least we'd need to build some kind of proxying mechanism for functions in shared libraries like cloud.google.bigquery or even posthog.capture. VM2 took care of so many of the things I would have had to develop myself. Might still be worth revisiting at some point though.
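
For context, this is roughly the kind of boundary-crossing plumbing that would be needed: a rough sketch using ivm.Reference with a stand-in for posthog.capture (all names here are illustrative, not an actual proposal):

```typescript
import ivm from 'isolated-vm'

// Sketch only: exposing a host-side function (a stand-in for posthog.capture) to
// code running inside an isolate. Every call and every argument must be marshalled
// across the boundary explicitly – the kind of work VM2 did for us out of the box.
async function runWithCaptureProxy(pluginCode: string): Promise<void> {
    const isolate = new ivm.Isolate({ memoryLimit: 128 })
    const context = await isolate.createContext()

    // Host-side implementation; only transferable values (strings, numbers, ...) can cross.
    const capture = (eventName: string) => console.log('captured:', eventName)
    await context.global.set('__capture', new ivm.Reference(capture))

    const script = await isolate.compileScript(pluginCode)
    await script.run(context, { timeout: 1000 })
    isolate.dispose()
}

// Inside the isolate the plugin only sees a Reference, so calls look like this
// (or we'd have to generate a friendlier shim around it):
runWithCaptureProxy(`__capture.applySync(undefined, ['event from inside the isolate'])`).catch(console.error)
```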

@tiina303 transferred this issue from PostHog/plugin-server Nov 3, 2021
@posthog-bot

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.
