Levels of Isolation #6888
In case we want to enable full plugin support (including writing your own), we should have a discussion about levels of isolation in the plugin server.
Right now, for EE, the plugin ingestion pipeline (simplified) looks like this:

[pipeline diagram: Kafka batch → run plugins (green boxes) → await → ingest events (pink boxes)]
EDIT 2022-02-11: the middle orange await is now gone --> events are ingested directly after they are processed.
Each task to run plugins (green box) or ingest events (pink box) is spread out across all the workers (CPU cores) and has great parallelism. What's not so great is that, due to how Kafka wants us to work, we must fully ingest one batch of events before starting another. This means that if we have a slow or broken plugin, it delays the entire ingestion.
Since plugins currently have a 30 sec timeout (which could be shortened), and many 30 sec plugins (green boxes) can run one after another on an event, the entire batch could be delayed a lot. In fact, if the entire batch is not ingested within a minute (`sessionTimeout`), it's considered lost and some other worker will try again... thus everything will probably come crashing down 😅
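To make the batching constraint concrete, here is a minimal sketch of what such a loop looks like with kafkajs; `runPluginsOnEvent` and `ingestEvent` are hypothetical stand-ins, not the plugin server's real functions:

```typescript
import { Kafka } from 'kafkajs'

// Hypothetical stand-ins for the real pipeline steps (not the actual plugin server code).
async function runPluginsOnEvent(event: Record<string, any>): Promise<Record<string, any>> {
  // run every enabled plugin in sequence, each with its own timeout
  return event
}
async function ingestEvent(event: Record<string, any>): Promise<void> {
  // write the processed event to the database
}

const kafka = new Kafka({ clientId: 'plugin-server-sketch', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({
  groupId: 'plugin-ingestion',
  // If a batch takes longer than this, the broker assumes this consumer is dead
  // and hands the batch to another worker — the "everything comes crashing down" case.
  sessionTimeout: 60_000,
})

async function main(): Promise<void> {
  await consumer.connect()
  await consumer.subscribe({ topic: 'events_ingestion' })
  await consumer.run({
    eachBatch: async ({ batch, resolveOffset }) => {
      // The whole batch must be processed before the next one starts, so a single
      // slow plugin (e.g. a 30s fetch per event) delays every event behind it.
      for (const message of batch.messages) {
        const event = JSON.parse(message.value!.toString())
        const processed = await runPluginsOnEvent(event) // green box
        await ingestEvent(processed) // pink box
        resolveOffset(message.offset)
      }
    },
  })
}

main().catch(console.error)
```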
On cloud, it will be easy for someone who is just playing around to accidentally create a plugin that delays the entire ingestion by many minutes, for example by `fetch`-ing some slow resource on every event and having a few of these plugins run after one another.

Immediate actions from this:

- Apply the `$timeout` (= 30 sec) to the entire chain of plugins running on an event, not just one plugin (see the sketch below).
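A minimal sketch of what that shared budget could look like — the `PluginHandle` shape and function names here are illustrative assumptions, not the plugin server's actual API:

```typescript
// Hypothetical plugin shape; the real plugin server types differ.
interface PluginHandle {
  name: string
  processEvent: (event: Record<string, any>) => Promise<Record<string, any> | null>
}

// Reject if a single step runs past the time it has left in the shared budget.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  // (a real implementation would also clear the timer once the race settles)
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`${label} exceeded the remaining ${ms}ms budget`)), ms)
    ),
  ])
}

// Run the whole chain of plugins against one $timeout (e.g. 30s total),
// instead of giving every plugin its own 30s.
async function runPluginChain(
  event: Record<string, any>,
  plugins: PluginHandle[],
  totalBudgetMs = 30_000
): Promise<Record<string, any> | null> {
  const deadline = Date.now() + totalBudgetMs
  let current: Record<string, any> | null = event

  for (const plugin of plugins) {
    if (current === null) break // an earlier plugin dropped the event
    const remaining = deadline - Date.now()
    if (remaining <= 0) {
      throw new Error(`Plugin chain exceeded its ${totalBudgetMs}ms budget before ${plugin.name}`)
    }
    current = await withTimeout(plugin.processEvent(current), remaining, plugin.name)
  }
  return current
}
```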
However, even with these changes in place, it would still be wise, on cloud, to separate plugins running on different teams from one another. There are some things we can do, all with different tradeoffs:
- Separate Kafka topics per team, or some other built-in grouping mechanism (Ensure that a team's events always end up in the same Kafka partition #3303). I don't know how complicated this can get, but at the end of the day, if it's in the same Kafka topic, it's in the same batch, unless we implement some buffering (see under the line in this comment). This could be something like 1) a separate topic for EE Cloud clients, maybe even one per client, 2) another topic for regular paying cloud users, 3) another topic for free users.
- VM level isolation. This is what we do now. The plugins can't read each other's memory, but they are all loaded and running in memory at the same time... and if one crashes fatally, it can take down all the others.
- Thread level isolation. We have separate worker pools for plugins of different teams. Basically, we divide the CPUs of all running plugin servers with some clever routing mechanism that distributes tasks of different teams onto different cores. This is complicated even to describe, let alone implement.
- Process level isolation. We have multiple plugin servers running, each with a different `KAFKA_CONSUMPTION_TOPIC` env set, doing its job. Perhaps another env limits the teams whose plugins are started.

The easiest way forward today is to create three different Kafka event ingestion topics (e.g. priority 1, 2, 3), divide events between them in `capture.py`, and then run separate plugin servers (see the sketch below). There's nothing to protect against some EE customer creating a plugin that brings down other EE customers whose events go through the same topic... 🤔
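For what the process-level / separate-topic option could look like from the plugin server side, here is a rough sketch; `KAFKA_CONSUMPTION_TOPIC` is the env var mentioned above, while the priority topic names and the `PLUGIN_SERVER_TEAM_IDS` allowlist are made-up illustrations rather than existing settings:

```typescript
// Sketch of process-level isolation: several plugin server processes, each pinned to
// its own ingestion topic and (optionally) an allowlist of teams whose plugins it loads.
// PLUGIN_SERVER_TEAM_IDS is a hypothetical addition, not an existing setting.

interface PluginServerIsolationConfig {
  kafkaConsumptionTopic: string
  teamIdAllowlist: number[] | null // null = run plugins for every team on this topic
}

function readIsolationConfig(env: NodeJS.ProcessEnv = process.env): PluginServerIsolationConfig {
  return {
    kafkaConsumptionTopic: env.KAFKA_CONSUMPTION_TOPIC ?? 'events_ingestion_priority_3',
    teamIdAllowlist: env.PLUGIN_SERVER_TEAM_IDS
      ? env.PLUGIN_SERVER_TEAM_IDS.split(',').map((id) => parseInt(id, 10))
      : null,
  }
}

// Example deployment: three processes, one per priority topic, so a slow plugin on the
// "free users" topic can no longer delay ingestion for EE customers:
//   KAFKA_CONSUMPTION_TOPIC=events_ingestion_priority_1 node dist/index.js   # EE cloud
//   KAFKA_CONSUMPTION_TOPIC=events_ingestion_priority_2 node dist/index.js   # paying users
//   KAFKA_CONSUMPTION_TOPIC=events_ingestion_priority_3 node dist/index.js   # free users
```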