Effectively monitor batch processing times. #156
Comments
Hm, as for the paragraph under the divider: Postgres could work, though it'd kind of clog up if we didn't delete the entries of fully processed events. And I'm not sure about Postgres's delete performance at that scale.
For the monitoring part, I think using statsd timers + tags for individual plugins will do the trick. We're already using tagging to track individual and cumulative query times via the Grafana dashboard: https://metrics.posthog.com/d/h_MvYE8Gk/plugin-server-internal-metrics?orgId=1&from=now-30m&to=now That said, making sure we're not bottlenecked by user code is also super important here.
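Purely as a sketch of that idea, assuming a hot-shots-style StatsD client (the client setup, metric name and tag key below are illustrative, not the plugin server's actual code):

```typescript
import { StatsD } from 'hot-shots'

// Illustrative client; the real plugin server already has its own StatsD instance.
const statsd = new StatsD({ prefix: 'plugin_server.' })

// Time one plugin's run on one event and tag the metric with the plugin's name,
// so Grafana can chart both individual and cumulative plugin times.
async function runPluginTimed<T>(pluginName: string, run: () => Promise<T>): Promise<T> {
    const start = Date.now()
    try {
        return await run()
    } finally {
        statsd.timing('plugin.process_event_ms', Date.now() - start, 1, { plugin: pluginName })
    }
}
```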
We're using a lot of
This Kafka `eachBatch` is going to be an important unit of work. Since the next `eachBatch` runs only when the last one has finished and is committed, the length of the slowest plugin will determine how quickly batches get processed.

For 500 events, if 1 event takes 30 sec to process (e.g. some long `await fetch`) and the other 499 take a combined 1 sec, this instance of the plugin server will be sitting idle for 29 seconds out of 30.

We need a way to reliably monitor, detect and alert about slow batch processing times. The console logs as shown in #154 are not enough.
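A rough sketch of where such a measurement could hook in, assuming a kafkajs-style `eachBatch` consumer (the topic name, `processEvent` and the `statsd` client are placeholders, not the actual plugin-server code):

```typescript
import { StatsD } from 'hot-shots'
import { Kafka, KafkaMessage } from 'kafkajs'

const statsd = new StatsD({ prefix: 'plugin_server.' }) // illustrative client
const kafka = new Kafka({ clientId: 'plugin-server', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'plugin-server' })

// Placeholder for running every plugin on one event; not the real plugin-server function.
declare function processEvent(message: KafkaMessage): Promise<void>

async function startConsumer(): Promise<void> {
    await consumer.connect()
    await consumer.subscribe({ topic: 'events_ingestion' }) // placeholder topic name
    await consumer.run({
        eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
            const batchStart = Date.now()
            for (const message of batch.messages) {
                await processEvent(message) // one slow plugin here stalls the entire batch
                resolveOffset(message.offset)
                await heartbeat()
            }
            // Report the batch duration as a metric that can be graphed and alerted on,
            // instead of relying on console logs (#154).
            statsd.timing('kafka.each_batch_ms', Date.now() - batchStart)
        },
    })
}
```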
PR #155 is about enforcing timeouts when we do encounter slow plugins, to ensure some level of throughput, and it is perhaps a prerequisite for starting this work.
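Not necessarily how #155 implements it, but one common way to enforce such a timeout is to race the plugin call against a timer (the 30-second cap and error message below are illustrative):

```typescript
// Generic timeout wrapper: rejects if the wrapped promise doesn't settle within `ms`.
async function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
    let timer: NodeJS.Timeout | undefined
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms)
    })
    try {
        return await Promise.race([promise, timeout])
    } finally {
        if (timer) {
            clearTimeout(timer)
        }
    }
}

// Usage: cap a single plugin's processEvent at 30 seconds.
// await withTimeout(plugin.processEvent(event), 30_000, `plugin ${plugin.name}`)
```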
To be specced out in another issue, but related to the above:
There is only one true way around this, and that is to convert both the Kafka and Celery streams into some other buffered, sequential, persistent and insanely fast queue that operates at the edge of the larger pipe. Possibly backed by Postgres? This buffer would be responsible for keeping track of where each of the 500 messages that arrive in one Kafka batch is in the processing pipeline. If the server crashes, it could pick up where it left off. Is Postgres fast enough for this, or what else could we use?
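Purely as an illustration of what such a buffer might track, via `pg` (the table name, columns and statuses below are made up for this sketch, not a spec):

```typescript
import { Pool } from 'pg'

const pool = new Pool() // connection settings from the usual PG* env vars

// One row per message pulled off Kafka/Celery; `status` tracks where it is in the pipeline.
// Fully processed rows would have to be deleted (or partitioned away) quickly, which is
// exactly the delete-performance concern raised in the comment above.
async function setupBuffer(): Promise<void> {
    await pool.query(`
        CREATE TABLE IF NOT EXISTS event_buffer (
            id BIGSERIAL PRIMARY KEY,
            kafka_topic TEXT NOT NULL,
            kafka_partition INT NOT NULL,
            kafka_offset BIGINT NOT NULL,
            payload JSONB NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending', -- 'pending' | 'processing' | 'done'
            updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
        )
    `)
}

// After a crash, anything not yet 'done' can be picked up where processing left off.
async function findUnfinished(): Promise<unknown[]> {
    const { rows } = await pool.query(
        `SELECT * FROM event_buffer WHERE status <> 'done' ORDER BY id`
    )
    return rows
}
```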