This repository has been archived by the owner on Nov 4, 2021. It is now read-only.

Effectively monitor batch processing times. #156

Closed
mariusandra opened this issue Feb 16, 2021 · 3 comments

Comments

@mariusandra
Collaborator

This Kafka eachBatch is going to be an important unit of work. Since the next eachBatch runs only after the previous one has finished and been committed, the duration of the slowest plugin determines how quickly batches get processed.

For 500 events, if 1 event takes 30sec to process (e.g. some long await fetch) and the other 499 take a combined 1sec, this instance of the plugin server will be sitting idle for 29 seconds out of 30.
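To make the arithmetic above concrete, here is a minimal sketch. It assumes the slow event runs concurrently with the rest of the batch and that the batch is committed only once every event is done. `batchTiming` is a hypothetical helper for illustration, not something from the plugin server.

```typescript
// Hypothetical model of one eachBatch run: the slowest event overlaps with
// the remaining work, and the batch commits only when everything is done.
interface BatchTiming {
    wallSec: number // how long the batch blocks the next eachBatch
    idleSec: number // time spent only waiting on the slowest event
    eventsPerSec: number // effective throughput of this instance
}

function batchTiming(eventCount: number, slowestEventSec: number, restCombinedSec: number): BatchTiming {
    // The batch lasts as long as whichever finishes last.
    const wallSec = Math.max(slowestEventSec, restCombinedSec)
    // Once the rest is done, the instance just waits for the slow event.
    const idleSec = Math.max(0, slowestEventSec - restCombinedSec)
    return { wallSec, idleSec, eventsPerSec: eventCount / wallSec }
}

// The example from this issue: 500 events, one takes 30 sec, the other 499 take 1 sec combined.
const timing = batchTiming(500, 30, 1)
console.log(timing.wallSec, timing.idleSec) // 30 29
```

Under this model the same 500 events without the slow one would clear in about 1 second, so a single slow plugin call cuts throughput by roughly a factor of 30.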

We need a way to reliably monitor, detect and alert about slow batch processing times. The console logs as shown in #154 are not enough.

PR #155 is about enforcing timeouts when we do encounter slow plugins, to ensure some level of throughput, and is perhaps a prerequisite for starting this work.


To be specced out in another issue, but related to the above:

There is only one true way around this, and that is to convert both the Kafka and Celery streams into some other kind of buffered, sequential, persistent and insanely fast queue that operates on the edge of the larger pipe. Possibly backed by Postgres? This buffer would be responsible for keeping track of where each of the 500 messages that arrive in one Kafka batch is in the processing pipeline. If the server crashes, it could pick up where it left off. Is Postgres fast enough for this, or what else could we use?

@Twixes
Member

Twixes commented Feb 17, 2021

Hm, as for the paragraph under the divider: Postgres could work, though it'd kind of clog if we didn't delete entries of fully processed events. And I'm not sure about Postgres's delete performance at scale.

@macobo
Contributor

macobo commented Mar 19, 2021

For the monitoring part of the equation, I think using statsd timers + tags for individual plugins will do the trick.
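A rough sketch of what per-plugin timers with tags could look like. Everything here is an assumption for illustration: `TimingSink` stands in for whatever statsd client is in use, `InMemorySink` only exists so the example runs without a statsd daemon, and the metric name `plugin.process_event` and the `plugin` tag are made up.

```typescript
type Tags = Record<string, string>

interface TimingSample {
    metric: string
    ms: number
    tags: Tags
}

// Hypothetical stand-in for a statsd client's timing call.
interface TimingSink {
    timing(metric: string, ms: number, tags: Tags): void
}

// Collects samples in memory instead of sending them anywhere.
class InMemorySink implements TimingSink {
    readonly samples: TimingSample[] = []
    timing(metric: string, ms: number, tags: Tags): void {
        this.samples.push({ metric, ms, tags })
    }
}

// Starts a timer and returns a stop function that reports the elapsed time.
// The clock is injectable so the behaviour can be checked deterministically.
function startTimer(sink: TimingSink, metric: string, tags: Tags, now: () => number = Date.now): () => number {
    const startedAt = now()
    return () => {
        const elapsedMs = now() - startedAt
        sink.timing(metric, elapsedMs, tags)
        return elapsedMs
    }
}

// Finds the slowest recorded sample per value of one tag, e.g. per plugin.
function slowestByTag(samples: TimingSample[], tagKey: string): Map<string, number> {
    const slowest = new Map<string, number>()
    for (const sample of samples) {
        const tagValue = sample.tags[tagKey] ?? 'unknown'
        slowest.set(tagValue, Math.max(slowest.get(tagValue) ?? 0, sample.ms))
    }
    return slowest
}
```

In async plugin code the stop function would typically be called in a `finally` block, so that failed runs are timed as well. Tagging by plugin is what lets a dashboard or alert single out the one plugin that holds up a batch.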

We're already using tagging to track individual and cumulative query times via the Grafana dashboard: https://metrics.posthog.com/d/h_MvYE8Gk/plugin-server-internal-metrics?orgId=1&from=now-30m&to=now

That said, making sure we're not bottlenecked by user code is also super important here.

@mariusandra
Collaborator Author

We're using a lot of timeoutGuards with various statsd metrics to monitor slow paths. Seems to work well enough, so closing this.
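For readers unfamiliar with the pattern, here is a minimal sketch of a timeout guard. It is an illustration under assumptions, not the plugin server's actual `timeoutGuard`: the signature, the `onSlow` callback and the `withTimeoutGuard` wrapper are hypothetical. The guard does not cancel the slow work, it only reports that the work is taking longer than expected.

```typescript
// Schedules a report if the guarded work has not finished within timeoutMs.
// Hypothetical signature; in practice onSlow could log or bump a statsd counter.
function timeoutGuard(
    message: string,
    timeoutMs: number,
    onSlow: (message: string) => void = console.warn
): ReturnType<typeof setTimeout> {
    return setTimeout(() => onSlow(message), timeoutMs)
}

// Runs fn under a guard and always clears the guard afterwards,
// so fast paths never trigger a report.
async function withTimeoutGuard<T>(
    message: string,
    timeoutMs: number,
    fn: () => Promise<T>,
    onSlow: (message: string) => void = console.warn
): Promise<T> {
    const handle = timeoutGuard(message, timeoutMs, onSlow)
    try {
        return await fn()
    } finally {
        clearTimeout(handle)
    }
}
```

A call might look like `await withTimeoutGuard('Still running eachBatch', 10000, () => processBatch(batch))`, where `processBatch` is a placeholder for the guarded work.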


3 participants