Effectively monitor batch processing times. #156

mariusandra · 2021-02-16T11:19:48Z

This Kafka eachBatch is going to be an important unit of work. Since the next eachBatch runs only once the previous one has finished and been committed, the duration of the slowest plugin will determine how quickly batches get processed.

For 500 events, if 1 event takes 30sec to process (e.g. some long await fetch) and the other 499 take a combined 1sec, this instance of the plugin server will be sitting idle for 29 seconds out of 30.
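A minimal sketch of the arithmetic above, assuming the slow event only waits on I/O while the remaining events are processed in the meantime. The function and field names are made up for illustration and are not part of the plugin server.

```typescript
// Hypothetical helper, not part of the plugin server.
interface BatchEstimate {
    wallClockSec: number // how long the whole batch blocks the next eachBatch
    idleSec: number // time spent only waiting for the slow event
    idleFraction: number // idleSec / wallClockSec
}

// Assumption: the slow event awaits I/O while the rest of the batch is processed,
// so the batch lasts as long as the longer of the two.
function estimateIdleTime(slowEventSec: number, restCombinedSec: number): BatchEstimate {
    const wallClockSec = Math.max(slowEventSec, restCombinedSec)
    const idleSec = wallClockSec - restCombinedSec
    const idleFraction = wallClockSec === 0 ? 0 : idleSec / wallClockSec
    return { wallClockSec, idleSec, idleFraction }
}

const estimate = estimateIdleTime(30, 1)
console.log(`${estimate.idleSec}s idle out of ${estimate.wallClockSec}s`) // 29s idle out of 30s
```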

We need a way to reliably monitor, detect and alert about slow batch processing times. The console logs as shown in #154 are not enough.

PR #155 is about enforcing timeouts when we do encounter slow plugins, to ensure some level of throughput, and is perhaps a prerequisite for starting this work.
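For illustration only, one common way to enforce such a timeout is to race the plugin's promise against a timer. This is a sketch under that assumption and not necessarily what PR #155 does; `runWithTimeout` is a hypothetical name.

```typescript
// Hypothetical helper: rejects if `task` does not settle within `timeoutMs`.
// Note that the underlying task is not cancelled, only abandoned.
async function runWithTimeout<T>(task: () => Promise<T>, timeoutMs: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => {
            const error = new Error(`Timed out after ${timeoutMs} ms`)
            error.name = 'TimeoutError'
            reject(error)
        }, timeoutMs)
    })
    try {
        // Whichever promise settles first decides the outcome.
        return await Promise.race([task(), timeout])
    } finally {
        // Always clear the timer so it cannot keep the process alive.
        if (timer !== undefined) {
            clearTimeout(timer)
        }
    }
}
```

A caller would wrap each plugin invocation, catch the `TimeoutError`, and move on to the next event so that one slow plugin cannot hold up the whole batch.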

To be specced out in another issue, but related to the above:

There is only one true way around this, and that is to convert both Kafka and Celery streams into some other, buffered, sequential, persistent and insanely fast queues that operate on the edge of the larger pipe. Possibly backed by Postgres? This buffer would be responsible for keeping track of where each of the 500 messages that arrive in one Kafka batch is in the processing pipeline. If the server crashes, it could pick up where it left off. Is Postgres fast enough for this, or what else could we use?


Twixes · 2021-02-17T10:23:09Z

Hm, as for the paragraph under the divider: Postgres could work, though it'd kind of clog up if we didn't delete the entries of fully processed events. And I'm not sure about Postgres's delete performance at scale.

macobo · 2021-03-19T08:24:29Z

For the monitoring part of the equation, I think statsd timers + tags for individual plugins will do the trick.

We're already using tagging for tracking individual and cumulative query times via the grafana dashboard: https://metrics.posthog.com/d/h_MvYE8Gk/plugin-server-internal-metrics?orgId=1&from=now-30m&to=now
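A rough sketch of what per-plugin timers with tags could look like. The `TimingClient` interface, the in-memory client, the metric name and the `plugin` tag are all assumptions made for this example; a real statsd client would take the place of `InMemoryTimingClient`.

```typescript
type Tags = Record<string, string>

// Hypothetical minimal interface; real statsd clients offer something similar.
interface TimingClient {
    timing(stat: string, ms: number, tags: Tags): void
}

// Stand-in client that just records what would have been sent.
class InMemoryTimingClient implements TimingClient {
    readonly recorded: Array<{ stat: string; ms: number; tags: Tags }> = []

    timing(stat: string, ms: number, tags: Tags): void {
        this.recorded.push({ stat, ms, tags })
    }
}

// Times one plugin run and reports it tagged with the plugin name,
// whether the run succeeds or throws.
async function timePlugin<T>(
    client: TimingClient,
    pluginName: string,
    run: () => Promise<T>,
    now: () => number = Date.now
): Promise<T> {
    const startedAt = now()
    try {
        return await run()
    } finally {
        client.timing('plugin.process_event', now() - startedAt, { plugin: pluginName })
    }
}
```

Tagging by plugin keeps one metric name while still letting a dashboard break timings down per plugin or sum them across all plugins.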

That said, making sure we're not bottlenecked by user code is also super important here.

mariusandra · 2021-04-29T13:51:32Z

We're using a lot of timeoutGuards with various statsd metrics to monitor slow paths. Seems to work well enough, so closing this.
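As a sketch of the general idea, assuming a guard is simply a timer that reports when a code path has not finished in time: `slowPathGuard` and its `report` callback are hypothetical, and the actual timeoutGuard in the plugin server may look different.

```typescript
// Hypothetical guard: calls `report` if it is not cleared within `timeoutMs`.
// It only observes; it does not interrupt the guarded work.
function slowPathGuard(
    message: string,
    timeoutMs: number,
    report: (message: string) => void
): ReturnType<typeof setTimeout> {
    return setTimeout(() => report(message), timeoutMs)
}

// Wraps a piece of async work with a guard and clears it once the work settles.
async function withSlowPathGuard<T>(
    message: string,
    timeoutMs: number,
    report: (message: string) => void,
    work: () => Promise<T>
): Promise<T> {
    const guard = slowPathGuard(message, timeoutMs, report)
    try {
        return await work()
    } finally {
        clearTimeout(guard)
    }
}
```

The `report` callback is where a statsd counter or a log line would go, so slow paths show up in metrics without changing how the work itself runs.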

mariusandra mentioned this issue Feb 26, 2021

Release 1.23.0 – 15 March 2021 PostHog/posthog#3423

Closed

macobo mentioned this issue Mar 19, 2021

Send timing information w/ team_id to statsd #265

Merged

2 tasks

mariusandra mentioned this issue Mar 22, 2021

Plugin audit log on events #269

Closed

mariusandra closed this as completed Apr 29, 2021

mariusandra mentioned this issue Nov 3, 2021

Levels of Isolation PostHog/posthog#6888

Closed
