Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add internal event queuing and flushing #227

Merged
merged 2 commits into from
Jun 7, 2019

Conversation

claytono
Copy link
Contributor

@claytono claytono commented May 30, 2019

At high traffic levels, the locking around sending on channels can cause a large amount of blocking and CPU usage. This adds an event queue mechanism so that events are queued for short period of time, and flushed in batches to the main exporter goroutine periodically.

The default is is to flush every 1000 events, or every 200ms, whichever happens first.

@claytono claytono force-pushed the event-queuing branch 2 times, most recently from a9ff5d3 to d16e73e Compare May 30, 2019 11:56
@claytono claytono marked this pull request as ready for review May 30, 2019 12:04
@matthiasr
Copy link
Contributor

interesting, this somewhat relates to the discussion we had around measuring UDP packet drops here #196 (comment)

what would happen with the different listeners if the queue overflows?

@claytono
Copy link
Contributor Author

I skimmed through that a while back, and I like the idea in general, but I haven't actually read the PR. If nothing else, being able to have metrics for dropped packets I think has value. Another approach to getting those metrics might be to switch the front end listeners to do non-blocking sends on the event queue. That would allow dropping events when the queue is full and there could be a metric for that. That might be a simplest change than adding another stage to the processing pipeline.

I think the behavior regarding queue overflows stays mostly the same. The event queue that the front end listeners push events to now has a bigger buffer, and there is the small internal buffer, but eventually both of those will fill and it will block on trying to send on that channel.

@matthiasr
Copy link
Contributor

Another approach to getting those metrics might be to switch the front end listeners to do non-blocking sends on the event queue

That is my preferred approach as well, because it's platform independent. Does it make sense to add it here?

At high traffic levels, the locking around sending on channels can cause
a large amount of blocking and CPU usage.  These adds an event queue
mechanism so that events are queued for short period of time, and
flushed in batches to the main exporter goroutine periodically.

The default is is to flush every 1000 events, or every 200ms, whichever
happens first.

Signed-off-by: Clayton O'Neill <[email protected]>
@claytono
Copy link
Contributor Author

claytono commented Jun 4, 2019

If you're ok with it, I'd prefer to address that in a future PR. I'm not sure if I'll have the time to incorporate that into this PR in the short term.

Copy link
Contributor

@matthiasr matthiasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, my only concern is about visibility into what's happening with the queue. With only the flush counter, you can kinda sorta infer when it ran full because the rate goes over the baseline. I think it would be useful to get an earlier warning, and to be able to see when the queue is always nearly full, or there are short spikes that are over the batch size and could be handled in one batch with a slightly bigger queue size.

telemetry.go Outdated Show resolved Hide resolved
Co-Authored-By: Matthias Rampke <[email protected]>
Signed-off-by: Clayton O'Neill <[email protected]>
@claytono
Copy link
Contributor Author

claytono commented Jun 7, 2019

@matthiasr I think that makes sense. I'd also like to get that metric in place also, but I'm tied up with another project at the moment. Your comment about the metric naming makes sense and I've updated the PR to incorporate that.

@matthiasr matthiasr merged commit 5832aa9 into prometheus:master Jun 7, 2019
matthiasr pushed a commit that referenced this pull request Jun 13, 2019
and add the missing changelog entry for #227

Signed-off-by: Matthias Rampke <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants