Adding a threaded parsing option for statsD input plugin #6922
Conversation
plugins/inputs/statsd/statsd.go (Outdated)

    MaxParserThreads int `toml:"max_parser_threads"`
Let's not expose this option, since it is so tied to an implementation detail. Let's just pick a good default number and always start this many. How about 3?
I'm not sure I agree with that. It's definitely tied to an implementation detail, but I think there is value in allowing the level of concurrency to be configurable. For instance, if I choose to run telegraf on the same machine as my app, I may want to limit its footprint. But if I set up a dedicated instance to act as a metrics aggregator, being able to spend more CPU to increase throughput (and reduce the number of servers I need to use) could be very useful.
Looking at the existing telegraf.conf, it seems as if there are a few places where this is configurable (but maybe we should rename it?)
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L6124
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L373
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L4961
Maybe there's some way we could expose a level of concurrency as an option without tying it directly to the number of goroutines?
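(For concreteness, here is roughly how the option under discussion would surface in the config. This is only an illustration based on the `max_parser_threads` tag in the diff above; it is not a setting that exists in the released plugin.)

```toml
[[inputs.statsd]]
  service_address = ":8125"

  ## Hypothetical option from this PR: number of goroutines used to parse
  ## incoming packets. Commented out here because it was never merged.
  # max_parser_threads = 5
```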
I'm not confident enough yet that this is the right solution; at the very least we would need to do better performance testing before I'd be willing to add it. I'd also like to discuss this with @reimda before we commit to any new options.
    time.Sleep(time.Millisecond)
    for len(listener.in) > 0 {
        fmt.Printf("Left in buffer: %v \n", len(listener.in))
        time.Sleep(time.Millisecond)
Sorry about the current state of the tests for this plugin, which encourages this, but let's not add any new tests or benchmarks that require sleep.
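(As an aside on the sleep-free alternative: the usual pattern in Go tests is to block on an explicit completion signal with a timeout rather than polling. The sketch below is generic and self-contained; `in` is just a stand-in for the listener's channel, and none of this is the plugin's actual test code.)

```go
package statsd_test

import (
	"testing"
	"time"
)

// TestDrainWithoutSleep shows one sleep-free pattern: the consumer goroutine
// closes `done` once it has drained the channel, and the test blocks on that
// signal (with a timeout) instead of polling with time.Sleep.
func TestDrainWithoutSleep(t *testing.T) {
	in := make(chan []byte, 100) // stand-in for listener.in
	for i := 0; i < 100; i++ {
		in <- []byte("cpu:1|c")
	}
	close(in) // producer signals that no more input is coming

	done := make(chan struct{})
	go func() {
		defer close(done)
		for range in { // consume every buffered line
		}
	}()

	select {
	case <-done:
		// all buffered lines were consumed without any polling
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting for the channel to drain")
	}
}
```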
Here is a bit of testing I did. Until timestamp 07:10 it is running with this branch and 3 goroutines; after that it is running with the current code (master with #6921). The extra goroutines are reducing the dropped packets, but for some reason there is a slight increase in UDP errors at the system level and a moderate decrease in the total number of packets captured. The number of processed packets is about the same. Could we be slowing the read goroutine by having more readers of the channel? If so, why is the system udp_inerrors only slightly decreased?
I'm not sure I'm seeing the decrease in packets captured. From your image, it looks like the green line is the total number of packets received, which remains relatively consistent across both sides of the 07:10 timestamp. It's just the rate that drops, which makes sense given how that rate is calculated.

I agree that the udp_inerrors seem to be coming in at a higher rate with the goroutine changes, based on your graph, but I'm unsure why. I'll do some looking around as well, but do you have any insight into what kinds of things cause these errors? (It looks like they come in regularly anyway. Is the error profile the same or different between the two runs?) I'll also try to get something spun up to check this out myself.
@danielnelson I'd love to revisit this and see if we can get to a resolution. (Sorry for the long delay; I've been on and off vacation for the last month or so.) Did you get a chance to discuss with @reimda? We are running a forked version with these changes in a pre-production environment and seeing good results (up to ~120k-130k TPM for a single telegraf instance).
I don't want to add any new options to the plugin interface, but if you update the PR to use a fixed number of reader goroutines, probably 3 or so, then we can merge.
Fair enough. I updated the PR to use 5 (I tend to think the slightly higher number gains us a bit more benefit; we tested up to 1000 without hitting CPU issues), but I can bump it down to 3 if you feel strongly.
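(For readers skimming the thread, the change being discussed boils down to a small, fixed-size pool of goroutines all draining the plugin's input channel. The sketch below is illustrative only; the struct fields and method names are assumptions, not the merged diff.)

```go
package statsd

import "sync"

// Minimal stand-in for the plugin struct; only the fields this sketch needs.
type Statsd struct {
	in chan []byte
	wg sync.WaitGroup
}

// parserGoroutines is the fixed pool size being debated above (3 vs. 5).
const parserGoroutines = 5

// startParsers launches a fixed-size pool of goroutines that all drain the
// shared input channel until it is closed.
func (s *Statsd) startParsers() {
	for i := 0; i < parserGoroutines; i++ {
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			for packet := range s.in {
				s.parsePacket(packet)
			}
		}()
	}
}

// parsePacket is a placeholder for the plugin's real line parsing.
func (s *Statsd) parsePacket(packet []byte) {
	_ = packet // parsing logic omitted in this sketch
}
```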