
Adding a threaded parsing option for statsD input plugin #6922

Merged

Conversation

josephpeacock (Contributor) commented Jan 18, 2020

Required for all PRs:

  • [x] Signed CLA.
  • [x] Associated README.md updated.
  • [x] Has appropriate unit tests.

Comment on lines 132 to 133

MaxParserThreads int `toml:"max_parser_threads"`
Contributor

Let's not expose this option, since it is so tied to an implementation detail. Let's just pick a good default number and always start this many. How about 3?
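As a rough illustration of the fixed-pool approach suggested here (a sketch only; the type, field, and function names are placeholders, not the plugin's actual identifiers), the listener could start a small, constant number of parser goroutines that all drain the same buffered channel the read goroutine writes to:

```go
package main

import "sync"

// statsdListener is a simplified stand-in for the plugin's listener type.
type statsdListener struct {
	in   chan []byte    // raw packets queued by the read goroutine
	wg   sync.WaitGroup // tracks the parser goroutines
	done chan struct{}  // closed on Stop to unblock the parsers
}

// parserGoRoutines is the fixed number of parser workers; the review
// suggests picking a small constant instead of a user-facing option.
const parserGoRoutines = 3

func (s *statsdListener) startParsers() {
	for i := 0; i < parserGoRoutines; i++ {
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			s.parser()
		}()
	}
}

func (s *statsdListener) parser() {
	for {
		select {
		case <-s.done:
			return
		case packet := <-s.in:
			s.parseStatsdLine(string(packet)) // hypothetical parse step
		}
	}
}

func (s *statsdListener) parseStatsdLine(line string) {
	// placeholder for the real statsd line parsing
	_ = line
}

func main() {}
```

Whatever the constant ends up being, the trade-off is the one discussed in the reply below: fewer goroutines keep the footprint small on a shared host, while more goroutines buy parse throughput on a dedicated aggregator.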

Contributor Author

I'm not sure I agree with that. It's definitely tied to an implementation detail, but I think there is value in allowing the level of concurrency to be configurable. For instance, if I choose to run telegraf on the same machine as my app, I may want to limit its footprint. But if I set up a dedicated instance to act as a metrics aggregator, being able to spend more CPU to increase throughput (and reduce the number of servers I need to use) could be very useful.

Looking at the existing telegraf.conf, it seems as if there are a few places where this is configurable (but maybe we should rename it?)

https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L6124
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L373
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L4961

Maybe there's some way we could expose a level of concurrency as an option without tying it directly to the number of goroutines?
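One way to do that (again only a sketch, with a hypothetical `max_concurrent_parses` setting; none of this is code from the PR) is to bound the number of in-flight parses with a counting semaphore rather than pinning the option to a count of long-lived goroutines:

```go
package main

// A buffered channel used as a counting semaphore bounds concurrency
// without hard-coding the number of long-lived goroutines.
type parserPool struct {
	tokens chan struct{}
}

// newParserPool sizes the semaphore from a config value such as a
// hypothetical max_concurrent_parses setting, falling back to a default.
func newParserPool(maxConcurrent int) *parserPool {
	if maxConcurrent <= 0 {
		maxConcurrent = 3 // sensible default
	}
	return &parserPool{tokens: make(chan struct{}, maxConcurrent)}
}

// parse runs fn with at most maxConcurrent invocations in flight.
func (p *parserPool) parse(fn func()) {
	p.tokens <- struct{}{} // acquire a token
	go func() {
		defer func() { <-p.tokens }() // release it when done
		fn()
	}()
}

func main() {
	pool := newParserPool(0)
	pool.parse(func() { /* parse one packet */ })
}
```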

Contributor

I'm not confident enough yet that this is the right solution; at the very least, we would need to do better performance testing before I'd be willing to add it. I'd also like to discuss this with @reimda before we commit to any new options.

time.Sleep(time.Millisecond)
for len(listener.in) > 0 {
fmt.Printf("Left in buffer: %v \n", len(listener.in))
time.Sleep(time.Millisecond)
Contributor

Sorry about the current state of the tests for this plugin, which encourages this, but let's not add any new tests/benchmarks that require sleep.
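For reference, a sleep-free way to wait in a test or benchmark (a sketch with hypothetical names, not code from this PR) is to block on a WaitGroup that the parser side releases, guarded by a timeout so a regression fails fast instead of hanging:

```go
package statsd

import (
	"sync"
	"testing"
	"time"
)

// waitForParsed blocks until wg completes or the timeout expires,
// replacing polling loops built around time.Sleep.
func waitForParsed(t *testing.T, wg *sync.WaitGroup, timeout time.Duration) {
	t.Helper()
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(timeout):
		t.Fatal("timed out waiting for parser goroutines to drain the buffer")
	}
}
```

The benchmark would call wg.Add(1) for each packet it queues and have the parser call wg.Done() after handling one, so completion is observed directly rather than by watching the buffer length between sleeps.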

@danielnelson added this to the 1.14.0 milestone Jan 22, 2020
@danielnelson added the area/statsd and feat (improvement on an existing feature, such as adding a new setting/mode to an existing plugin) labels Jan 22, 2020
@danielnelson (Contributor)

Here is a bit of testing I did. Until timestamp 07:10 it is running with this branch and 3 goroutines; after this timestamp it is running with the current code (master with #6921).

The extra goroutines are reducing the dropped packets, but for some reason there is a slight increase in udp errors at the system level and a moderate decrease in the total number of packets captured. The number of processed packets is about the same. Could we be slowing the read goroutine by having more readers of the channel? If so, why is the system udp_inerrors only slightly decreased?

[screenshot: 2020-01-21-232501_2560x1379_scrot]

@josephpeacock (Contributor Author)

I'm not sure I'm seeing the decrease in packets captured. From your image, it looks like the green line is the total number of packets received, which remains relatively consistent across both sides of the 07:10 timestamp. It's just the rate that drops, but that makes sense: the rate is calculated as packets_received - packets_dropped, and packet drops rose dramatically after the switch.

I agree that the udp_inerrors seem to be coming in at a higher rate with the goroutine changes, based on your graph, but I'm unsure why. I'll do some looking around as well, but do you have any insight into what types of things cause these errors? (It looks like they come in regularly anyway. Is the error profile the same or different between the two runs?)

I'll also try to get something spun up to check this out myself.

@sjwang90 removed this from the 1.14.0 milestone Feb 21, 2020
@josephpeacock (Contributor Author)

@danielnelson I'd love to revisit this and see if we can get a resolution. (Sorry for the long delay; I've been on and off vacation for the last month or so.) Did you get a chance to discuss with @reimda?

We are running a forked version with these changes in a pre-production environment and having good results. (Up to ~120k-130k TPM for a single telegraf instance).

@danielnelson (Contributor)

I don't want to add any new options to the plugin interface, but if you will update with a fixed number of reader goroutines, probably 3 or so, then we can merge.

@josephpeacock (Contributor Author)

Fair enough. I updated the PR with 5 (I tend to think the slightly higher number gains us a bit more benefit. We tested up to 1000 without hitting CPU issues), but I can bump it down to 3 if you feel strongly.

@danielnelson added this to the 1.14.0 milestone Mar 3, 2020
@danielnelson merged commit ab8438d into influxdata:master Mar 3, 2020
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
veorlo pushed a commit to GlobalNOC/telegraf that referenced this pull request May 4, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023