
statsd processing slow after 1.5.3 #7070

Closed

stevenh opened this issue Feb 24, 2020 · 27 comments
Labels
area/statsd, performance (problems with decreased performance or enhancements that improve performance)

Comments

@stevenh

stevenh commented Feb 24, 2020

Relevant telegraf.conf:

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 10000
  metric_buffer_limit = 100000000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "20s"
  flush_jitter = "0s"
  debug = false
  quiet = false

[[outputs.influxdb]]
  database = "XXX"
  password = "XXX"
  urls = ["http://influx"]
  username = "XXX"

[[inputs.statsd]]
  allowed_pending_messages = 100000
  delete_counters = true
  delete_gauges = true
  delete_sets = true
  delete_timings = true
  metric_separator = "."
  percentile_limit = 1000
  percentiles = [90, 95, 99]
  service_address = ":8125"

System info:

FreeBSD 11
Telegraf later than 1.5.3; have tried 1.6.0, 1.8.0, 1.10.0, and 1.13.7 (latest)

Steps to reproduce:

  1. Send a significant volume of statsd traffic to telegraf on anything later than 1.5.3 (a hypothetical UDP load-generator sketch is shown below)
  2. Tail the log file or look at the stats in influx
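
For reproduction, a minimal UDP load generator along these lines can be used. This is a hypothetical sketch; the metric names, rate, and localhost address are assumptions, not taken from this report:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Assumes the statsd input is listening on UDP :8125 as configured above.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Send a continuous mix of gauges and counters; the metric names are made
	// up, and the loop can be throttled to approximate the target rate.
	for i := 0; ; i++ {
		msg := fmt.Sprintf("test.gauge.%d:%d|g\ntest.counter.%d:1|c\n", i%1000, i, i%1000)
		if _, err := conn.Write([]byte(msg)); err != nil {
			log.Fatal(err)
		}
	}
}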

Expected behavior:

Stats should be processed

Actual behavior:

Most stats are dropped due to slow processing, and the following appears in the logs:

2020-02-23T23:40:18Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-23T23:40:20Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-23T23:40:21Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)

Additional info:

After bisecting the commits all the way back to 1.5.3 the following commit is the cause:
1c0f63a

@stevenh
Author

stevenh commented Feb 24, 2020

profiles.zip
Attached is a profile from before and after this patch for investigation.

@stevenh
Author

stevenh commented Feb 24, 2020

Enabling debugging shows a grim picture:
1.5.3

2020-02-24T15:02:22Z D! Output [influxdb] wrote batch of 202 metrics in 4.797833ms
2020-02-24T15:02:22Z D! Output [influxdb] wrote batch of 7647 metrics in 181.80504ms
2020-02-24T15:02:23Z D! Output [influxdb] wrote batch of 10000 metrics in 96.02726ms
2020-02-24T15:02:23Z D! Output [influxdb] wrote batch of 10000 metrics in 75.335159ms
2020-02-24T15:02:24Z D! Output [influxdb] wrote batch of 10000 metrics in 112.695297ms
2020-02-24T15:02:24Z D! Output [influxdb] wrote batch of 10000 metrics in 102.660173ms
2020-02-24T15:02:24Z D! Output [influxdb] wrote batch of 10000 metrics in 109.212659ms
2020-02-24T15:02:25Z D! Output [influxdb] wrote batch of 10000 metrics in 177.093983ms
2020-02-24T15:02:25Z D! Output [influxdb] wrote batch of 10000 metrics in 147.962495ms
2020-02-24T15:02:26Z D! Output [influxdb] wrote batch of 10000 metrics in 236.605836ms
2020-02-24T15:02:26Z D! Output [influxdb] wrote batch of 10000 metrics in 78.510571ms
2020-02-24T15:02:26Z D! Output [influxdb] wrote batch of 10000 metrics in 92.300102ms
2020-02-24T15:02:27Z D! Output [influxdb] wrote batch of 10000 metrics in 110.478375ms
2020-02-24T15:02:27Z D! Output [influxdb] wrote batch of 10000 metrics in 90.973399ms
2020-02-24T15:02:28Z D! Output [influxdb] wrote batch of 10000 metrics in 97.3282ms
2020-02-24T15:02:28Z D! Output [influxdb] wrote batch of 10000 metrics in 71.42018ms
2020-02-24T15:02:28Z D! Output [influxdb] wrote batch of 10000 metrics in 86.652889ms
2020-02-24T15:02:29Z D! Output [influxdb] wrote batch of 10000 metrics in 109.816836ms
2020-02-24T15:02:29Z D! Output [influxdb] wrote batch of 10000 metrics in 83.1345ms
2020-02-24T15:02:30Z D! Output [influxdb] wrote batch of 10000 metrics in 112.3121ms
2020-02-24T15:02:30Z D! Output [influxdb] wrote batch of 10000 metrics in 67.862412ms
2020-02-24T15:02:30Z D! Output [influxdb] wrote batch of 10000 metrics in 99.878214ms
2020-02-24T15:02:31Z D! Output [influxdb] wrote batch of 10000 metrics in 74.89151ms
2020-02-24T15:02:31Z D! Output [influxdb] wrote batch of 10000 metrics in 85.277209ms
2020-02-24T15:02:32Z D! Output [influxdb] wrote batch of 10000 metrics in 139.18762ms
2020-02-24T15:02:32Z D! Output [influxdb] wrote batch of 10000 metrics in 92.667221ms
2020-02-24T15:02:33Z D! Output [influxdb] wrote batch of 10000 metrics in 120.435877ms
2020-02-24T15:02:33Z D! Output [influxdb] wrote batch of 10000 metrics in 168.169058ms
2020-02-24T15:02:33Z D! Output [influxdb] wrote batch of 10000 metrics in 78.91272ms
2020-02-24T15:02:34Z D! Output [influxdb] wrote batch of 10000 metrics in 195.151852ms
2020-02-24T15:02:34Z D! Output [influxdb] wrote batch of 10000 metrics in 172.967804ms
2020-02-24T15:02:35Z D! Output [influxdb] wrote batch of 10000 metrics in 184.029354ms
2020-02-24T15:02:35Z D! Output [influxdb] wrote batch of 10000 metrics in 117.487935ms
2020-02-24T15:02:35Z D! Output [influxdb] wrote batch of 10000 metrics in 65.058525ms
2020-02-24T15:02:36Z D! Output [influxdb] wrote batch of 10000 metrics in 113.640539ms
2020-02-24T15:02:36Z D! Output [influxdb] wrote batch of 10000 metrics in 62.830476ms
2020-02-24T15:02:37Z D! Output [influxdb] wrote batch of 10000 metrics in 97.683096ms
2020-02-24T15:02:37Z D! Output [influxdb] wrote batch of 10000 metrics in 62.243451ms
2020-02-24T15:02:37Z D! Output [influxdb] wrote batch of 10000 metrics in 62.959833ms
2020-02-24T15:02:38Z D! Output [influxdb] wrote batch of 10000 metrics in 107.999259ms
2020-02-24T15:02:38Z D! Output [influxdb] wrote batch of 10000 metrics in 59.542778ms
2020-02-24T15:02:39Z D! Output [influxdb] wrote batch of 10000 metrics in 92.511075ms
2020-02-24T15:02:39Z D! Output [influxdb] wrote batch of 10000 metrics in 57.301765ms
2020-02-24T15:02:39Z D! Output [influxdb] wrote batch of 10000 metrics in 68.269444ms
2020-02-24T15:02:40Z D! Output [influxdb] wrote batch of 10000 metrics in 107.996287ms
2020-02-24T15:02:40Z D! Output [influxdb] wrote batch of 10000 metrics in 59.436764ms
2020-02-24T15:02:41Z D! Output [influxdb] wrote batch of 10000 metrics in 98.164357ms
2020-02-24T15:02:41Z D! Output [influxdb] wrote batch of 10000 metrics in 98.600295ms
2020-02-24T15:02:41Z D! Output [influxdb] wrote batch of 10000 metrics in 72.706926ms
2020-02-24T15:02:42Z D! Output [influxdb] wrote batch of 10000 metrics in 94.796907ms
2020-02-24T15:02:42Z D! Output [influxdb] wrote batch of 10000 metrics in 63.241432ms
2020-02-24T15:02:42Z D! Output [influxdb] buffer fullness: 7279 / 100000000 metrics. 

1.6.0-1c0f63a

2020-02-24T15:05:06Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:06Z D! Output [influxdb] wrote batch of 10000 metrics in 1.692500116s
2020-02-24T15:05:07Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:08Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:08Z D! Output [influxdb] wrote batch of 10000 metrics in 1.674150006s
2020-02-24T15:05:10Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:11Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:11Z D! Output [influxdb] wrote batch of 10000 metrics in 2.116327348s
2020-02-24T15:05:12Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:12Z D! Output [influxdb] wrote batch of 10000 metrics in 1.699090005s
2020-02-24T15:05:14Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:15Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:15Z D! Output [influxdb] wrote batch of 10000 metrics in 2.110488274s
2020-02-24T15:05:17Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:17Z D! Output [influxdb] wrote batch of 10000 metrics in 1.810292032s
2020-02-24T15:05:18Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:19Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:19Z D! Output [influxdb] wrote batch of 10000 metrics in 1.69589895s
2020-02-24T15:05:20Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:21Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:21Z D! Output [influxdb] wrote batch of 10000 metrics in 1.922090229s
2020-02-24T15:05:23Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:23Z D! Output [influxdb] wrote batch of 10000 metrics in 1.865353774s
2020-02-24T15:05:24Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:25Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1s)
2020-02-24T15:05:25Z D! Output [influxdb] wrote batch of 10000 metrics in 1.742904221s
2020-02-24T15:05:25Z D! Output [influxdb] buffer fullness: 108 / 100000000 metrics.

So it's taking ~20x as long per 10k metrics after this commit.

@stevenh
Author

stevenh commented Feb 24, 2020

OK, so the above has helped identify the cause, which I now believe is the serializer.

Before 1c0f63a

go test -run=X -bench=BenchmarkReader
goos: linux
goarch: amd64
pkg: github.com/influxdata/telegraf/metric
BenchmarkReader/minimal-8               28332624                42.2 ns/op
BenchmarkReader/multiple_tags-8         26943465                43.9 ns/op
BenchmarkReader/multiple_fields-8       28021015                42.6 ns/op
BenchmarkReader/float_NaN-8             27909313                42.9 ns/op
BenchmarkReader/float_Inf-8             27574416                42.6 ns/op
BenchmarkReader/integer_field-8         27911779                42.5 ns/op
BenchmarkReader/bool_field-8            27952675                42.5 ns/op
BenchmarkReader/string_field-8          27120421                42.7 ns/op
BenchmarkReader/timestamp-8             27877866                42.6 ns/op
BenchmarkReader/split_fields_exact-8            27464982                43.5 ns/op
BenchmarkReader/split_fields_extra-8            27352546                43.4 ns/op
BenchmarkReader/need_more_space-8               27361215                43.9 ns/op
BenchmarkReader/procstat-8                      15090884                78.8 ns/op
PASS
ok      github.com/influxdata/telegraf/metric   34.209s

After 1c0f63a

go test -run=X -bench=BenchmarkReader
goos: linux
goarch: amd64
pkg: github.com/influxdata/telegraf/plugins/serializers/influx
BenchmarkReader/minimal-8                4087734               294 ns/op
BenchmarkReader/multiple_tags-8          2572887               463 ns/op
BenchmarkReader/multiple_fields-8        2619646               457 ns/op
BenchmarkReader/float_NaN-8              5308399               225 ns/op
BenchmarkReader/float_Inf-8              4671318               245 ns/op
BenchmarkReader/integer_field-8          4953704               223 ns/op
BenchmarkReader/bool_field-8             5861647               212 ns/op
BenchmarkReader/string_field-8           4722679               237 ns/op
BenchmarkReader/timestamp-8              3255915               370 ns/op
BenchmarkReader/split_fields_exact-8             2854992               423 ns/op
BenchmarkReader/split_fields_extra-8             2973882               408 ns/op
BenchmarkReader/procstat-8                        121318              9745 ns/op
PASS
ok      github.com/influxdata/telegraf/plugins/serializers/influx       20.818s

@stevenh
Author

stevenh commented Feb 24, 2020

For reference here are the benchmarks.

// Old serializer
func BenchmarkReader(b *testing.B) {
        for _, tt := range tests {
                b.Run(tt.name, func(b *testing.B) {
                        d := make([]telegraf.Metric, b.N)
                        for i := range d {
                                d[i] = tt.input
                        }
                        r := NewReader(d)
                        var (
                                err error
                                n   int64
                        )

                        b.ResetTimer()

                        if n, err = io.Copy(ioutil.Discard, r); err != nil {
                                b.Error(err)
                        }
                        last = n
                })
        }
}

// New Serializer
func BenchmarkReader(b *testing.B) {
        for _, tt := range tests {
                b.Run(tt.name, func(b *testing.B) {
                        if tt.err != nil {
                                b.Skip("would error")
                        }
                        d := make([]telegraf.Metric, b.N)
                        for i := range d {
                                d[i] = tt.input
                        }
                        serializer := NewSerializer()
                        serializer.SetMaxLineBytes(tt.maxBytes)

                        var (
                                err error
                                n   int64
                        )
                        r := NewReader(d, serializer)
                        b.ResetTimer()

                        if n, err = io.Copy(ioutil.Discard, r); err != nil {
                                b.Error(err)
                        }
                        last = n
                })
        }
}

@stevenh
Author

stevenh commented Feb 24, 2020

After digging some more, it looks like before 1c0f63a 95% of the serialization work was done in metric.New, which happens when the metric is received and is likely highly parallelised. After this commit, serialisation is done during the write to the output, which is serial and in a time-critical path.

Thoughts?
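
To illustrate the concern with a toy sketch (hypothetical code, not Telegraf's actual pipeline): the same per-metric work is cheap when spread across goroutines at ingest time, but adds directly to write latency when it sits in the single serial output loop.

package sketch

import (
	"io"
	"sync"
)

// work stands in for per-metric serialization cost.
type work func(m string) []byte

// serializeParallel does the per-metric work up front, spread across goroutines,
// roughly what happens when the cost sits in metric.New at ingest time.
func serializeParallel(metrics []string, fn work) [][]byte {
	out := make([][]byte, len(metrics))
	var wg sync.WaitGroup
	for i := range metrics {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = fn(metrics[i])
		}(i)
	}
	wg.Wait()
	return out
}

// serializeSerial does the same work inside the single writer loop,
// roughly what happens when the cost moves into the output write path.
func serializeSerial(metrics []string, fn work, w io.Writer) error {
	for _, m := range metrics {
		if _, err := w.Write(fn(m)); err != nil {
			return err
		}
	}
	return nil
}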

@danielnelson
Contributor

There have been many changes since 1.5, and different areas of Telegraf have changed performance characteristics. I don't think the serializer is the problem; it just appears that way at first because the work is done at a different stage.

The metric in this benchmark is rather expensive to serialize since it has many fields:

BenchmarkReader/procstat-8 121318 9745 ns/op

With those timings, you could still serialize this metric 1000000000 / 9745 = 102616 times per second, and for most workloads I expect quite a bit simpler metrics. However, on your system it is taking 1.92 seconds to write 10000 metrics to InfluxDB.

2020-02-24T15:05:21Z D! Output [influxdb] wrote batch of 10000 metrics in 1.922090229s

We should check how long it is taking for InfluxDB to process the request; in the InfluxDB logs the last value is the time in microseconds:

[httpd] 127.0.0.1 - - [24/Feb/2020:11:26:16 -0800] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "Telegraf/1.13.4" 8688a72a-573b-11ea-8015-c85b76c9e68c 3418
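
For reference, the trailing 3418 in that example line is about 3.4 ms. A rough sketch for pulling that field out of an InfluxDB httpd access log, assuming the field layout shown above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Read an InfluxDB log on stdin and print the duration of each [httpd]
	// /write request in milliseconds (the last field is microseconds).
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "[httpd]") || !strings.Contains(line, "/write") {
			continue
		}
		fields := strings.Fields(line)
		us, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		fmt.Printf("%.1fms  %s\n", us/1000, line)
	}
}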

Finally, what error are you experiencing with Telegraf 1.13.3? I suspect you would no longer get the "took longer to collect than collection interval" warning.

danielnelson added the area/statsd, need more info, and performance labels on Feb 24, 2020
@danielnelson
Contributor

Would you be able to also attach an example of the metrics you are emitting? You could use the file output with data_format = "influx" and write to a file.

@stevenh
Author

stevenh commented Feb 24, 2020

That procstat case is just one example; as you can see from the sub-benchmarks, they are all roughly 10x slower.

I'm not sure checking the InfluxDB response times is worth it, because the end requests should be the same and hence the response times the same, as both versions write to the same cluster.

All versions after 1.5.3 drop so many metrics due to the slow processing in telegraf that it's hard to test for long in our prod setup, which is where the real load is.

Below is a graph showing the impact on a simple DB connection count (15:05:00 -> 15:05:45): most metrics are dropped, with only a few getting through.
[graph attached]

Here's the result of an end-to-end test of a metric, exercising both metric.New and the Reader.
Pre 1c0f63a

go test -run=X -bench=BenchmarkNewAndReader -benchmem
goos: linux
goarch: amd64
pkg: github.com/influxdata/telegraf/metric
BenchmarkNewAndReader-8            45546             25778 ns/op            1792 B/op          7 allocs/op
PASS
ok      github.com/influxdata/telegraf/metric   1.473s

Post 1c0f63a

go test -run=X -bench=BenchmarkNewAndReader
-benchmem
goos: linux
goarch: amd64
pkg: github.com/influxdata/telegraf/plugins/serializers/influx
BenchmarkNewAndReader-8            33415             34596 ns/op            2928 B/op        124 allocs/op
PASS
ok      github.com/influxdata/telegraf/plugins/serializers/influx       1.549s

So the end-to-end picture is better than the Reader-only numbers suggest, but it's still ~34% slower, with ~17x more allocs and ~1.6x the memory.

To be clear, all of these benchmarks were done with commit 1c0f63a as the only change.

@stevenh
Author

stevenh commented Feb 24, 2020

Would you be able to also attach an example of the metrics you are emitting? You could use the file output with data_format = "influx" and write to a file.

Is there anything in particular you're looking for?

I ask because statsd doesn't have many options, so I can describe the sort of metrics that would be in there; however, I'm pretty sure the main driver is volume (~20-30k/s from the debug logs), not metric specifics.

@danielnelson
Contributor

Is there anything in particular you're looking for?

Primarily looking for the average number of tags/fields per metric to help gauge how much time is being spent on serialization vs sending the request.

If you add this to your configuration and run it for several minutes it will collect all output as line protocol.

[[outputs.file]]
  files = ["/tmp/metrics"]
  data_format = "influx"
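
Once that file exists, the average tag and field counts can be pulled out of it with something like the following sketch. It reuses the parsers.NewInfluxParser call that appears in the benchmark later in this thread; the Tags()/Fields() accessor names on telegraf.Metric are assumed.

package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"github.com/influxdata/telegraf/plugins/parsers"
)

func main() {
	// Parse the line protocol written by the file output and report the
	// average number of tags and fields per metric.
	buf, err := ioutil.ReadFile("/tmp/metrics")
	if err != nil {
		log.Fatal(err)
	}
	parser, err := parsers.NewInfluxParser()
	if err != nil {
		log.Fatal(err)
	}
	metrics, err := parser.Parse(buf)
	if err != nil {
		log.Fatal(err)
	}
	if len(metrics) == 0 {
		log.Fatal("no metrics parsed")
	}
	var tags, fields int
	for _, m := range metrics {
		tags += len(m.Tags())     // telegraf.Metric accessors; names assumed
		fields += len(m.Fields())
	}
	n := float64(len(metrics))
	fmt.Printf("metrics=%d avg tags=%.1f avg fields=%.1f\n", len(metrics), float64(tags)/n, float64(fields)/n)
}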

@danielnelson
Contributor

If you prefer, you could pass these files through support, or PGP encrypt to my public key.

@stevenh
Author

stevenh commented Feb 25, 2020

If you prefer, you could pass these files through support, or PGP encrypt to my public key.

Thanks, uploaded to the support site.

@danielnelson
Contributor

I wrote a quick benchmark serializing the metrics:

func BenchmarkFromFile(b *testing.B) {
	file, err := os.Open("metrics")
	require.NoError(b, err)

	buf, err := ioutil.ReadAll(file)
	require.NoError(b, err)

	parser, err := parsers.NewInfluxParser()
	require.NoError(b, err)

	metrics, err := parser.Parse(buf)
	require.NoError(b, err)

	if len(metrics) > 10000 {
		metrics = metrics[:10000]
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		readbuf := make([]byte, 4096, 4096)
		serializer := NewSerializer()
		reader := NewReader(metrics, serializer)
		for {
			_, err := reader.Read(readbuf)
			if err == io.EOF {
				break
			}

			if err != nil {
				panic(err.Error())
			}
		}
	}
}
$ go test -short ./plugins/serializers/influx/ -bench=BenchmarkFromFile -benchmem
BenchmarkFromFile-4          192           6375362 ns/op            6000 B/op         13 allocs/op

Based on this, it doesn't seem that the write time is substantially increased by the relocated serialization.

Using the tail input to read the metrics file and an InfluxDB instance located on the same system, I ran the following test with both Telegraf 1.6 and master. Both versions had similar timings:

[[inputs.tail]]
  files = ["metrics"]
  from_beginning = true
  data_format = "influx"

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
$ telegraf --debug
...
2020-02-26T19:17:42Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 156.104294ms
2020-02-26T19:17:42Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 124.879433ms
2020-02-26T19:17:43Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 130.78567ms
2020-02-26T19:17:43Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 148.891921ms
2020-02-26T19:17:43Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 123.403691ms
2020-02-26T19:17:43Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 162.445557ms

I also inspected the profile images, which both seem to point towards statsd for CPU usage.

@stevenh Can you show log output with Telegraf 1.13.x and send HTTP access logs from the InfluxDB server (normally included in the main log)?

Also, I notice now that you are running on FreeBSD; are you able to compare against a Linux system?

stevenh added a commit to stevenh/telegraf that referenced this issue Feb 28, 2020
Improve the gather performance of the statsd input processor by
processing metric types in parallel.

Use Agent.MetricBatchSize to configure the processing channel
buffer, which allows high-throughput inputs to process entire batches.

Together these reduce the processing time of 50k metrics from over
1 second to ~100ms when not limited by the current influx metric
serializer.

Helps: influxdata#7070
stevenh added a commit to stevenh/telegraf that referenced this issue Feb 28, 2020
Improve the gather performance of the statsd input processor by
processing metric types in parallel.

Use Agent.MetricBatchSize to configure the processing channel
buffer, which allows high-throughput inputs to process entire batches.

Together these reduce the processing time of 50k metrics from over 1
second to ~100ms when not limited by the current influx metric
serializer with a good selection of metric types.

Helps: influxdata#7070
stevenh added a commit to stevenh/telegraf that referenced this issue Feb 28, 2020
Improve the gather performance of the statsd input processor by
processing metric types in parallel.

Use Agent.MetricBatchSize to configure the processing channel
buffer, which allows high-throughput inputs to process entire batches.

Together these reduce the processing time of 50k metrics from over 1
second to ~100ms when not limited by the current influx metric
serializer with a good selection of metric types.

Also:
* Add debug information about gather timings.

Helps: influxdata#7070
stevenh added a commit to stevenh/telegraf that referenced this issue Feb 28, 2020
Improve the gather performance of the statsd input processor by processing metric types in parallel.

Use Agent.MetricBatchSize to configure the processing channel buffer, which allows high-throughput inputs to process entire batches.

Together these reduce the processing time of 50k metrics from over 1 second to ~100ms when not limited by the current influx metric
serializer with a good selection of metric types.

Also:
* Add debug information about gather timings.

Helps: influxdata#7070
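
The approach described in those commit messages, sketched roughly below; this is hypothetical illustration code, not the actual PR:

package sketch

import "sync"

// gatherParallel runs one gather function per statsd metric type concurrently,
// mirroring the "process metric types in parallel" approach described above.
func gatherParallel(gatherers map[string]func() error) map[string]error {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		errs = make(map[string]error, len(gatherers))
	)
	for name, fn := range gatherers {
		wg.Add(1)
		go func(name string, fn func() error) {
			defer wg.Done()
			err := fn() // e.g. collect counters, gauges, sets or timings
			mu.Lock()
			errs[name] = err
			mu.Unlock()
		}(name, fn)
	}
	wg.Wait()
	return errs
}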
@stevenh
Author

stevenh commented Feb 28, 2020

To clarify, you need to compare 1.5.3 with later versions, not 1.6.0; the issue was present in the first 1.6 release.

503881d log + #7094

2020-02-28T15:56:19Z D! Attempting connection to output: influxdb
2020-02-28T15:56:20Z D! Successfully connected to output: influxdb
2020-02-28T15:56:20Z I! Starting Telegraf v1.6.0~unknown
2020-02-28T15:56:20Z I! Loaded outputs: influxdb
2020-02-28T15:56:20Z I! Loaded inputs: inputs.statsd
2020-02-28T15:56:20Z I! Tags enabled: host=telegraf-1
2020-02-28T15:56:20Z I! Agent Config: Interval:1s, Quiet:false, Hostname:"telegraf-1", Flush Interval:20s 
2020-02-28T15:56:20Z I! Started the statsd service on :8125
2020-02-28T15:56:20Z I! Statsd UDP listener listening on:  0.0.0.0:8125
2020-02-28T15:56:21Z D! [inputs.statsd] collected 0 sets in 6.021µs
2020-02-28T15:56:21Z D! [inputs.statsd] collected 34 timings in 679.911µs
2020-02-28T15:56:21Z D! [inputs.statsd] collected 399 counters in 4.036409ms
2020-02-28T15:56:21Z D! [inputs.statsd] collected 1289 gauges in 6.609457ms
2020-02-28T15:56:21Z D! [inputs.statsd] collected 1722 metrics in 6.658319ms
2020-02-28T15:56:22Z D! [inputs.statsd] collected 0 sets in 11.023µs
2020-02-28T15:56:22Z D! [inputs.statsd] collected 33 timings in 788.461µs
2020-02-28T15:56:22Z D! [inputs.statsd] collected 3955 counters in 39.567788ms
2020-02-28T15:56:22Z D! [inputs.statsd] collected 21435 gauges in 118.429832ms
2020-02-28T15:56:22Z D! [inputs.statsd] collected 25423 metrics in 118.536806ms
2020-02-28T15:56:22Z D! Output [influxdb] wrote batch of 10000 metrics in 401.897552ms
2020-02-28T15:56:22Z D! Output [influxdb] wrote batch of 10000 metrics in 270.234304ms
2020-02-28T15:56:23Z D! [inputs.statsd] collected 0 sets in 5.515µs
2020-02-28T15:56:23Z D! [inputs.statsd] collected 35 timings in 774.443µs
2020-02-28T15:56:23Z D! [inputs.statsd] collected 3950 counters in 39.943855ms
2020-02-28T15:56:23Z D! Output [influxdb] wrote batch of 10000 metrics in 246.049251ms
2020-02-28T15:56:23Z D! [inputs.statsd] collected 21454 gauges in 271.037945ms
2020-02-28T15:56:23Z D! [inputs.statsd] collected 25439 metrics in 271.129471ms
2020-02-28T15:56:23Z D! Output [influxdb] wrote batch of 10000 metrics in 182.628177ms
2020-02-28T15:56:23Z D! Output [influxdb] wrote batch of 10000 metrics in 79.284274ms
2020-02-28T15:56:24Z D! [inputs.statsd] collected 0 sets in 38.048µs
2020-02-28T15:56:24Z D! [inputs.statsd] collected 39 timings in 856.645µs
2020-02-28T15:56:24Z D! [inputs.statsd] collected 3960 counters in 40.764967ms
2020-02-28T15:56:24Z D! Output [influxdb] wrote batch of 10000 metrics in 81.769943ms
2020-02-28T15:56:24Z D! [inputs.statsd] collected 21447 gauges in 118.947606ms
2020-02-28T15:56:24Z D! [inputs.statsd] collected 25446 metrics in 119.039844ms
2020-02-28T15:56:24Z D! Output [influxdb] wrote batch of 10000 metrics in 73.27261ms
2020-02-28T15:56:25Z D! [inputs.statsd] collected 0 sets in 6.093µs
2020-02-28T15:56:25Z D! [inputs.statsd] collected 43 timings in 828.555µs
2020-02-28T15:56:25Z D! [inputs.statsd] collected 3957 counters in 37.953406ms
2020-02-28T15:56:25Z D! Output [influxdb] wrote batch of 10000 metrics in 118.714727ms
2020-02-28T15:56:25Z D! [inputs.statsd] collected 21440 gauges in 144.443008ms
2020-02-28T15:56:25Z D! [inputs.statsd] collected 25440 metrics in 144.538283ms
2020-02-28T15:56:25Z D! Output [influxdb] wrote batch of 10000 metrics in 123.317273ms
2020-02-28T15:56:25Z D! Output [influxdb] wrote batch of 10000 metrics in 99.313251ms

1.13.3 log

2020-02-28T15:47:20Z I! Starting Telegraf 1.13.3
2020-02-28T15:47:21Z I! Loaded inputs: statsd
2020-02-28T15:47:21Z I! Loaded aggregators: 
2020-02-28T15:47:21Z I! Loaded processors: 
2020-02-28T15:47:21Z I! Loaded outputs: influxdb
2020-02-28T15:47:21Z I! Tags enabled: host=telegraf-1
2020-02-28T15:47:21Z I! [agent] Config: Interval:1s, Quiet:false, Hostname:"telegraf-1", Flush Interval:20s
2020-02-28T15:47:21Z D! [agent] Initializing plugins
2020-02-28T15:47:21Z D! [agent] Connecting outputs
2020-02-28T15:47:21Z D! [agent] Attempting connection to [outputs.influxdb]
2020-02-28T15:47:21Z W! [outputs.influxdb] When writing to [http://influxdb.prod.multiplay.farm:8086]: database "telegraf" creation failed: 403 Forbidden
2020-02-28T15:47:21Z D! [agent] Successfully connected to outputs.influxdb
2020-02-28T15:47:21Z D! [agent] Starting service inputs
2020-02-28T15:47:21Z I! [inputs.statsd] UDP listening on "0.0.0.0:8125"
2020-02-28T15:47:21Z I! [inputs.statsd] Started the statsd service on ":8125"
2020-02-28T15:47:42Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 2.157850767s
2020-02-28T15:47:42Z D! [outputs.influxdb] Buffer fullness: 498676 / 100000000 metrics
2020-02-28T15:47:45Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.566264595s
2020-02-28T15:47:45Z D! [outputs.influxdb] Buffer fullness: 565007 / 100000000 metrics
2020-02-28T15:47:49Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.513972923s
2020-02-28T15:47:49Z D! [outputs.influxdb] Buffer fullness: 656720 / 100000000 metrics
2020-02-28T15:47:52Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.526123867s
2020-02-28T15:47:52Z D! [outputs.influxdb] Buffer fullness: 722964 / 100000000 metrics
2020-02-28T15:47:56Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.70984312s
2020-02-28T15:47:56Z D! [outputs.influxdb] Buffer fullness: 814680 / 100000000 metrics
2020-02-28T15:48:00Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.715881594s
2020-02-28T15:48:00Z D! [outputs.influxdb] Buffer fullness: 906380 / 100000000 metrics
2020-02-28T15:48:03Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.092204855s
2020-02-28T15:48:06Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.30458417s
2020-02-28T15:48:10Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.420787917s
2020-02-28T15:48:13Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.259913787s
2020-02-28T15:48:16Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.308266389s
2020-02-28T15:48:19Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.330443987s
2020-02-28T15:48:20Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
2020-02-28T15:48:20Z D! [outputs.influxdb] Buffer fullness: 1355090 / 100000000 metrics
2020-02-28T15:48:23Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.271907154s
2020-02-28T15:48:26Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.503620906s
2020-02-28T15:48:29Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.304354979s
2020-02-28T15:48:33Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.659279575s
2020-02-28T15:48:37Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.582184448s
2020-02-28T15:48:40Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
2020-02-28T15:48:40Z D! [outputs.influxdb] Buffer fullness: 1813577 / 100000000 metrics
2020-02-28T15:48:40Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.327407759s
2020-02-28T15:48:43Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.296182971s
2020-02-28T15:48:47Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.329990985s
2020-02-28T15:48:50Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.508410407s
2020-02-28T15:48:53Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.23448115s
2020-02-28T15:48:57Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.496496328s
2020-02-28T15:49:00Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
2020-02-28T15:49:00Z D! [outputs.influxdb] Buffer fullness: 2262013 / 100000000 metrics
2020-02-28T15:49:00Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.161951487s
2020-02-28T15:49:04Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.455994566s
2020-02-28T15:49:07Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.208061889s
2020-02-28T15:49:10Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.465170208s
2020-02-28T15:49:14Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.290738626s
2020-02-28T15:49:17Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.465206411s
2020-02-28T15:49:20Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
2020-02-28T15:49:20Z D! [outputs.influxdb] Buffer fullness: 2709808 / 100000000 metrics
2020-02-28T15:49:20Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.314862333s
2020-02-28T15:49:24Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.250056425s
2020-02-28T15:49:27Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.600627946s
2020-02-28T15:49:31Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.458777886s
2020-02-28T15:49:34Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.361517827s

The other issue I noticed with 1.13.3 is that it doesn't restart in a sensible amount of time; see below:
1.13.3 restart log

2020-02-28T15:50:29Z D! [agent] Stopping service inputs
2020-02-28T15:50:29Z I! [inputs.statsd] Stopping the statsd service
2020-02-28T15:50:29Z I! [inputs.statsd] Stopped listener service on ":8125"
2020-02-28T15:50:29Z D! [agent] Input channel closed
2020-02-28T15:50:29Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-02-28T15:50:32Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.382553836s
2020-02-28T15:50:35Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.482946754s
2020-02-28T15:50:38Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 2.689079828s
2020-02-28T15:50:40Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
2020-02-28T15:50:40Z D! [outputs.influxdb] Buffer fullness: 4224388 / 100000000 metrics
2020-02-28T15:50:41Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 2.688427127s
2020-02-28T15:50:44Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 2.998598311s
2020-02-28T15:50:47Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.097103144s
2020-02-28T15:50:49Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 2.845904968s
2020-02-28T15:50:53Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.334889984s
2020-02-28T15:50:56Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.038865127s
2020-02-28T15:50:59Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.044566133s
...
2020-02-28T15:52:49Z D! [outputs.influxdb] Buffer fullness: 3784388 / 100000000 metrics
2020-02-28T15:52:52Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 2.840469831s
2020-02-28T15:52:55Z D! [outputs.influxdb] Wrote batch of 10000 metrics in 3.305453646s
2020-02-28T15:52:56Z E! [outputs.influxdb] When writing to [http://influxdb.prod.multiplay.farm:8086]: Post http://influxdb.prod.multiplay.farm:8086/write?db=telegraf: write tcp 10.10.96.75:30328->10.10.96.71:8086: use of closed network connection
2020-02-28T15:52:56Z D! [outputs.influxdb] Buffer fullness: 3764388 / 100000000 metrics
2020-02-28T15:52:56Z E! [agent] Error writing to outputs.influxdb: could not write any address
2020-02-28T15:52:56Z D! [agent] Closing outputs
2020-02-28T15:52:56Z D! [agent] Stopped Successfully

Here are the timings I see when comparing 503881d (the last good version) and 1c0f63a (the commit which changed the serializer). Both include a zero-allocation WriteTo implementation I've created, but the key point is that reading the data, which includes any required serialisation, is significantly slower.

I need to compare and contrast this with HEAD, which may have some changes.

503881d benchmark
BenchmarkNewAndReaderDirect/reader-only-24 31962585 37.7 ns/op 0 B/op 0 allocs/op
BenchmarkNewAndReaderDirect/reader-only-24 31880418 37.7 ns/op 0 B/op 0 allocs/op

1c0f63a benchmark
BenchmarkNewAndReaderDirect/reader-only-24 231531 5174 ns/op 0 B/op 0 allocs/op
BenchmarkNewAndReaderDirect/reader-only-24 231993 5184 ns/op 0 B/op 0 allocs/op

We can't compare with Linux, but as FreeBSD typically has a faster network stack it's unlikely to impact this.

Hope this helps.
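
The zero-allocation WriteTo implementation mentioned above is not shown in this thread, but the general shape of an io.WriterTo that streams serialized metrics straight to the output, rather than copying through a fixed Read buffer, looks roughly like this hypothetical sketch:

package sketch

import "io"

// metricsWriterTo streams serialized metrics directly to the output writer,
// avoiding the intermediate fixed-size buffer of the Read-based path.
// serialize stands in for the line-protocol serializer; the real API differs.
type metricsWriterTo struct {
	metrics   []string
	serialize func(m string) []byte
}

func (mw *metricsWriterTo) WriteTo(w io.Writer) (int64, error) {
	var total int64
	for _, m := range mw.metrics {
		n, err := w.Write(mw.serialize(m))
		total += int64(n)
		if err != nil {
			return total, err
		}
	}
	return total, nil
}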

@danielnelson
Contributor

On shutdown these days Telegraf attempts to flush all in-flight data; I assume this is the reason it takes so long to shut down.

Wrote batch of 10000 metrics in 3.044566133s

This is the core of the issue: it is taking 3 seconds to write 10000 points, which means your top speed is going to be ~3000 points per second.

On my system, the time to serialize 10000 points is 6375362 ns/op (about 6.4ms). If this is true on your system as well, then we need to figure out what is going on in the rest of the 3 seconds, since even if it took 0ns to serialize it would still be too slow. If it is actually taking 3s to serialize on your system, then we need to figure out why it is so much slower on your side and whether there is a way to replicate it.

Can you add the HTTP access logs from the InfluxDB server to the support site and perform one more test for me:

  • Temporarily remove the influxdb output and replace with the file output, as we did before to collect the metrics.
  • Run telegraf with --debug and add the logs here.

@stevenh
Author

stevenh commented Feb 28, 2020

That's my point: only the newer versions of telegraf are that slow. Compare that with the old version:
2020-02-28T15:56:23Z D! Output [influxdb] wrote batch of 10000 metrics in 246.049251ms

This is the same system with the same config.

@stevenh
Author

stevenh commented Feb 28, 2020

Oh, and to clarify: I believe the "wrote batch of XXXX" messages include the serialisation time; I don't believe it's purely the time to write the metrics once serialisation has completed, as the message itself suggests.

@danielnelson
Contributor

I believe the "wrote batch of XXXX" messages include the serialisation time

This is correct; it includes both times, and they happen concurrently.

In my testing with Telegraf 1.13.3 it takes ~150ms, and Telegraf 1.5 takes more or less the same amount of time. Serializing 10k metrics also takes only 0.006s; even if it were fully removed it wouldn't affect the performance significantly.

There are three possibilities as I see it:

  1. For some reason your system is very slow at serializing, much slower than in my testing.
  2. InfluxDB is the cause of the slowdown
  3. Network speed is the cause of the slowdown

If you can get the measurements with the file output, that should help us determine if 1 is the issue, and the logs from InfluxDB will be a starting point for 2. Checking 3 is a bit harder; let's put that off for now.

@stevenh
Author

stevenh commented Mar 7, 2020

We're currently in the middle of migrating from our on-premise cluster to InfluxDB Cloud, which is unfortunately complicating testing.

In addition, we had some corrupt stats hitting telegraf, which have now been fixed.

When we have gathered further stats I will post an update.

@austin-barrington

Hey, just picking this up from @stevenh.

We have gathered a ~1GB file of metrics and run the tail input with the file output on three versions across Linux and FreeBSD.

Note:
Linux CPU: Intel(R) Xeon(R) CPU E3-1285L v4 @ 3.40GHz
FreeBSD CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz

(I can test this on the same hardware, but it will take another day to reinstall a box.)

Config is the same on all of them (except the path of the files):

# Telegraf Configuration
#
[global_tags]

[agent]
  hostname = "telegraf-1"
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 100000000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "20s"
  flush_jitter = "0s"
  debug = true
  quiet = false

[[outputs.file]]
  files = ["/tmp/telegraf/output.file"]
  data_format = "influx"

[[inputs.tail]]
  files = ["/tmp/telegraf/metrics"]
  data_format = "influx"
  from_beginning = true

The average time it took to process a batch:

| Telegraf | FreeBSD | Linux |
| --- | --- | --- |
| telegraf-1-13-4 | 13.9879118ms | 2.61414089ms |
| telegraf-1-6-4 | 13.7730599ms | 2.51980233ms |
| telegraf-1-5-3 | 10.8472227ms | 1.68554957ms |

Total time taken to process the file:

| Telegraf | FreeBSD | Linux |
| --- | --- | --- |
| telegraf-1-13-4 | 67.37s | 12.59s |
| telegraf-1-6-4 | 66.34s | 12.13s |
| telegraf-1-5-3 | 52.25s | 8.11s |
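
For reference, averages like these can be reproduced from the debug logs with a small sketch along these lines, assuming the "Wrote batch of N metrics in D" format shown in the logs above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

func main() {
	// Average the batch write times from a telegraf --debug log on stdin,
	// matching lines like "Wrote batch of 10000 metrics in 156.104294ms".
	re := regexp.MustCompile(`[Ww]rote batch of \d+ metrics in (\S+)`)
	var total time.Duration
	var count int
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		m := re.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		total += d
		count++
	}
	if count > 0 {
		fmt.Printf("batches=%d average=%s total=%s\n", count, total/time.Duration(count), total)
	}
}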

I've attached the log files to this comment so you can look through them.
Linux:
1-5-3-linux.log
1-6-4-linux.log
1-13-4-linux.log

FreeBSD:
1-5-3-freebsd.log
1-6-4-freebsd.log
1-13-4-freebsd.log

From my understanding, I don't think possibility 2 or 3 is the issue.

@danielnelson
Contributor

Hi @austin-barrington, I'd like to regroup a bit; there are a few possible issues raised above, and I want to make sure we identify one in particular to fix here. Considering only Telegraf 1.13.4, is there a particular error you are receiving?

@pbootly

pbootly commented Mar 18, 2020

Hey,

What we're currently looking to do is work through the three suggested causes for the issue we're encountering:

  1. For some reason your system is very slow at serializing, much slower than in my testing.
  2. InfluxDB is the cause of the slowdown.
  3. Network speed is the cause of the slowdown.

The above test was a way for us to measure serialization performance without impacting our production traffic and load. Below are the average and total times for the telegraf versions in question, running on an exact hardware match to the FreeBSD machine.

| Telegraf | Average | Total time |
| --- | --- | --- |
| telegraf-1-13-4 | 7.04ms | 33.95s |
| telegraf-1-5-3 | 4.96ms | 23.91s |
| telegraf-1-6-4 | 6.72ms | 32.37s |

From this, I imagine it's worth carrying on and starting with 1: is serialization on the system the issue?

Once that's proven or disproven as the issue we'll move on to 2, then 3, and by then hopefully have a string to pull at.

@danielnelson
Contributor

It's still not clear to me what exactly the issue you are experiencing is. The performance characteristics of Telegraf have changed and will change across releases; keeping them the same for every use case is not a project goal. It would be helpful to know what error you run into when you attempt to upgrade; this will help us identify where we can take a pass over the code for ways to improve performance.

Unfortunately, comparisons against Telegraf 1.5 are just too old to be helpful.

The average time it took to process a batch:

| Telegraf | FreeBSD | Linux |
| --- | --- | --- |
| telegraf-1-13-4 | 13.9879118ms | 2.61414089ms |

With these rates and a batch size of 1000, you should be able to write 1000 / 0.0139 = 71942 metrics per second on the FreeBSD system.

@pbootly

pbootly commented Mar 19, 2020

Can you clarify how you're calculating those throughput figures? The issue is that, if that is the case, we are not getting anywhere near that performance; for example, from stevenh's response above:

Output [influxdb] wrote batch of 10000 metrics in 2.110488274s

From the replies here it seems very apparent to me that the regression behind whatever performance hit we're seeing came in after 1.5.3, which is why the comparison of versions before and after has been done.

@danielnelson
Contributor

In theory you could write up to metric_batch_size / write_time. For that particular log message: 10000 / 2.11s = 4739.3/s.

If you were previously able to write in 100ms and now it is taking 2s, an extra 10ms in the serialization isn't the cause. We need to figure out what is happening during the other 1890ms.

To be clear, I'm not saying that performance is not reduced post-1.5 for your use case, just that the extra serialization time appears to account for only a small fraction of the slowdown you are seeing.

@stevenh
Author

stevenh commented Mar 19, 2020

I'm not sure we're measuring the same thing in this latest test; specifically, loading from disk and then writing to disk is not the same as converting from the statsd protocol to influx objects and then serialising them to write to the cluster over HTTP.

The main variance identified above between FreeBSD and Linux is likely down to the filesystem (ZFS vs XFS).

I spoke to Matt, who's been taking the lead on this, and we've identified some other tests to see if we can pin down other impacting factors, as you suggested.

This will also try to get visibility into why the Influx Cloud cluster is so much slower than our on-prem cluster. One of the possible causes is TLS setup, given requests are processed serially.

@danielnelson
Contributor

Closing the issue; I believe this was addressed through InfluxData support.
