statsd processing slow after 1.5.3 #7070
profiles.zip
Enabling debugging shows a grim picture:
1.6.0-1c0f63a
So it's taking ~20x as long per 10k metrics after this commit.
OK, so the above has helped narrow down the cause, which I now believe is the serializer.
Before 1c0f63a:
After 1c0f63a:
For reference, here are the benchmarks.

// Old serializer
func BenchmarkReader(b *testing.B) {
    for _, tt := range tests {
        b.Run(tt.name, func(b *testing.B) {
            d := make([]telegraf.Metric, b.N)
            for i := range d {
                d[i] = tt.input
            }
            r := NewReader(d)
            var (
                err error
                n   int64
            )
            b.ResetTimer()
            if n, err = io.Copy(ioutil.Discard, r); err != nil {
                b.Error(err)
            }
            last = n
        })
    }
}
// New serializer
func BenchmarkReader(b *testing.B) {
    for _, tt := range tests {
        b.Run(tt.name, func(b *testing.B) {
            if tt.err != nil {
                b.Skip("would error")
            }
            d := make([]telegraf.Metric, b.N)
            for i := range d {
                d[i] = tt.input
            }
            serializer := NewSerializer()
            serializer.SetMaxLineBytes(tt.maxBytes)
            var (
                err error
                n   int64
            )
            r := NewReader(d, serializer)
            b.ResetTimer()
            if n, err = io.Copy(ioutil.Discard, r); err != nil {
                b.Error(err)
            }
            last = n
        })
    }
}
After digging some more, it looks like before 1c0f63a 95% of serialization work was done in metric.New, which happens when receiving the metric and is likely highly parallelised. After this commit, serialisation is done during the write to the output, which is serial and on a time-critical path. Thoughts?
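To illustrate the concern with a generic sketch (not Telegraf's actual pipeline): the same per-metric cost is amortised across many ingest goroutines when it is paid in metric.New, but lands entirely on the flush path when it is paid in the single output writer. The busy-loop cost below is an arbitrary stand-in for serialization work.

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

var sink int

// expensive is a busy-loop stand-in for per-metric serialization cost.
func expensive() {
    s := 0
    for i := 0; i < 50000; i++ {
        s += i
    }
    sink = s
}

func main() {
    const n = 10000

    // Cost paid at ingest time, spread across the goroutines receiving metrics.
    workers := runtime.NumCPU()
    start := time.Now()
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < n/workers; i++ {
                expensive()
            }
        }()
    }
    wg.Wait()
    fmt.Println("parallel at ingest:", time.Since(start))

    // Same cost paid serially inside the single writer, on the flush path.
    start = time.Now()
    for i := 0; i < n; i++ {
        expensive()
    }
    fmt.Println("serial at write:", time.Since(start))
}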
There have been many changes since 1.5 and different areas of Telegraf have changed performance characteristics. I don't think the serializer is the problem; it just appears that way at first because the work is done at a different stage. This benchmark uses a rather expensive metric to serialize since it has many fields:
With those timings, you could still serialize this metric 1000000000 / 9745 = 102616 times per second, and for most workloads I expect quite a bit simpler metrics. However, on your system it is taking 1.92 seconds to write 10000 metrics to InfluxDB.
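Spelling that arithmetic out (a trivial snippet, using the 9745 ns/op figure quoted above):

package main

import "fmt"

func main() {
    const nsPerMetric = 9745.0 // serialization cost per metric, from the benchmark quoted above

    perSecond := int(1e9 / nsPerMetric)    // ≈ 102616 metrics per second
    batchSecs := 10000 * nsPerMetric / 1e9 // ≈ 0.097s to serialize a 10k batch

    fmt.Printf("%d metrics/s, %.3fs per 10k batch\n", perSecond, batchSecs)
    // Compare with the observed 1.92s to write 10000 metrics to InfluxDB.
}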
We should check how long it is taking for InfluxDB to process the request; in the InfluxDB logs the last value is the time in microseconds:
Finally, what error are you experiencing with Telegraf 1.13.3? I suspect you would no longer get the "took longer to collect than collection interval" warning.
Would you be able to also attach an example of the metrics you are emitting? You could use the
That procstat metric is just one example; as you can see from all the sub-benchmarks, they are all 10x slower. I'm not sure checking the InfluxDB response times is worth it, because the end requests should be the same and hence the response time should be the same, as both are being processed by the same cluster. All versions after 1.5.3 drop so many metrics due to the slow processing in Telegraf that it's hard to test for long in our prod setup, which is where the real load is. Below is a graph which shows the impact on a simple DB connection count (15:05:00 -> 45): most metrics are dropped, with only a few getting through. Here's the result of an end-to-end test of a metric, covering metric.New and Reader.
Post 1c0f63a
So the end-to-end picture is better, but still nearly twice as slow, with ~17x more allocs and double the memory. To be clear, all of these benchmarks have been done with commit 1c0f63a as the only change.
Is there anything in particular you're looking for? I ask because statsd doesn't have many options, so I can describe the sort of metrics that would be in there; however, I'm pretty sure the main driver is volume (~20-30k/s from the debug logs), not metric specifics.
Primarily looking for the average number of tags/fields per metric, to help gauge how much time is being spent on serialization vs sending the request. If you add this to your configuration and run it for several minutes, it will collect all output as line protocol.

[[outputs.file]]
  files = ["/tmp/metrics"]
  data_format = "influx"
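If it's easier, here is a rough sketch of how the captured file could be turned into those averages; it assumes the telegraf.Metric Tags()/Fields() accessors and the influx parser import path, so treat it as a starting point rather than a polished tool:

package main

import (
    "fmt"
    "io/ioutil"
    "log"

    "github.com/influxdata/telegraf/plugins/parsers"
)

func main() {
    buf, err := ioutil.ReadFile("/tmp/metrics")
    if err != nil {
        log.Fatal(err)
    }

    parser, err := parsers.NewInfluxParser()
    if err != nil {
        log.Fatal(err)
    }
    metrics, err := parser.Parse(buf)
    if err != nil {
        log.Fatal(err)
    }
    if len(metrics) == 0 {
        log.Fatal("no metrics parsed")
    }

    // Count tags and fields across all parsed metrics and report the averages.
    var tags, fields int
    for _, m := range metrics {
        tags += len(m.Tags())
        fields += len(m.Fields())
    }
    n := float64(len(metrics))
    fmt.Printf("%d metrics, %.1f tags/metric, %.1f fields/metric\n",
        len(metrics), float64(tags)/n, float64(fields)/n)
}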
If you prefer, you could pass these files through support, or PGP encrypt to my public key.
Thanks, uploaded to the support site.
I wrote a quick benchmark serializing the metrics:

func BenchmarkFromFile(b *testing.B) {
    // Load and parse the captured line protocol once, outside the timed loop.
    file, err := os.Open("metrics")
    require.NoError(b, err)
    buf, err := ioutil.ReadAll(file)
    require.NoError(b, err)
    parser, err := parsers.NewInfluxParser()
    require.NoError(b, err)
    metrics, err := parser.Parse(buf)
    require.NoError(b, err)
    if len(metrics) > 10000 {
        metrics = metrics[:10000]
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // Serialize all 10k metrics through the Reader, draining it in 4 KiB chunks.
        readbuf := make([]byte, 4096)
        serializer := NewSerializer()
        reader := NewReader(metrics, serializer)
        for {
            _, err := reader.Read(readbuf)
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err.Error())
            }
        }
    }
}
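For anyone wanting to reproduce this, the benchmark should be runnable with go test -bench BenchmarkFromFile -benchmem from the influx serializer package, with the captured metrics file placed in the working directory (the package location is an assumption here).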
Based on this, it doesn't seem that the write time is substantially increased due to the relocated serialization. Using the tail input to read the metrics file and an InfluxDB instance located on the same system, I ran the following test with both Telegraf 1.6 and master. Both versions had similar timings:

[[inputs.tail]]
  files = ["metrics"]
  from_beginning = true
  data_format = "influx"

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
I also inspected the profile images, which both seem to point towards statsd for CPU usage. @stevenh Can you show log output with Telegraf 1.13.x and send HTTP access logs from the InfluxDB server (normally included in the main log)? Also, I notice now that you are running on FreeBSD; are you able to compare against a Linux system?
Improve the gather performance to the statsd input processor by processing metric types in parallel. Use Agent.MetricBatchSize to configure the processing channel buffer, which allows high throughput inputs to process entire batches. Together these reduces the processing time of 50k metrics from over 1 second to ~100ms when not limited by the current influx metric serializer with a good selection of metric types. Also: * Add debug information about gather timings. Helps: influxdata#7070
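For illustration only (this is not the actual patch): a minimal sketch of the approach that commit message describes, fanning the per-type aggregation out to goroutines and buffering the per-type channels to the batch size. The names here (statMetric, process) are hypothetical.

package main

import (
    "fmt"
    "sync"
)

// statMetric is a hypothetical parsed statsd line.
type statMetric struct {
    name  string
    mtype string // "c", "g", "ms" or "s"
    value float64
}

// process fans metrics out by type so counters, gauges, timings and sets are
// aggregated in parallel; each channel is buffered to the agent batch size so
// a full batch can be accepted without blocking the listener.
func process(in []statMetric, batchSize int) {
    types := []string{"c", "g", "ms", "s"}
    chans := make(map[string]chan statMetric, len(types))

    var wg sync.WaitGroup
    for _, t := range types {
        ch := make(chan statMetric, batchSize)
        chans[t] = ch
        wg.Add(1)
        go func(t string, ch <-chan statMetric) {
            defer wg.Done()
            count := 0
            for range ch {
                count++ // stand-in for the real per-type aggregation
            }
            fmt.Printf("type %s: %d metrics\n", t, count)
        }(t, ch)
    }

    for _, m := range in {
        chans[m.mtype] <- m
    }
    for _, ch := range chans {
        close(ch)
    }
    wg.Wait()
}

func main() {
    batch := []statMetric{
        {"requests", "c", 1},
        {"latency", "ms", 12.5},
        {"queue_depth", "g", 42},
    }
    process(batch, 1000)
}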
To clarify, you need to compare 1.5.3 with later versions, not 1.6.0; the issue was present in the first 1.6 release.
503881d log + #7094
1.13.3 log
The other issue I noticed with 1.13.3 is that it doesn't restart in a sensible amount of time, see below:
Here are the timings I see when comparing 503881d (the last good version) and 1c0f63a (the commit which changed the serializer). Both of these include a zero-allocation WriteTo implementation I've created, but the key here is that reading the data, which includes any required serialisation, is significantly slower. I need to compare and contrast this with head, which may have some changes.
503881d benchmark
1c0f63a benchmark
We can't compare with Linux, but as FreeBSD typically has a faster network stack it's unlikely to impact this. Hope this helps.
On shutdown these days Telegraf makes an attempt to flush all in-flight data; I assume that this is the reason it takes so long to shut down.
This is the core of the issue: it is taking 3 seconds to write 10000 points, which means your top speed is going to be ~3000 points per second. On my system, the time to serialize 10000 points is 6375362 ns/op. If this is true on your system as well, then we need to figure out what is going on during the rest of the 3 seconds, since even if it took 0ns to serialize it would still be too slow. If it is actually taking 3s to serialize on your system, then we need to try to figure out why it is so much slower on your side and whether there is a way to replicate it. Can you add the HTTP access logs from the InfluxDB server to the support site and perform one more test for me:
That's my point: only the newer versions of Telegraf are that slow. Compare that with the old version:
This is the same system with the same config.
Oh, and to clarify, I believe the
This is correct; it includes both times, and they happen concurrently. In my testing with Telegraf 1.13.3 it takes ~150ms, and Telegraf 1.5 takes more or less the same amount of time. Serializing 10k metrics also takes 0.006s; even if it was fully removed it wouldn't affect the performance significantly. There are three possibilities as I see it:
If you can get the measurements with the file output, that should help us determine if 1 is the issue, and the logs from InfluxDB will be a starting point for 2. Checking 3 is a bit harder; let's put that off for now.
We're currently in the middle of migrating from an on-premise cluster to Influx Cloud, which is unfortunately complicating testing. In addition, we had some corrupt stats hitting Telegraf which have now been fixed. When we have gathered further stats I will post an update.
Hey, just picking this up from @stevenh. We have gathered a ~1GB file of metrics and run the tail input with the file output on three versions across Linux and FreeBSD. Note: I can test this on the same hardware, but it will take another day to reinstall a box. Config is the same on all of them (except the path of the files):
The average time it took to process a batch:
Total time taken to process the file:
I've attached the log files to this comment so you can look through them.
FreeBSD:
From my understanding, I don't think 2 or 3 is the issue.
Hi @austin-barrington, I'd like to regroup a bit; there are a few possible issues raised above and I want to make sure we identify one in particular to fix here. Only considering Telegraf 1.13.4, is there a particular error you are receiving?
Hey, what we're currently looking to do is work through the three suggested issues we're encountering.
The above is a way we were able to test serialization performance without impacting our production traffic and load. Below I've got the average and total times for the telegraf versions in question running on an exact hardware match to the BSD machine.
From this, I imagine it's worth carrying on and starting with 1 - is serialization on the system the issue. Once that's proved/disproven as the issue, we'll move on to 2, then 3, and by then hopefully have a string to pull at.
It's still not clear to me what exactly the issue you are experiencing is. The performance characteristics of Telegraf have changed and will change across releases; keeping them the same for every use case is not a project goal. It would be helpful to know what error you run into when you attempt to upgrade, as this will help us identify where we can take a pass over the code for ways to improve performance. Unfortunately, comparisons against Telegraf 1.5 are just too old to be helpful.
With these rates and a batch size of 1000, you should be able to write 1000 / 0.0139 = 71942 metrics per second on the FreeBSD system.
Can you clarify how you're calculating the metric values from above? The issue is that, if that is the case, we are not getting near that performance; for example, in stevenh's response above:
From the reply here it feels very apparent to me that the regression causing whatever performance hit we're seeing came in after 1.5.3, which is why the comparison between versions before and after has been done.
In theory you could write up to
If you were previously able to write in 100ms and now it is taking 2s, an extra 10ms in the serialization isn't the cause. We need to figure out what is happening during the other 1890ms. To be clear, I'm not saying that performance is not reduced post-1.5 for your use case, just that the extra time only appears to account for a small fraction of the slowdown you are seeing.
I'm not sure we're measuring the same thing in this latest test; specifically, loading from disk and then writing to disk is not the same as converting from statsd protocol to influx objects and then serialising to write to the cluster over HTTP. The main variance identified above between FreeBSD and Linux is likely down to the filesystem, ZFS vs XFS. I spoke to Matt, who's been taking the lead on this, and we've identified some other tests to see if we can identify other impacting factors as you suggested. This will also try to get visibility into why the Influx Cloud cluster is so much slower than our on-prem cluster. One of the possible causes is TLS setup, given requests are processed in serial.
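As a rough way to check that, something like the following sketch could time the TLS handshake separately from the whole write (the URL and body are placeholders, and the HTTP client reuses connections via keep-alive, so only the first request pays the handshake):

package main

import (
    "crypto/tls"
    "fmt"
    "log"
    "net/http"
    "net/http/httptrace"
    "strings"
    "time"
)

func main() {
    var tlsStart time.Time
    trace := &httptrace.ClientTrace{
        TLSHandshakeStart: func() { tlsStart = time.Now() },
        TLSHandshakeDone: func(_ tls.ConnectionState, _ error) {
            fmt.Println("TLS handshake:", time.Since(tlsStart))
        },
    }

    // Placeholder write request; substitute the real cluster URL and a captured batch.
    body := strings.NewReader("example,host=a value=1i 1580000000000000000\n")
    req, err := http.NewRequest("POST", "https://influx.example.com/write?db=telegraf", body)
    if err != nil {
        log.Fatal(err)
    }
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    start := time.Now()
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()
    fmt.Println("total request:", time.Since(start), "status:", resp.StatusCode)
}

If the handshake turns out to dominate each write, the next thing to confirm would be whether connections are actually being reused between flushes.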
Closing this issue; I believe this was addressed through InfluxData support.
Relevant telegraf.conf:
System info:
FreeBSD 11
Telegraf later than 1.5.3; have tried 1.6.0, 1.8.0, 1.10.0, and 1.13.7 (latest)
Steps to reproduce:
Expected behavior:
Stats should be processed
Actual behavior:
Most stats are dropped due to slow processing, and the following appears in the logs:
Additional info:
After bisecting the commits all the way back to 1.5.3 the following commit is the cause:
1c0f63a