Telegraf 1.12 poor performance (kafka_consumer -> influxdb) #6354
Comments
I can look into the kafka_consumer performance. We updated the Kafka library to a newer version, which was required to support the latest Kafka versions. There were no major changes in the InfluxDB output or agent code. |
FYI, I have been using Telegraf 1.11.x with Confluent Platform 5.3 (Kafka
2.3) for a few weeks now and experienced no issues.
|
Here is the issue that initiated the change: #6136. In 1.11.5 we use sarama-cluster, which has been deprecated for some time. |
A quick check of the Can you check the following:
|
The Kafka environment is: Here is the Telegraf config (only the uncommented lines and without credentials)...
Let's walk through some charts...

last 24-hours
The reason for the upgrade throughout the environment was to add the new inputs
I will break this chart down step-by-step...

steady-state prior to upgrade
This is the volume of writes before Telegraf was upgraded from 1.11.3 to 1.12.0.

after the upgrade to 1.12.0
Here you can see that the overall volume of traffic has decreased. However, since data is still coming into Kafka at the rates above (actually a little higher because of the new data), InfluxDB begins to fall behind. This was how I first recognized there was an issue: my charts were missing recent data and getting worse.

troubleshooting
As I was trying to figure out what was wrong I restarted Telegraf a few times. What I noticed was that immediately after the restart Telegraf would consume data at a faster rate. Those are the three little spikes seen on this graph. However, it would then quickly fall back to a slower, insufficient rate. You mention that your quick tests showed good throughput. That matches what I observed for a short period after startup, but it does not last. The test should be run over a longer period, reading a topic with a large amount of data (at least a few million records). You will also notice the really big spike to the right of this chart. This was after the downgrade to Telegraf 1.11.5. It is only a brief spike because the previous version consumed the data lagged in Kafka so quickly that the problem cleared itself almost immediately, and the overall system was back to normal.

steady-state after the downgrade to 1.11.5
Back at 1.11.5 everything is back to normal. Even with the higher rate from the new data being collected in this environment, there is no lag consuming data from Kafka.

CPU
I don't have CPU data specific to Telegraf, however the following charts still reveal a lot.
On these charts you can still see the three smaller spikes from the Telegraf restarts and the larger spike after the downgrade. Additionally, there is activity between the second and third small spikes, and after the large downgrade spike. This system also has a 2.0-alpha instance with its own separate Telegraf instance that was consuming data from the same Kafka topic (different consumer group). This CPU activity corresponds to when I shut down the 2.0-alpha container (the activity between small spikes 2 and 3) to eliminate it while troubleshooting, and when I later restarted it after the downgrade. Since this instance was shut down for a while, it was further behind and thus took longer to catch up. I actually felt that it processed the lagged data more slowly than the 1.x instance, but 2.0 is alpha and that isn't really related to this issue.

With that cleared up, it is most telling to focus on the steady-state utilization between small spikes 1 and 2, and the steady state after the downgrade. You can see that there was actually less CPU utilization while less data was being consumed. This leads me to suspect that 1.12.0 is not consuming excessive resources and starving itself of the capacity to consume more data. It seems that either something is wrong with how Telegraf communicates with Kafka, or there is some other issue internal to Telegraf. I do wonder about the latter, because of the brief moment after a restart where data was consumed quickly. It is as if data is read fine until some internal buffers are full, but Telegraf can't clear those buffers fast enough and everything slows down. In that case the issue isn't the Kafka input, but rather something else internal to Telegraf. This is of course speculation, but it does fit the observed information. |
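The "fast right after restart, then slow" pattern described above is what a bounded buffer with a slow drain would produce. Here is a toy sketch of that hypothesis only; it is not Telegraf's actual code, and the function name and every number in it are made up for illustration.

```python
def simulate(read_rate, write_rate, buffer_cap, ticks):
    """Return how many messages are consumed from the source on each tick.

    Hypothetical model: a reader fills a bounded buffer, a slower writer drains it.
    """
    in_flight = 0   # messages read but not yet written downstream
    consumed = []
    for _ in range(ticks):
        # The reader can only take what still fits in the bounded buffer.
        taken = min(read_rate, buffer_cap - in_flight)
        in_flight += taken
        consumed.append(taken)
        # The writer drains at its own, slower pace.
        in_flight -= min(write_rate, in_flight)
    return consumed


# Illustrative values only: a fast reader, a slow writer, a 10k buffer.
rates = simulate(read_rate=5000, write_rate=500, buffer_cap=10000, ticks=6)
print(rates)  # [5000, 5000, 1000, 500, 500, 500]
```

The first ticks show the restart spike; once the buffer is full, consumption from the source settles at whatever rate the writer can sustain.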
Can you try setting this in the kafka_consumer plugin, you can set this for both 1.11 and 1.12, this is double the default. It may be that the batch size is not being filled to trigger an immediate flush.
|
I have read the description of that setting...
... and it sounds like this will only help to improve the nearness to realtime when the overall volume of data is low. In the case where there is data backing up in Kafka, there is no shortage of data to fill a batch. Can you explain why you believe that this could be an issue? |
Hi there, I had the same issue with the latest Telegraf release (1.12), so I had to downgrade to 1.11.5. These are some of the log messages after the installation of 1.12:
|
Telegraf triggers a write when the I'm considering this because of how uniform the writes / second are, looks like Telegraf is holding back or I would expect more irregularities.
I ran into this yesterday during my testing too, I have a fix for 1.12.1. It didn't occur to me until now, but you are right this would eventually eat into the undelivered messages and slow/stop the plugin.
I have not seen this one, maybe it's a follow-on issue from the prior bug? Let me put up a pull request with the fix for the |
@danielnelson writes are fairly uniform because this is all IT monitoring data, collecting on either 30 or 60 sec. intervals. If you can point me to a dev build, I can test it, but it won't be until late Sunday or early Monday. |
How about we test first and then decide whether to close this? |
We can reopen if the fix doesn't handle it. This is our workflow on issues, merging the PR closes the issue. Here are some builds with the fix: |
I've been testing the new build and everything seems to be working fine. Thanks for the fix @danielnelson. |
@oacosta40 Great news, thanks for the help! |
I was able to implement quickly this morning. Initial indications are positive. I will check back in on it in a few hours to confirm. The one thing I noticed is that this 1.13 build is missing some of the new stuff that was in 1.12, such as the |
Things look good regarding ingestion from Kafka. |
I was a bit late with using 1.12 and now I am facing the same performance issue. Hence I was searching for a solution and found this. I will update to the nightly build of 1.13 and try it out. |
@adithyamk This should be fixed in the 1.12.1 release, so it shouldn't be required to go to a nightly build. |
@danielnelson I'm running Telegraf 1.13.3 (Telegraf unknown (git: master 88a8963)) and when I try to read from Kafka I see the following messages in my debug output:
2020-03-01T12:31:24Z D! [sarama] consumer/broker/1025 abandoned subscription to my-test-topic/4 because consuming was taking too long
It seems like brokers are being torn down and rebuilt to process the ingestion. Is this normal? Could this message indicate a loss of data?
My Telegraf config looks like this:
[agent]
[[inputs.kafka_consumer]]
[[outputs.influxdb]] |
@mohsin106 this is really a different issue. It looks like Telegraf is failing to connect to any broker and is trying each broker in the My first question is whether your Kafka brokers are intentionally set to listen on port 9093? The default is 9092. |
@robcowart I can open a different issue if this isn't related. I was just trying to understand what condition would generate those log messages I was seeing. Is it a limitation of Telegraf or something on the Kafka side? I am able to get a ton of data into my InfluxDB but was concerned if any data was being dropped. The Kafka cluster is not managed by me but yes it is intentionally set to listen to port 9093. |
@mohsin106 Yes, could you open a new issue? I think maybe it is taking too long from when the message is read from Kafka until the offsets are updated. |
I'm experiencing similar performance issues. I'm using 6 Telegraf instances to produce collectd metrics to a Kafka topic. I'm then reading this Kafka topic with another Telegraf instance and sending it to InfluxDB on the same server. The data format is InfluxDB line protocol with no compression. The consumer Telegraf and InfluxDB are running on an 8-core machine with 32 GB of RAM and are hardly using any resources. However, there is consumer lag from the Telegraf consumer. I'm running InfluxDB 2.0-beta and Telegraf 1.14.4 (git: HEAD c6fff6d) in Docker. |
The fix for this issue is to make sure your
[agent]
metric_batch_size = 1000
[[inputs.kafka_consumer]]
max_undelivered_messages = 2500 |
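To see why the relationship between these two settings matters, here is a rough back-of-the-envelope model. It assumes the output writes either when a full `metric_batch_size` batch has accumulated or when `flush_interval` elapses, and that the consumer never holds more than `max_undelivered_messages` unwritten messages. The function name, the 10-second interval and the 500-message example are illustrative assumptions, not values from this thread.

```python
def interval_bound_rate(max_undelivered, batch_size, flush_interval_s,
                        metrics_per_message=1):
    """Rough upper bound on metrics/second when a batch can never fill.

    Returns None when enough metrics can be in flight to fill a batch,
    meaning writes are not gated by the flush interval in this model.
    """
    in_flight_metrics = max_undelivered * metrics_per_message
    if in_flight_metrics >= batch_size:
        return None
    # Only the timed flush ever fires, and it can carry at most what is in flight.
    return in_flight_metrics / flush_interval_s


# Settings from the config above: a full batch can form, so no interval bound.
print(interval_bound_rate(2500, 1000, flush_interval_s=10))   # None

# Hypothetical misconfiguration: 500 in flight can never fill a 1000 batch.
print(interval_bound_rate(500, 1000, flush_interval_s=10))    # 50.0
```

In this simplified model, a consumer limit below the batch size caps throughput at one partial batch per flush interval, no matter how much data is waiting in Kafka.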
Thank you, how can I have it poll more often? |
OK, I made these changes, and I think it is working. It caught up on the 25M metrics of lag and now bounces around from 0 lag to around 2500 depending on the polling. I'm still not sure if this is "right", but it seems to be working:
Thank you @danielnelson |
Hi, do you know what I should do? |
Just a heads up. Someone needs to check the newly released Telegraf 1.12. I had Telegraf pulling data from Kafka and sending it to InfluxDB. After the upgrade Telegraf started falling behind, and I could see a lag of over 2.5M events in Kafka (that is about an hour of data in this environment).
I downgraded to 1.11.5 and this previous version was able to catch up and clear the Kafka backlog in less than 2 mins.
I don't know whether it was the Kafka input or the InfluxDB output that was the problem, as there were no error logs of any kind.
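For a sense of scale, the figures in this report can be turned into rough rates. This is only arithmetic on the approximate numbers quoted above (2.5M events, about an hour, under 2 minutes), so treat the results as order-of-magnitude estimates.

```python
backlog_events = 2_500_000      # lag reported in Kafka
backlog_window_s = 60 * 60      # "about an hour of data"
catch_up_s = 2 * 60             # cleared in "less than 2 mins"

# Approximate steady-state arrival rate into Kafka.
ingest_rate = backlog_events / backlog_window_s

# While catching up, the consumer has to clear the backlog and also keep up
# with new arrivals, so this is a lower bound on what 1.11.5 sustained.
catch_up_rate = backlog_events / catch_up_s + ingest_rate

print(round(ingest_rate), round(catch_up_rate))  # 694 21528
```

So 1.11.5 had roughly 30x headroom over the normal arrival rate in this environment, while 1.12.0 could not sustain even the normal rate.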