Telegraf 1.3.0 stops sending any metrics after a few hours processing them ... maybe kafka's fault? #2870
Comments
From the stack it looks like the output is deadlocked:
Do you see anything in the InfluxDB logs?
No, nothing abnormal. After trimming out queries and httpd, it's basically this:
We do have one measurement exceeding the 100000 values per tag max, but I can't see how that would be related.
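(For context: the 100000 limit referred to here is InfluxDB's `max-values-per-tag` setting, which lives in the `[data]` section of influxdb.conf; the snippet below is only a sketch showing where it is configured, with the default value.)

```toml
[data]
  # Maximum number of values a tag may have before writes are rejected.
  # 0 disables the limit.
  max-values-per-tag = 100000
```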
Seems to be affecting the same VM out of the 6 telegrafs. Had an instance of the problem after only 10 minutes of running after the last restart. Am going to reboot the server to see if it makes a difference.
Crossing my fingers.
Unfortunately that didn't fix it :( About 90 min after rebooting, one of the nodes had the problem. Got the stacktrace here. This means I'll have to downgrade.
Anything in dmesg?
Nope.
Is it possible to narrow down the plugins? I'm especially interested in whether it still occurs if the influxdb output is replaced with a file output, and also if you remove the kafka plugin.
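(A minimal sketch of that swap in telegraf.conf, assuming the standard `outputs.file` plugin; the path is a placeholder.)

```toml
# Bisection sketch: comment out [[outputs.influxdb]] and write metrics
# to a local file instead, so the output path is taken out of the picture.
[[outputs.file]]
  files = ["/tmp/telegraf-metrics.out"]
  data_format = "influx"
```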
Sorry, I've downgraded to 1.2 now. Reproing the problem is hard enough (takes hours or days, possibly depending on volume), let alone reproing and bisecting, so I'm going to have to step away from this issue for the time being. Will close for now.
Bug report
(See https://community.influxdata.com/t/need-help-triaging-a-potential-telegraf-1-3-0-bug-with-kafka-consumer/1123 for original report.)
Since upgrading to telegraf 1.3.0, we're encountering a recurring problem.
After a few hours of running fine, 1 random telegraf instance out of 6 starts complaining in a loop about basic local collectors like cpu, memory, etc. not being able to collect within the time limit (20s):
At this point, telegraf stops consuming from kafka and writing to influxdb (including from the internal plugin). A restart of telegraf fixes the problems, and it churns away for hours more before the problem crops up again.
The log message before the looping stuff says nothing out of the ordinary.
Relevant telegraf.conf:
System info:
Steps to reproduce:
Cannot repro reliably.
Additional info:
Stack trace, taken about 20 min after the problem started:
telegraf-stuck.gostack.txt