telegraf taking too long to collect net metrics #3318

Closed

adrianlzt opened this issue Oct 10, 2017 · 13 comments

@adrianlzt
Contributor

Bug report

We are seeing this message every 10 seconds on some of our servers:

Oct 09 19:10:10 ESJC-OSH1-MA03P telegraf[1453]: 2017-10-09T17:10:10Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)

Net metrics are not being sent, but the rest are working correctly.

If I run telegraf in test mode it works correctly:

telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --test 

I have restarted telegraf on one of the failing nodes and now it is working correctly.

Killed another node with SIGQUIT: https://gist.github.com/7e999b78093bb41fd89e1314fe7e4b1b

Another SIGQUIT: https://gist.github.com/adrianlzt/09f3c4dcd5ff54d1ddc5fcb156003d7d

12h later one of the servers is having the same problem. Before failing, some other plugins were giving errors. Errors and SIGQUIT: https://gist.github.com/6c8173e791e26d78545ee7f5a00ba08e

Relevant telegraf.conf:

Loaded outputs: influxdb
Loaded inputs: inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu inputs.procstat inputs.docker inputs.procstat inputs.prometheus inputs.kubernetes inputs.net inputs.netstat inputs.prometheus inputs.procstat inputs.procstat

System info:

When the agents were killed they were running telegraf-1.3.5-1.x86_64.
Now they are running 1.4.1-1.x86_64.

OS: Red Hat Enterprise Linux Server release 7.3 (Maipo)

Additional info:

Maybe related to #2870 and #3107.

@danielnelson
Contributor

Is there more to the SIGQUIT dumps?

@adrianlzt
Contributor Author

adrianlzt commented Oct 10, 2017 via email

@danielnelson
Contributor

Yeah, could you do that? Also, when this occurs, try to make a request to https://127.0.0.1/metrics, perhaps by exec'ing a shell and using curl.
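
Something like this from that shell should do it (a rough sketch; the -k and --max-time flags are my assumptions, in case the endpoint uses a self-signed certificate or simply hangs):

# Hit the metrics endpoint directly; -k skips TLS verification (assumed self-signed
# certificate) and --max-time bounds the request so a hang shows up as a timeout.
curl -k -v --max-time 10 https://127.0.0.1/metrics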

@adrianlzt
Contributor Author

Right now it is not failing, although the kubernetes input is timing out all the time (and the docker socket from time to time).

@danielnelson
Contributor

It might be helpful to enable the internal input and watch gather_time_ns; maybe it can be correlated with another metric. I'm also curious: these plugins would get slower as more containers are added, so about how many containers are you running?
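
For the internal input, a minimal config sketch of what to add (the collect_memstats option and its default are stated here as an assumption):

# Collect telegraf's own metrics, including gather_time_ns per input.
[[inputs.internal]]
  # Also collect Go runtime memory stats (assumed option name and default).
  collect_memstats = true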

@adrianlzt
Contributor Author

These are two nodes having the problem:
(screenshot: img-2017-11-15-091546)

It doesn't seem that the number of containers is related.

Looking at the gather time of other inputs, I don't find any relation.

I have tried to find other metrics that correlate with the error, but haven't found any.

@danielnelson
Contributor

Here are a few things to check into:

  • How are the output times to InfluxDB as reported by internal_write write_time_ns?
  • How does the internal_memstats alloc_bytes look over time?
  • Are you still having timeout issues with the net input? If so, can you attach your /proc/net/dev and /proc/net/snmp files?

@adrianlzt
Contributor Author

(screenshot: img-2017-11-16-155816)
https://gist.github.com/adrianlzt/2581bae44236cf69cc711c000f117390

Yes, the issue with the net input is still happening.

@danielnelson
Contributor

I haven't been able to figure out the problem, so let's try to get the full SIGQUIT stacktrace. It might be easier to run with --pprof-addr :6060 and grab the full stacktrace from there, but either way should work.
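
A sketch of how that could look, assuming the standard Go pprof handlers are served on the --pprof-addr address:

# Run with the pprof HTTP server enabled (same config flags as before).
telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --pprof-addr :6060

# When the net input starts timing out, dump all goroutine stacks;
# debug=2 gives the full, unabridged stacktraces.
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt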

@danielnelson
Contributor

@adrianlzt I've learned that it can be quite time-consuming for the net input to discover interfaces, because of the cost of checking whether each interface is a loopback and whether it is up.

It may help if you add a list or glob (glob only in 1.5.0+) of interfaces:

[[inputs.net]]
  interfaces = ["eth*", "en*"]

Let me know if this helps.

@adrianlzt
Contributor Author

Thanks for the tip, but I cannot test anymore. New job :)

@danielnelson
Contributor

Congrats! I'm going to close this issue then. If someone else reading this has the issue and the tip above doesn't help, please open a new issue.

@jgitlin-bt

I am having this same issue and have seen multiple related reports, but haven't found anything that resolves it. I've just enabled the internal and net inputs as described and will see what data that produces.
