telegraf taking too long to collect net metrics #3318

Closed

adrianlzt opened this issue Oct 10, 2017 · 13 comments

@adrianlzt
Contributor

Bug report

We are seeing this message every 10 seconds on some of our servers:

Oct 09 19:10:10 ESJC-OSH1-MA03P telegraf[1453]: 2017-10-09T17:10:10Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)

Net metrics are not being sent, but the rest are working correctly.

If I run telegraf in test mode it works correctly:

telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --test 

I have restarted telegraf on one of the failing nodes and now it is working correctly.

Killed another node with SIGQUIT: https://gist.github.com/7e999b78093bb41fd89e1314fe7e4b1b

Another SIGQUIT: https://gist.github.com/adrianlzt/09f3c4dcd5ff54d1ddc5fcb156003d7d

12h later one of the servers is having the same problem. Before failing, some other plugins were giving errors. Errors and SIGQUIT: https://gist.github.com/6c8173e791e26d78545ee7f5a00ba08e

Relevant telegraf.conf:

Loaded outputs: influxdb
Loaded inputs: inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu inputs.procstat inputs.docker inputs.procstat inputs.prometheus inputs.kubernetes inputs.net inputs.netstat inputs.prometheus inputs.procstat inputs.procstat

System info:

When the agents were killed they were running telegraf-1.3.5-1.x86_64.
Now they are running 1.4.1-1.x86_64.

OS: Red Hat Enterprise Linux Server release 7.3 (Maipo)

Additional info:

Maybe related to #2870 and #3107.

@danielnelson
Contributor

Is there more to the SIGQUIT dumps?

@adrianlzt
Contributor Author

adrianlzt commented Oct 10, 2017 via email

@danielnelson
Contributor

Yeah, could you do that? Also, when this occurs, try to make a request to https://127.0.0.1/metrics, perhaps by exec'ing a shell and using curl.
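
Something like this from that shell should do it (a rough sketch; the -k and --max-time flags are my assumptions, in case the endpoint uses a self-signed certificate or simply hangs):

# Hit the metrics endpoint directly; -k skips TLS verification (assumed self-signed
# certificate) and --max-time bounds the request so a hang shows up as a timeout.
curl -k -v --max-time 10 https://127.0.0.1/metrics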

@adrianlzt
Contributor Author

Right now it is not failing, although the kubernetes input is timing out all the time (and the docker socket from time to time).

@danielnelson
Contributor

It might be helpful to enable the internal input and watch gather_time_ns; maybe it can be correlated with another metric. I'm also curious: these plugins would get slower as more containers are added, so about how many containers are you running?
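
For the internal input, a minimal config sketch of what to add (the collect_memstats option and its default are stated here as an assumption):

# Collect telegraf's own metrics, including gather_time_ns per input.
[[inputs.internal]]
  # Also collect Go runtime memory stats (assumed option name and default).
  collect_memstats = true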

@adrianlzt
Contributor Author

These are two nodes having the problem:
(screenshot: img-2017-11-15-091546)

It doesn't seem that the number of containers is related.

Looking at the gather time of other inputs, I don't find any relation.

I have tried to find other metrics that correlate with the error, but haven't found any.

@danielnelson
Contributor

Here are a few things to check into:

  • How are the output times to InfluxDB as reported by internal_write write_time_ns?
  • How does the internal_memstats alloc_bytes look over time?
  • Are you still having timeout issues with the net input? If so, can you attach your /proc/net/dev and /proc/net/snmp files?

@adrianlzt
Contributor Author

(screenshot: img-2017-11-16-155816)
https://gist.github.com/adrianlzt/2581bae44236cf69cc711c000f117390

Yes, the issue with the net input is still happening.

@danielnelson
Contributor

I haven't been able to figure out the problem, so let's try to get the full SIGQUIT stacktrace. It might be easier to run with --pprof-addr :6060 and grab the full stacktrace from there, but either way should work.
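
A sketch of how that could look, assuming the standard Go pprof handlers are served on the --pprof-addr address:

# Run with the pprof HTTP server enabled (same config flags as before).
telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --pprof-addr :6060

# When the net input starts timing out, dump all goroutine stacks;
# debug=2 gives the full, unabridged stacktraces.
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt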

@danielnelson
Contributor

@adrianlzt I've learned that it can be quite time-consuming for the net input to discover interfaces, because of the cost of checking whether each interface is a loopback and whether it is up.

It may help if you add a list or glob (glob only in 1.5.0+) of interfaces:

[[inputs.net]]
  interfaces = ["eth*", "en*"]

Let me know if this helps.

@adrianlzt
Contributor Author

Thanks for the tip, but I cannot test anymore. New job :)

@danielnelson
Contributor

Congrats! I'm going to close this issue then. If someone else reading this has the issue and the tip above doesn't help, please open a new issue.

@jgitlin-bt

I am having this same issue and have seen multiple related reports, but haven't found anything that resolves it. I've just enabled the internal and net inputs as described and will see what data that produces.
