New snmp plugin a bit slow #1665
Comments
How many records do you have in those tables? It is true that the plugin doesn't do multiple agents in parallel, but the old one didn't either. Did the old version of the plugin perform faster? Or did you not use it? |
Not many. 6 interfaces only. I never tried the old plugin since I saw there was a new one around the corner. I agree that increasing serial performance is important to be able to query a single host/table fast enough. But at some point I think parallelizing would become necessary to be able to query enough hosts within the allotted interval. Of course a workaround would be to split up the config and run dozens of telegrafs simultaneously. |
Hrm, is this a wan link then? I'm just trying to figure out why it would be slow. Like even 2-3 seconds for Oh, I'm not saying we shouldn't do parallelization, just that fixing the serial performance should be prioritized. |
Agreed. Yes, this is over a WAN link so that is why even snmpwalk is rather slow. |
Thanks, I'll look into simulating a high latency link and getting the performance on par with the net-snmp tools. |
Great. I will hopefully have access to the low-latency environment where we will be using it next week and give you some performance numbers from there as soon as I can. |
I've done some experimentation, and while I'm not sure how
These settings should work better than the defaults on a high latency link. You might also be able to tweak them some more to get even better performance for your specific link. And changing the However I do have some code change ideas to speed things up which I'm trying out right now. @jwilder I wouldn't consider this a bug. Everything works as it's supposed to. This is just a request to make it faster. Nor is more info needed. Thanks :-) |
max-repetitions = 10 is the default of net-snmp utils according to http://net-snmp.sourceforge.net/docs/man/snmpbulkwalk.html retries = 3 is the default of gosnmp: https://godoc.org/github.com/soniah/gosnmp#pkg-variables Could deal with some parts of the performance issues reported by #1665
My case will of course be atypical, but I'm polling roughly 600 clients at a time and pulling maybe 3-4 tables and a few odd OIDs. The plugin is far too slow to accomplish this. I've had to fall back to a suite of Bash scripts making forked snmpget/snmptable calls to make up the difference. Just for a comparison between the two, I'm using Bash to call snmptable on 2 tables with roughly 8 columns each as well as pulling down 7 OIDs using snmpget for 10 hosts. It's pulled together into InfluxDB line protocol and echoed back. Unfortunately I can't release the data being pulled, but I could potentially release the code being called if anyone is interested.
Using the plugin to do exactly the same:
When I look through the plugin code, I see some attempts to use an SNMP library for some calls, but the much faster C-built Linux utilities are used as well. If the goal was to limit dependencies, it didn't work. Not to mention, the SNMP project seems to be relatively in its infancy and probably not well suited for production collection. A lot of the slowdowns in the code are caused by executing all operations serially. Why are channels/parallel functions not being used? |
I was able to get an improvement of almost 1/3 of the time simply by parallelizing that first piece of agent code in the Gather() function.
Code that I changed:
|
Because the underlying gosnmp library does not support it. We would have to spawn dozens of copies of it to achieve parallelism. And doing so in a manner that is controllable is difficult. We'd basically have to create a pool. |
@phemmer Seems we're both looking at this. See my obviously quick + dirty test code above. We don't need to necessarily parallelize the gosnmp library but all the calls that are happening serially and being waited on. |
Your code will cause problems because you are reusing the same gosnmp object. It is not parallel safe. Doing so will result in receive errors. |
Posting the full code this time instead of the diffs... but no, I'm instantiating a separate gosnmp object in each parallel call:
|
Yes, that should in theory not cause any problems. But it is not how I would recommend addressing the issue. Much better results can be obtained by sending multiple simultaneous requests per-agent. For people requesting a large number of OIDs from one agent, your change won't help. The only way to send parallel requests per agent is to either create multiple gosnmp objects, or fix the gosnmp library so it's parallel safe. The latter is a much better solution as it scales far better than a pool. |
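The per-agent approach discussed above (a separate session object in every parallel call, never a shared one) can be sketched roughly as follows. This is only an illustration: `client`, `poll` and `gatherAll` are hypothetical names, the poll is simulated with a sleep, and no real gosnmp call is made.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// client stands in for one SNMP session. It is hypothetical: a real
// implementation would wrap its own gosnmp object, never a shared one.
type client struct {
	agent string
	delay time.Duration // simulated round-trip time
}

// poll pretends to fetch one value from the agent.
func (c *client) poll() string {
	time.Sleep(c.delay)
	return "ok:" + c.agent
}

// gatherAll polls every agent in its own goroutine. Each goroutine
// builds a private client, so no session is shared between goroutines.
func gatherAll(agents []string, delay time.Duration) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(agents))
	)
	for _, a := range agents {
		wg.Add(1)
		go func(agent string) {
			defer wg.Done()
			c := &client{agent: agent, delay: delay} // one client per goroutine
			v := c.poll()
			mu.Lock() // the result map is shared, so guard writes to it
			results[agent] = v
			mu.Unlock()
		}(a)
	}
	wg.Wait()
	return results
}

func main() {
	agents := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	start := time.Now()
	res := gatherAll(agents, 50*time.Millisecond)
	fmt.Println(len(res), "agents polled in", time.Since(start).Round(10*time.Millisecond))
}
```

As noted in the comment above, this only helps when there are many agents; it does nothing for a large number of OIDs requested from a single agent.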
Agreed, but I'm on the scale of using ~600 agents, so parallelizing this makes a huge difference. One question: why are you using this gosnmp library? The code already makes calls to net-snmp-utils programs for snmptranslate/snmptable/etc, so why not just use them throughout? Making parallel calls to these programs would be parallel safe. The only reason I can think of to stay with the gosnmp library would be to reduce dependencies; however, the dependencies are already implicit in using the aforementioned programs. |
These utilities are optional. They add additional functionality to the plugin, but the plugin does not require them. They are basically just used for parsing MIB files. |
Hi @StianOvrevage, we are working on an SNMP collector tool for InfluxDB that behaves well with lots of metrics. It's different from Telegraf because it is focused only on SNMP devices, and it also has a web UI which helps us configure it in an easy way. Perhaps you would like to test its performance: https://github.com/toni-moreno/snmpcollector Thank you and sorry for the spam |
@toni-moreno Awesome! I will have a look at it when I have time. I would love to give you some feedback and performance numbers from real-world testing at a few different setups I have available. |
Hi @willemdh. I suggest testing snmpcollector (https://github.com/toni-moreno/snmpcollector). We are gathering 200k metrics per minute from close to 300 devices with only one agent and very low CPU (less than 10%) in a little VM with only 8 cores. I would like to get some more feedback about the performance of this tool. Thank you very much. |
No offense, but why is it that every single ticket that is opened that mentions the snmp plugin gets an advertisement for snmpcollector? |
Imho I also prefer to get this working in Telegraf itself. Network monitoring is an important piece of any monitoring tool and should work with reasonable load in Telegraf. Can anyone suggest how to improve my posted Telegraf configuration, or explain why the load is going up and down? |
@willemdh I would open up a new issue. Your problem is not what this ticket is about. I would also suspect your config is a lot more complex than what you show, as the config you provided cannot account for that much CPU usage. |
@phemmer Thanks for commenting and acknowledging this is not normal behaviour. I'll asap make some time to thoroughly document the setup in a new issue. (The config I provided really is the relevant part of my setup, except that I have 10 configuration files in telegraf.d, each file for 1 switch.) |
Same issue here. I want to poll a few hundred SNMP network devices every minute using the telegraf snmp input plugin. But initial setup has shown that the plugin takes 15 seconds just to poll 3 devices; adding 20 more means telegraf won't finish a poll before the next one starts. |
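The arithmetic behind that estimate can be written out as a small sketch. It assumes that serial poll time grows linearly with the number of devices, which is an assumption and not something measured in this thread; `serialPollSeconds` and `minWorkers` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// serialPollSeconds extrapolates a serial poll time from a measured sample,
// assuming the per-device cost stays constant (an assumption).
func serialPollSeconds(sampleSeconds float64, sampleDevices, totalDevices int) float64 {
	return sampleSeconds / float64(sampleDevices) * float64(totalDevices)
}

// minWorkers returns the smallest number of equally loaded parallel pollers
// needed to finish a serial workload within one collection interval.
func minWorkers(totalSeconds, intervalSeconds float64) int {
	return int(math.Ceil(totalSeconds / intervalSeconds))
}

func main() {
	// 15 s measured for 3 devices, extrapolated to 3 + 20 = 23 devices.
	total := serialPollSeconds(15, 3, 23)
	fmt.Printf("%.0f s serial, %d workers for a 60 s interval\n", total, minWorkers(total, 60))
	// prints "115 s serial, 2 workers for a 60 s interval"
}
```

Under that linear assumption, 23 devices need about 115 seconds serially, which already exceeds a 60 second interval.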
Same issue here. I use the SNMP input plugin to collect from 500 devices with 60 metrics each, and a full collection takes 10 minutes ... but my requirement is one minute |
Just wanted to share my experience in case it helps anyone else out. We run all of our collection using the official telegraf docker image, and up until I started to run into issues we ran everything within a single container. My CPU wasn't necessarily overly high, but I noticed that my graphs started to look very sporadic, with high/low spikes rather than the smooth line I was expecting. This got worse as I kept adding new devices to be polled. I could see that the timestamps stored in influx were not consistently 1 minute apart, so due to the varying collection intervals, functions like non_negative_derivative would report values out of range. Example Since we build a custom docker image with telegraf as the base image, I elected to move a number of my snmp configs into separate containers. So rather than one container polling 25 devices, I broke things down into containers by device role, e.g. Firewalls, Routers, Switches, etc. The only extra work this required was a few extra Dockerfiles and updating my Makefile to produce different container names for these new roles (each container only had a copy of the config files for the devices which fell into that role). After doing this my graphs immediately corrected themselves. I would obviously prefer to manage a single container for all devices, but this turned out to require very little effort to achieve similar results. |
Yeah, this issue, and everything else in this ticket boils down to the fact that the SNMP plugin runs serially. But the root issue keeping this from being addressed is the underlying SNMP library the plugin is using. It does not properly support parallel requests. Meaning you'd have to create multiple instances of the plugin running in memory. The memory usage of the plugin is rather high (due to buffering and such). Some users have thousands of network devices they want to poll, thus we cannot do this or the memory overhead would become huge. |
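The pool idea mentioned earlier in the thread, where only a fixed number of polls run at once, might look something like the sketch below. It is a sketch under assumptions: `pollFunc` and `gatherBounded` are hypothetical names, the poll itself is a placeholder, and nothing here measures or addresses the per-instance memory cost described above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// pollFunc is a hypothetical stand-in for polling one agent.
type pollFunc func(agent string)

// gatherBounded polls all agents but keeps at most limit polls in flight.
// It returns the highest number of polls observed running at the same time.
func gatherBounded(agents []string, limit int, poll pollFunc) int64 {
	var (
		wg       sync.WaitGroup
		inFlight int64
		peak     int64
		sem      = make(chan struct{}, limit) // counting semaphore
	)
	for _, a := range agents {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit polls are already running
		go func(agent string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next agent
			n := atomic.AddInt64(&inFlight, 1)
			for { // record a new peak if this goroutine raised it
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			poll(agent)
			atomic.AddInt64(&inFlight, -1)
		}(a)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	agents := make([]string, 20)
	for i := range agents {
		agents[i] = fmt.Sprintf("agent-%d", i)
	}
	peak := gatherBounded(agents, 4, func(string) { time.Sleep(10 * time.Millisecond) })
	fmt.Println("peak concurrent polls:", peak)
}
```

The limit trades collection time against the number of sessions active at once, so it would have to be tuned per deployment.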
@phemmer I'm able to open a PR and contribute my earlier code (once I've updated it) that "fixed" some of the parallelization issues we saw. Are you okay with proceeding with it as a workaround until the goSNMP project can be fixed? I started deep-diving the goSNMP project and it's a bit of a mess. It almost needs to be rebuilt from the RFCs up. Interested in how you'd recommend tackling it. |
Hi @Will-Beninger, @phemmer. Sorry for my ignorance regarding the SNMP protocol and the parallelization issues. I would like to know why you say that gosnmp cannot "handle" parallelization. I've been doing some tests with multiple parallel SNMP handlers in gosnmp, and they work fine for me (gosnmp/gosnmp#64 (comment)). I also fixed some performance issues detected while doing these parallelizations (gosnmp/gosnmp#102). I'm confused; I hope you can shed some light on why the snmp plugin cannot handle parallel requests and how that relates to the base gosnmp library's ability to do so. Thank you very much |
@toni-moreno the gosnmp plugin is built in such a way that each remote server is hardcoded into the base object. Looking at your parallel scripts like this, you're only attempting to poll 1 device (and the loopback address at that) and mainly just parallelize the oid walks that you're doing. What this plugin is attempting to leverage is the polling of hundreds of different devices with potentially different OIDs. (My original use case had 500+ devices pulling hundreds of similar OIDs each) This leaves us with 2 choices:
As to @phemmer 's concerns, I don't have a GREAT understanding of the underlying gosnmp library and would prefer he address that. I'm reading through but I see some areas where you'll see parallelized slowdowns and wait times for sending out requests such as the sendOneRequest() function and the send() function in marshal.go. There's a full pause, wait for retries, and check only at the beginning of the loop function for exceeding the retry timer. Honestly, I don't know the best way to solve this use case. Appreciate input from both of you. |
@Will-Beninger If you could open a pull request with the parallel execution that would be very much appreciated. |
A little something to add to this - we're currently bumping into this error in telegraf "socket: too many open files". This is related to us instantiating a separate input.snmp instance in our configuration file per device (they all have different community strings). We have about 1800 devices in total right now. I have the clause I bring this up as I'm unsure if this parallelization effort will also end up running into this wall when many devices are being polled. |
No, parallelization will make the issue worse, which is why I'm not fond of it. The issue really needs to be fixed within the gosnmp lib. @jasonkeller See also https://www.freedesktop.org/software/systemd/man/systemd.unit.html (search for "drop-in") about how to alleviate your issue with package upgrades clobbering your |
Thanks @phemmer ! I had begun to wonder about how to keep those local overrides in but that link spells it out quite plainly (and I now have it integrated properly). Saved me loads of searching - thank you again. |
Shouldn't the number of open sockets remain the same since we currently keep all sockets open between gathers? |
Support for concurrently gathering across agents has been merged into master and should show up in the nightly builds in the next 24 hours. I expect this should help significantly if you have many agents. I would appreciate any testing and feedback on how well this works in practice, we can determine if this issue can be closed based on what we learn. |
Thanks @danielnelson Time taken to poll 10 devices on latest nightly Time taken to poll same 10 devices with stable real 0m16.728s I will add more devices and report times |
I am having trouble monitoring when some devices return errors. When I check the log, it seems that the snmp plugin keeps trying each field for a long time, and the other devices have to wait. I guess this must be because I am putting all IP addresses in one agent list: when an error happens on one device, the others have to wait. This means that to avoid it I have to separate the devices by copying the same config for each one, which makes for a long config file given the large property I monitor. Is there any way to let the snmp plugin work asynchronously for the different IPs listed in the agent list? If so, users would save a lot of time creating an snmp config file. |
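The behaviour asked for here, where one unresponsive device does not hold up the others, can be illustrated with a minimal sketch that gives each agent its own goroutine and its own deadline. The names `pollAgent` and `gatherWithTimeout` are hypothetical and the latencies are simulated; this is not how the plugin itself is implemented.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// pollAgent is a hypothetical poll that answers after latency, or gives up
// as soon as ctx is done.
func pollAgent(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// gatherWithTimeout polls every agent concurrently. Each agent gets its own
// deadline, so a slow or dead device cannot delay the healthy ones.
// Latencies and the timeout are given in milliseconds.
func gatherWithTimeout(latenciesMs map[string]int, timeoutMs int) map[string]error {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		errs = make(map[string]error, len(latenciesMs))
	)
	timeout := time.Duration(timeoutMs) * time.Millisecond
	for agent, ms := range latenciesMs {
		wg.Add(1)
		go func(agent string, latency time.Duration) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			err := pollAgent(ctx, latency)
			mu.Lock()
			errs[agent] = err
			mu.Unlock()
		}(agent, time.Duration(ms)*time.Millisecond)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := gatherWithTimeout(map[string]int{"healthy": 10, "dead": 60000}, 100)
	for agent, err := range errs {
		fmt.Println(agent, "timed out:", errors.Is(err, context.DeadlineExceeded))
	}
}
```

With this shape the whole gather takes about as long as the slowest healthy agent or the timeout, whichever is larger, instead of the sum of all waits.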
@justindiaw What you are experiencing should be addressed in the 1.5 release, could you try the nightly build and let me know if it is working well for you. |
@danielnelson Thanks for the fast reply. Good to know that. I'm going to try the new release. |
Just to be clear, the 1.5 release with the change is not yet released, if you are able to help with testing you will need to use a nightly build or compile from source. |
Should be a big improvement in 1.5, I'm closing this issue and we can open more targeted issues if needed. |
Hi there, does anyone still face this slow collection in 2022?
I have a few problems with performance of the new SNMP plugin.
When doing snmpwalk of EtherLike-MIB::dot3StatsTable and IF-MIB::ifXTable and IF-MIB::ifTable on a Cisco router they complete in ~2, ~3 and ~3.3 seconds respectively (8.3 sec combined +/- 10%). When polling with the snmp plugin it takes 17-19 seconds for a single run.
I'm unsure if the snmp plugin polls every host in parallel or in sequence. I only have one host to test against and even when I put each of the three tables in separate [[inputs.snmp]] sections they are polled sequentially and not in parallel.
Our needs are polling hundreds of devices with hundreds of interfaces every 5 or 10 seconds (which collectd and libsnmp do easily).