SNMP panic #15200
Comments
@llamafilm could you please try to reproduce this with the latest master, and maybe with only the SNMP plugin (plus a file output)? We shifted the SNMP code around quite a bit between v1.29 and v1.30...
You need to look at the correct version of the source code. Going through the stack trace, the panic actually happens here: octets := v.Bytes()
Thanks for calling that out. I guess the question then is whether Telegraf should make a change as well. If the value is nil, should Telegraf even be calling the format value function?
I would try to get that fixed upstream and see what the maintainers say.
I have put up issue sleepinggenius2/gosmi#44 and a PR sleepinggenius2/gosmi#45. Happy to have reviews or comments on those. I did not realize this library had not had many updates in a while, so let's see if we get a response.
It looks like the maintainer hasn't been very active lately. Let's see indeed.
It seems like the upstream library has been abandoned. What should be done about this? This same crash happened again today on version 1.30.3. Do you have any ideas how I could determine which SNMP device is the cause? It happens very intermittently, and I have hundreds of SNMP devices in the config, so I can't easily test them one by one.
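One generic way to narrow down which device triggers an intermittent panic is to recover from it close to the per-device work and attach the device address to the resulting error. The following is only a minimal sketch of that idea in plain Go. The names safeGather and the gather callback are hypothetical and are not part of Telegraf or gosmi, and the agent addresses are made up for illustration.

```go
package main

import "fmt"

// safeGather is a hypothetical wrapper (not a Telegraf API). It runs gather
// for one agent and converts a panic into an error that names the agent.
func safeGather(agent string, gather func(agent string) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Overwrite the named return value so the caller sees the panic
			// as an ordinary error tagged with the agent address.
			err = fmt.Errorf("panic while gathering from agent %q: %v", agent, r)
		}
	}()
	return gather(agent)
}

func main() {
	// Made-up agent addresses, for illustration only.
	agents := []string{"udp://192.0.2.1:161", "udp://192.0.2.2:161"}

	// Simulated polling function: the second agent triggers a panic.
	poll := func(agent string) error {
		if agent == "udp://192.0.2.2:161" {
			panic("simulated nil value")
		}
		return nil
	}

	for _, a := range agents {
		if err := safeGather(a, poll); err != nil {
			fmt.Println(err)
		}
	}
}
```

Note that recover only catches a panic raised in the same goroutine as the deferred function, so a wrapper like this would have to sit inside whatever goroutine does the per-device polling.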
Here's a more concise log output
Hi, we chatted about this briefly today, and the next step is to see what Telegraf can do about this, either by handling the nil value or by adding some other check. We will not be forking the upstream project unless we absolutely must do so.
There is an upstream PR to resolve an issue when a nil value is passed. However, that PR has not been looked at or merged. As such, this attempts to catch that scenario in Telegraf first. See: sleepinggenius2/gosmi#45 fixes: influxdata#15200
I've put up #15743, but I'm not entirely sure if that resolves this or is the correct behavior. Essentially, I think your use-case is a nil value and we should return an empty string. Correct me if that is wrong.
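To make the idea of "nil value in, empty string out" concrete, here is a minimal sketch of such a guard. It is an illustration under assumptions, not the code in #15743 or in gosmi: formatValue is a hypothetical helper, and rendering byte slices as hex is chosen only to have something to format.

```go
package main

import "fmt"

// formatValue is a hypothetical helper, not Telegraf's or gosmi's actual
// code. It returns an empty string for a nil value instead of handing the
// value to formatting logic that might dereference it.
func formatValue(v interface{}) string {
	if v == nil {
		// Guard: nothing to format, so return an empty string.
		return ""
	}
	switch val := v.(type) {
	case []byte:
		// Render octets as lowercase hex (an arbitrary choice for this sketch).
		return fmt.Sprintf("%x", val)
	default:
		return fmt.Sprint(val)
	}
}

func main() {
	fmt.Printf("%q\n", formatValue(nil))                // prints ""
	fmt.Printf("%q\n", formatValue([]byte{0xde, 0xad})) // prints "dead"
}
```

One caveat with this kind of check: v == nil is only true for an untyped nil interface. A nil pointer stored in an interface value would pass the guard, so a real fix may need a type-specific check as well.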
@llamafilm did you have any chance to test the mentioned PR? There is a release on Monday and we would really love to include this fix!
I haven't updated yet. The crash has not happened again since I last mentioned it a month ago. If the fix is low risk then I would suggest you go ahead and include it in the release. Then I'll upgrade and if it ever happens again I'll reopen this issue.
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.29.5-66b924ec, Ubuntu 22.04.4
Docker
No response
Steps to reproduce
Unknown
Expected behavior
no crash
Actual behavior
Telegraf had been running for several days under systemd, and this weekend it crashed. Systemd tried to restart it several times, and it kept crashing repeatedly. This log snippet from journald shows a full cycle, beginning after the first crash, until it crashes again. My telegraf config is several thousand lines long, so I'm not sure which part is relevant here. I have dozens of different SNMP devices with different input configs and processors.
There was a power outage Saturday morning, about 24 hours before this crash occurred, so it's likely some of the SNMP devices were in a bad state, but I can't reproduce it. This morning after restarting the service it's working fine.
Additional info
I built this telegraf binary using the custom builder to reduce the input and output plugins, but I did not customize anything else. So it's weird that the log references lines that don't exist, like snmp.go:323.