Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNMP panic #15200

Closed
llamafilm opened this issue Apr 22, 2024 · 12 comments · Fixed by #15743
Closed

SNMP panic #15200

llamafilm opened this issue Apr 22, 2024 · 12 comments · Fixed by #15743
Assignees
Labels
bug unexpected problem or unintended behavior size/s 1 day effort, great beginniner issue upstream bug or issues that rely on dependency fixes

Comments

@llamafilm
Copy link
Contributor

llamafilm commented Apr 22, 2024

Relevant telegraf.conf

[agent]
  snmp_translator = "gosmi"

Logs from Telegraf

Starting Telegraf...
telegraf.service: Can't open PID file /run/telegraf.pid (yet?) after start: Operation not permitted
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/inputs_snmp_trap.conf
2024-04-21T14:50:06Z W! DeprecationWarning: Option "timeout" of plugin "inputs.snmp_trap" deprecated since version 1.20.0 and will be removed in 2.0.0: unused option
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/netbox-cloud1.conf
2024-04-21T14:50:06Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.prometheus" deprecated since version 1.29.0 and will be removed in 2.0.0: use 'fieldinclude' instead
2024-04-21T14:50:06Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.prometheus" deprecated since version 1.29.0 and will be removed in 2.0.0: use 'fieldinclude' instead
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/outputs_grafanacloud.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/outputs_loki.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/processors.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/telegraf.conf
2024-04-21T14:50:06Z I! Starting Telegraf 1.29.5-66b924ec (customized) brought to you by InfluxData the makers of InfluxDB
2024-04-21T14:50:06Z I! Available plugins: 17 inputs, 0 aggregators, 7 processors, 5 parsers, 3 outputs, 0 secret-stores
2024-04-21T14:50:06Z I! Loaded inputs: cpu disk diskio exec (2x) internal mem net net_response netstat ping (66x) prometheus (8x) snmp (19x) snmp_trap swap system systemd_units
2024-04-21T14:50:06Z I! Loaded aggregators:
2024-04-21T14:50:06Z I! Loaded processors: converter (4x) enum (3x) regex (3x) rename (2x) scale (2x) starlark strings
2024-04-21T14:50:06Z I! Loaded secretstores:
2024-04-21T14:50:06Z I! Loaded outputs: http loki
2024-04-21T14:50:06Z I! Tags enabled: host=cloud1
2024-04-21T14:50:06Z W! Deprecated inputs: 0 and 1 options
2024-04-21T14:50:06Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"cloud1", Flush Interval:1m30s
2024-04-21T14:50:06Z D! [agent] Initializing plugins
2024-04-21T14:50:06Z I! Config watcher started
Started Telegraf.
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:07Z I! [inputs.prometheus::9] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::36] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::47] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::62] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::78] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::79] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::5844] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::5915] Using the label selector:  and field selector:
2024-04-21T14:50:07Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat' plugin instead
2024-04-21T14:50:07Z I! [processors.regex] tags: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] tags: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z D! [agent] Connecting outputs
2024-04-21T14:50:07Z D! [agent] Attempting connection to [outputs.http::grafana-cloud]
2024-04-21T14:50:07Z D! [agent] Successfully connected to outputs.http::grafana-cloud
2024-04-21T14:50:07Z D! [agent] Attempting connection to [outputs.loki]
2024-04-21T14:50:07Z D! [agent] Successfully connected to outputs.loki
2024-04-21T14:50:07Z D! [agent] Starting service inputs
2024-04-21T14:50:07Z I! [inputs.snmp_trap] Listening on udp://:162
2024-04-21T14:50:07Z E! [inputs.snmp::70] Error in plugin: agent 10.92.177.47: performing get on field Cat1600SerialNumber: error reading from socket: read udp 100.65.16.139:11296->10.92.177.47:161: recvfrom: connection refused
2024-04-21T14:50:07Z E! [inputs.snmp::71] Error in plugin: agent 10.92.177.48: performing get on field Cat1600SerialNumber: error reading from socket: read udp 100.65.16.139:36407->10.92.177.48:161: recvfrom: connection refused
2024-04-21T14:50:07Z D! [outputs.http::grafana-cloud] Wrote batch of 1000 metrics in 237.499482ms
2024-04-21T14:50:07Z D! [outputs.http::grafana-cloud] Buffer fullness: 304 / 1000000 metrics
2024-04-21T14:50:07Z D! [processors.starlark] Removed string field: serverSoftwareVersion from snmp_adder_aim
panic: interface conversion: interface {} is nil, not []uint8
goroutine 642 [running]:
github.com/sleepinggenius2/gosmi/models.GetEnumBitsFormatted({0x0?, 0x0?}, 0x90?, 0xc00221da20?)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/[email protected]/models/format_bits.go:31 +0x598
github.com/sleepinggenius2/gosmi/models.Type.FormatValue({0xb, 0x1, {0x0, 0x0}, 0xc002112340, {0x0, 0x0}, {0x22a2cb2, 0x4}, {0x3939620, ...}, ...}, ...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/[email protected]/models/format.go:163 +0x23c
github.com/sleepinggenius2/gosmi/models.Node.FormatValue(...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/[email protected]/models/format.go:127
github.com/influxdata/telegraf/plugins/inputs/snmp.(*gosmiTranslator).SnmpFormatEnum(0xc0004c6037?, {0xc002143230?, 0x265ed10?}, {0x0, 0x0}, 0x0)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/gosmi.go:139 +0x338
github.com/influxdata/telegraf/plugins/inputs/snmp.fieldConvert({0x267f628, 0x3939620}, {0xc0004c6037, 0x4}, {{0x0, 0x0}, {0xc002143230, 0x23}, 0x5})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:771 +0xb27
github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build({{0x22a29b6, 0x4}, {0x0, 0x0, 0x0}, 0x0, {0xc0001fb208, 0x10, 0x10}, {0x0, ...}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:439 +0x685
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).gatherTable(0xc00047db00, {0x2699be0, 0xc0003b1e80}, {0x268e2d0, 0xc002396f20}, {{0x22a29b6, 0x4}, {0x0, 0x0, 0x0}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:355 +0xa7
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather.func1(0xc00257cfd0?, {0xc0004b0de1, 0xc})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:337 +0x20b
created by github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather in goroutine 641
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:323 +0x66
telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
telegraf.service: Failed with result 'exit-code'.
telegraf.service: Consumed 1.379s CPU time.
telegraf.service: Scheduled restart job, restart counter is at 2.
Stopped Telegraf.
telegraf.service: Consumed 1.379s CPU time.

System info

Telegraf 1.29.5-66b924ec, Ubuntu 22.04.4

Docker

No response

Steps to reproduce

Unknown

Expected behavior

no crash

Actual behavior

Telegraf has been running for several days under systemd, and this weekend it crashed. Systemd tried to restart it several times, and it kept crashing repeatedly. This log snippet from journald shows a full cycle, beginning after the first crash, until it crashes again. My telegraf config is several thousand lines long, so I'm not sure which part is relevant here. I have dozens of different SNMP devices with different input configs and processors.

There was a power outage Saturday morning, about 24 hours before this crash occurred, so it's likely some of the SNMP devices were in a bad state, but I can't reproduce it. This morning after restarting the service it's working fine.

Additional info

I built this telegraf binary using the custom builder to reduce the input and output plugins. But I did not customize anything else. So it's weird that the log references lines that don't exist like snmp.go:323.

@llamafilm llamafilm added the bug unexpected problem or unintended behavior label Apr 22, 2024
@srebhan
Copy link
Member

srebhan commented Apr 22, 2024

@llamafilm could you please try to reproduce this with latest master and maybe only the SNMP (and a file output) plugin?!? We shifted code for SNMP quite a bit between v1.29 and v1.30...

@Hipska
Copy link
Contributor

Hipska commented Apr 23, 2024

You need to look at the correct version of the source code, this is snmp.go:323: https://github.com/influxdata/telegraf/blob/v1.29.5/plugins/inputs/snmp/snmp.go#L323

Going though the stack trace, the panic actually happens here:
https://github.com/sleepinggenius2/gosmi/blob/v0.4.4/models/format_bits.go#L31
Which should be replaced by

	octets := v.Bytes()

@powersj
Copy link
Contributor

powersj commented Apr 23, 2024

Thanks for calling that out. I guess the question then is if Telegraf should make a change as well? If the value is nil, should Telegraf even be calling the format value function?

@Hipska
Copy link
Contributor

Hipska commented Apr 23, 2024

I would try to get that fixed upstream and see what the maintainers say.

@powersj
Copy link
Contributor

powersj commented Apr 24, 2024

I have put up issue sleepinggenius2/gosmi#44 and a PR sleepinggenius2/gosmi#45

Happy to have reviews or comments to those. I did not realize this library had not had a lot of updates in a while, so let's see if we get a response.

@powersj powersj added the upstream bug or issues that rely on dependency fixes label Apr 24, 2024
@Hipska
Copy link
Contributor

Hipska commented Apr 24, 2024

It appears like the maintainer didn't do much anymore lately. Let's see indeed.

@llamafilm
Copy link
Contributor Author

llamafilm commented Aug 7, 2024

It seems like the upstream library has been abandoned. What should be done about this?

This same crash happened again today on version 1.30.3. Do you have any ideas how I could determine which SNMP device is the cause? It happens very intermittently, and I have hundreds of SNMP devices in the config, so I can't easily test them one by one.

@llamafilm
Copy link
Contributor Author

Here's a more concise log output

Started Telegraf.
panic: interface conversion: interface {} is nil, not []uint8
goroutine 641 [running]:
github.com/sleepinggenius2/gosmi/models.GetEnumBitsFormatted({0x0?, 0x0?}, 0x2?, 0xc000534680?)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/[email protected]/models/format_bits.go:31 +0x598
github.com/sleepinggenius2/gosmi/models.Type.FormatValue({0xb, 0x1, {0x0, 0x0}, 0xc0032e7480, {0x0, 0x0}, {0x234a0fd, 0x4}, {0x3a331c0, ...}, ...}, ...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/[email protected]/models/format.go:163 +0x23c
github.com/sleepinggenius2/gosmi/models.Node.FormatValue(...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/[email protected]/models/format.go:127
github.com/influxdata/telegraf/internal/snmp.(*gosmiTranslator).SnmpFormatEnum(0xc0004bcd17?, {0xc003c41920?, 0x271da20?}, {0x0, 0x0}, 0x0)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/translator_gosmi.go:68 +0x338
github.com/influxdata/telegraf/internal/snmp.(*Field).Convert(0xc003162b60, {{0x0, 0x0}, {0xc003c41920, 0x23}, 0x5})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/field.go:251 +0xab8
github.com/influxdata/telegraf/internal/snmp.Table.Build({{0x23499c1, 0x4}, {0x0, 0x0, 0x0}, 0x0, {0xc000113c08, 0x10, 0x10}, {0x0, ...}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/table.go:175 +0x66d
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).gatherTable(0xc00043bb00, {0x275aec0, 0xc000866b60}, {0x274ec40, 0xc001e71b80}, {{0x23499c1, 0x4}, {0x0, 0x0, 0x0}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:135 +0x87
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather.func1(0xc001ecdfd0?, {0xc0004f4211, 0xc})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:117 +0x20b
created by github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather in goroutine 608
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:103 +0x66
telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
telegraf.service: Failed with result 'exit-code'.

@powersj
Copy link
Contributor

powersj commented Aug 14, 2024

Hi,

We chatted about this briefly today and the next steps will involve looking to see what Telegraf can do about this. Either by dealing with the nil or adding some sort of other check. We will not be forking the upstream project unless we absolutely must do so.

@powersj powersj added help wanted Request for community participation, code, contribution size/s 1 day effort, great beginniner issue labels Aug 14, 2024
powersj added a commit to powersj/telegraf that referenced this issue Aug 14, 2024
There is an upstream PR to resolve an issue when a nil value is passed.
However that PR has not been looked at or merged. As such, this attempts
to catch that scenario in Telegraf first.

See: sleepinggenius2/gosmi#45

fixes: influxdata#15200
@powersj
Copy link
Contributor

powersj commented Aug 14, 2024

I've put up #15743, but I'm not entirely sure if that resolves this or is the correct behavior. Essentially, I think your use-case is a nil value and we should return an empty string. Correct me if that is wrong.

@srebhan srebhan added the waiting for response waiting for response from contributor label Aug 15, 2024
@srebhan srebhan self-assigned this Aug 26, 2024
@srebhan srebhan added waiting for response waiting for response from contributor and removed waiting for response waiting for response from contributor labels Aug 26, 2024
@srebhan
Copy link
Member

srebhan commented Sep 4, 2024

@llamafilm did you have any chance to test the mentioned PR? There is a release on Monday and we would really love to include this fix!

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Sep 4, 2024
@srebhan srebhan added waiting for response waiting for response from contributor and removed help wanted Request for community participation, code, contribution labels Sep 4, 2024
@llamafilm
Copy link
Contributor Author

I haven't updated yet. The crash has not happened again since I last mentioned it a month ago. If the fix is low risk then I would suggest you go ahead and include it in the release. Then I'll upgrade and if it ever happens again I'll reopen this issue.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior size/s 1 day effort, great beginniner issue upstream bug or issues that rely on dependency fixes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants