
SNMP plugin timeout when the response SRC IP is different than original request (HA Virtual IPs) #3320

Closed
derekmwright opened this issue Oct 10, 2017 · 18 comments
Labels
area/snmp bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes

Comments

@derekmwright

Bug report

Relevant telegraf.conf:

[[inputs.snmp]]
  agents = [ "remote-hostname" ]
  community = "public"

  name = "system"
  [[inputs.snmp.field]]
    name = "hostname"
    oid = "1.3.6.1.2.1.1.5.0"
    is_tag = true

System info:

Ran from Docker Hub (library)
Telegraf v1.4.1 (git: release-1.4 2de7aa2)

Steps to reproduce:

  1. Ensure remote-hostname is listening on UDP port 161 for SNMP GET requests
  2. Run the container with the SNMP config mounted: docker run -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf
  3. Receive error: 2017-10-10T17:18:31Z E! Error in plugin [inputs.snmp]: agent remote-hostname: performing get on field hostname: Request timeout (after 3 retries)

Expected behavior:

Telegraf should be able to extract the response from the remote host when using OID numbers.

Actual behavior:

Telegraf doesn't see the response from the remote host as valid and then retries the query.

Additional info:

13:18:30.287374 IP (tos 0x0, ttl 63, id 42471, offset 0, flags [DF], proto UDP (17), length 71)
    1.2.3.4.36803 > 1.2.3.5.161:  { SNMPv2c { GetRequest(28) R=-1297462436  .1.3.6.1.2.1.1.5.0 } }
        0x0000:  4500 0047 a5e7 4000 3f11 3882 0a50 0be1  E..G..@.?.8..P..
        0x0010:  0a02 3d0a 8fc3 00a1 0033 5d81 3029 0201  ..=......3].0)..
        0x0020:  0104 0670 7562 6c69 63a0 1c02 04b2 aa4b  ...public......K
        0x0030:  5c02 0100 0201 0030 0e30 0c06 082b 0601  \......0.0...+..
        0x0040:  0201 0105 0005 00                        .......
13:18:30.288262 IP (tos 0x0, ttl 62, id 63659, offset 0, flags [none], proto UDP (17), length 83)
    1.2.3.5.161 > 1.2.3.4.36803:  { SNMPv2c { GetResponse(40) R=-1297462436  .1.3.6.1.2.1.1.5.0="remote-hostname" } }
        0x0000:  4500 0053 f8ab 0000 3e11 26b1 0a02 3d0b  E..S....>.&...=.
        0x0010:  0a50 0be1 00a1 8fc3 003f bb81 3035 0201  .P.......?..05..
        0x0020:  0104 0670 7562 6c69 63a2 2802 04b2 aa4b  ...public.(....K
        0x0030:  5c02 0100 0201 0030 1a30 1806 082b 0601  \......0.0...+..
        0x0040:  .... .... .... .... .... .... .... ....  .......REMOTE-HOSTNAME
        0x0050:  312d 41          

(I removed the "actual" hostname from this capture)
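The decoded summary lines above show anonymized addresses, but the first 20 bytes of each hex dump are the raw IPv4 header and appear to be unedited. A minimal stdlib-only Go sketch (assuming those hex rows were pasted as captured) that pulls the source and destination addresses out of those bytes shows that the reply did not come from the address the request was sent to:

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"strings"
)

// ipv4Addrs decodes the source and destination addresses from the fixed
// 20-byte IPv4 header at the start of a tcpdump-style hex dump.
func ipv4Addrs(dump string) (src, dst net.IP, err error) {
	raw, err := hex.DecodeString(strings.Join(strings.Fields(dump), ""))
	if err != nil {
		return nil, nil, err
	}
	if len(raw) < 20 || raw[0]>>4 != 4 {
		return nil, nil, errors.New("not an IPv4 header")
	}
	// Bytes 12-15 hold the source address, bytes 16-19 the destination.
	return net.IP(raw[12:16]), net.IP(raw[16:20]), nil
}

func main() {
	// First 20 bytes of each packet in the capture above.
	request := "4500 0047 a5e7 4000 3f11 3882 0a50 0be1 0a02 3d0a"
	reply := "4500 0053 f8ab 0000 3e11 26b1 0a02 3d0b 0a50 0be1"

	_, reqDst, err := ipv4Addrs(request)
	if err != nil {
		panic(err)
	}
	replySrc, _, err := ipv4Addrs(reply)
	if err != nil {
		panic(err)
	}
	fmt.Println("request sent to:", reqDst)   // 10.2.61.10
	fmt.Println("reply came from:", replySrc) // 10.2.61.11
	fmt.Println("same address?", reqDst.Equal(replySrc)) // false
}
```

The last octet differs (.10 versus .11), which fits the cluster-address explanation reached later in the thread.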

@derekmwright derekmwright changed the title SNMP, SNMP can't collect data when using OID Number Oct 10, 2017
@derekmwright
Author

Also, it might be worth noting that adding the leading . doesn't matter:
oid = "1.3.6.1.2.1.1.5.0" or oid = ".1.3.6.1.2.1.1.5.0" both give the same result.

@danielnelson
Contributor

Can you run snmpget -v2c -c public remote-hostname:161 1.3.6.1.2.1.1.5.0

@derekmwright
Author

snmpget -v2c -c public remote-hostname:161 1.3.6.1.2.1.1.5.0
returns
SNMPv2-MIB::sysName.0 = STRING: REMOTE-HOSTNAME

I'm really perplexed: all the SNMP utilities work (I had to install them locally, as they aren't present in the Docker container). I tested with a Ruby script and I can get data from the remote host, but I can't get Telegraf to get data out of it. I will see if I can get Go installed on this system and try to determine whether the issue is in the gosnmp library or in the Telegraf code.

@danielnelson
Contributor

That would help a ton, thanks

@derekmwright
Copy link
Author

OK, found the issue. These were run against a Cisco UCS Fabric Interconnect, which has a cluster IP address that the primary switch handles traffic for. If I use that cluster address in Telegraf, it doesn't work. No other tool minds it (I think it's actually a gosnmp implementation issue). When I change my config to point directly to a specific fabric interconnect, SNMP starts working. Closing!
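One plausible mechanism, offered as an assumption rather than something confirmed in this thread: if the client opens its UDP socket with Go's net.Dial, the socket is "connected", and the operating system then delivers only datagrams whose source address and port match the dialed peer. A reply sent from the other interconnect is silently dropped and the client times out, while a client reading from an unconnected socket accepts it. The stdlib-only Go sketch below reproduces this on loopback. Because a second loopback IP is not available on every OS, it uses a second port as a stand-in for the second address (the connected-socket filter checks both). The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// startAgent simulates an HA agent: it receives requests on one socket
// (standing in for the virtual IP) but sends every reply from a second
// socket (standing in for the node that actually answers).
func startAgent() (string, func(), error) {
	in, err := net.ListenPacket("udp4", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	out, err := net.ListenPacket("udp4", "127.0.0.1:0")
	if err != nil {
		in.Close()
		return "", nil, err
	}
	go func() {
		buf := make([]byte, 1500)
		for {
			n, from, err := in.ReadFrom(buf)
			if err != nil {
				return // socket closed by stop
			}
			out.WriteTo(buf[:n], from) // echo back from a different source port
		}
	}()
	stop := func() {
		in.Close()
		out.Close()
	}
	return in.LocalAddr().String(), stop, nil
}

// queryConnected uses net.Dial, which connects the UDP socket; the kernel
// then only delivers datagrams whose source matches the dialed address.
func queryConnected(agent string) error {
	conn, err := net.Dial("udp4", agent)
	if err != nil {
		return err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte("ping")); err != nil {
		return err
	}
	conn.SetReadDeadline(time.Now().Add(300 * time.Millisecond))
	_, err = conn.Read(make([]byte, 1500))
	return err
}

// queryUnconnected sends with WriteTo on an unconnected socket and accepts
// a reply from any source address.
func queryUnconnected(agent string) (net.Addr, error) {
	dst, err := net.ResolveUDPAddr("udp4", agent)
	if err != nil {
		return nil, err
	}
	conn, err := net.ListenPacket("udp4", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if _, err := conn.WriteTo([]byte("ping"), dst); err != nil {
		return nil, err
	}
	conn.SetReadDeadline(time.Now().Add(300 * time.Millisecond))
	_, from, err := conn.ReadFrom(make([]byte, 1500))
	return from, err
}

func main() {
	agent, stop, err := startAgent()
	if err != nil {
		panic(err)
	}
	defer stop()

	// Expect a timeout error here: the reply arrives from the "wrong" port.
	fmt.Println("connected socket:", queryConnected(agent))
	// Expect a reply from the agent's second socket and a nil error.
	from, err := queryUnconnected(agent)
	fmt.Println("unconnected socket:", from, err)
}
```

This only illustrates the socket-level behavior; it does not show what gosnmp or Telegraf actually did at the time.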

@danielnelson
Copy link
Contributor

Could you open an issue with https://github.com/soniah/gosnmp

@derekmwright
Copy link
Author

Will do. I'm going to put together a failing example, open an issue there, and reference it here.

@danielnelson
Contributor

Sounds good, I think I will reopen this issue for tracking.

@danielnelson danielnelson reopened this Oct 10, 2017
@danielnelson danielnelson added area/snmp bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes labels Oct 10, 2017
@derekmwright derekmwright changed the title from "SNMP can't collect data when using OID Number" to "SNMP plugin timeout when the response SRC IP is different than original request (HA Virtual IPs)" Oct 11, 2017
@derekmwright
Author

Updated title to more accurately reflect the issue.

@derekmwright
Author

@danielnelson it looks like this was reported back in 2015 and has stalled. It seems Juniper routers can have the same problem. Perhaps the Telegraf team could help get a resolution prioritized?

@abaluta

abaluta commented Feb 14, 2018

@derekmwright, I use a patch (gosnmp/gosnmp@3fb2b90).
It has worked perfectly for the last year with my Junipers.

@danielnelson
Contributor

@abaluta Glad you have this working, but if you could create the packet captures requested in gosnmp/gosnmp#47 it would be very appreciated. Here are the gosnmp docs on how to do this: https://github.com/soniah/gosnmp#packet-captures
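The contents of gosnmp/gosnmp@3fb2b90 are not shown here, so the following is general background rather than a description of that patch. A client that accepts replies from any source address needs another way to pair a reply with its request, and SNMP already carries one: the request-id, which the capture above shows as R=-1297462436 on both packets. This minimal stdlib-only Go sketch walks just enough BER to read that field from the two captured UDP payloads. It assumes SNMPv1/v2c framing and short-form lengths only, and requestID is a hypothetical helper, not a gosnmp function.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// requestID extracts the request-id from an SNMPv1/v2c message:
// SEQUENCE { version INTEGER, community OCTET STRING, PDU { request-id INTEGER, ... } }
func requestID(msg []byte) (int32, error) {
	pos := 0
	// header reads one tag/length pair at pos, checks the tag, and advances
	// pos to the start of the contents. Long-form lengths are not handled.
	header := func(what string, tagOK func(byte) bool) (int, error) {
		if pos+2 > len(msg) {
			return 0, fmt.Errorf("truncated before %s", what)
		}
		tag, l := msg[pos], int(msg[pos+1])
		if !tagOK(tag) {
			return 0, fmt.Errorf("expected %s, got tag %#x", what, tag)
		}
		if l&0x80 != 0 {
			return 0, fmt.Errorf("%s: long-form length not handled", what)
		}
		pos += 2
		return l, nil
	}
	is := func(want byte) func(byte) bool {
		return func(t byte) bool { return t == want }
	}

	if _, err := header("SEQUENCE", is(0x30)); err != nil {
		return 0, err
	}
	for _, tag := range []byte{0x02, 0x04} { // version INTEGER, community OCTET STRING
		l, err := header("message field", is(tag))
		if err != nil {
			return 0, err
		}
		pos += l // skip the field's contents
	}
	// PDU tags are context-specific constructed: 0xa0 GetRequest, 0xa2 Response, ...
	if _, err := header("PDU", func(t byte) bool { return t&0xe0 == 0xa0 }); err != nil {
		return 0, err
	}
	l, err := header("request-id", is(0x02))
	if err != nil {
		return 0, err
	}
	if l < 1 || l > 4 || pos+l > len(msg) {
		return 0, errors.New("request-id length out of range")
	}
	// Big-endian two's complement: sign-extend the first byte, then shift in the rest.
	v := int32(int8(msg[pos]))
	for _, b := range msg[pos+1 : pos+l] {
		v = v<<8 | int32(b)
	}
	return v, nil
}

func main() {
	// SNMP messages (UDP payloads) copied from the capture. The reply is cut
	// off where the hostname was redacted; only its prefix is needed here.
	packets := [][2]string{
		{"GetRequest", "3029 0201 0104 0670 7562 6c69 63a0 1c02 04b2 aa4b 5c02 0100 0201 0030 0e30 0c06 082b 0601 0201 0105 0005 00"},
		{"GetResponse", "3035 0201 0104 0670 7562 6c69 63a2 2802 04b2 aa4b 5c02 0100 0201 0030 1a30 1806 082b 0601"},
	}
	for _, p := range packets {
		raw, err := hex.DecodeString(strings.ReplaceAll(p[1], " ", ""))
		if err != nil {
			panic(err)
		}
		id, err := requestID(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(p[0], "request-id:", id) // -1297462436 for both
	}
}
```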

@khord
Contributor

khord commented Nov 28, 2018

How do you apply that gosnmp patch? I'm not sure where those two files would be.

@abaluta

abaluta commented Nov 28, 2018 via email

@dhoppe
Contributor

dhoppe commented Apr 4, 2019

Thank you very much for the detailed description.

I had the same problem today and solved it by configuring the IP address of the devices instead of the cluster IP address.

However, I think the plugin should give a detailed log message in such a case. A timeout has nothing to do with a changing IP address.

@glinton
Contributor

glinton commented Apr 4, 2019

Quoting @dhoppe: "A timeout has nothing to do with a changing IP address."

Though it kind of does. Correct me if I'm wrong: when using direct server return, if an endpoint fails over mid-request, any client trying to connect to it would get a timeout because it couldn't reach the original destination.

@Hipska
Contributor

Hipska commented Mar 2, 2021

@derekmwright @abaluta does this issue still occur on latest versions of gosnmp and Telegraf?

@Hipska
Contributor

Hipska commented May 7, 2021

Since gosnmp library has been updated in telegraf, I'm assuming this issue is resolved. Please reopen if it isn't.

@Hipska Hipska closed this as completed May 7, 2021

7 participants