[BUG] ipdevpoll workers freeze with "Error building ASN.1 representation" message, causing Graphite statistics to go blank #2494

Closed
lunkwill42 opened this issue Nov 9, 2022 · 2 comments

lunkwill42 commented Nov 9, 2022

Describe the bug

On installations of NAV 5.5.0 and 5.5.1 that monitor Cisco devices, ipdevpoll
workers tend to freeze up with a strange error message/traceback:

pynetsnmp.netsnmp.SnmpError: get: snmp_send cliberr=0, snmperr=-11, errstring=b"Error building ASN.1 representation (Can't build OID for variable)"

Quickly followed by

builtins.AttributeError: 'NoneType' object has no attribute 'addCallbacks'

The error messages and tracebacks do not contain enough information to identify the underlying problem, but the effect appears to be that the worker process stops collecting at this point. This delays all other collection jobs and thereby disturbs the collection intervals for the 1minstats and 5minstats jobs.

The main symptom for those not watching the logs is that any system and traffic statistics collection appears to stop working.

To Reproduce

The problem appears to be readily reproducible by just running the ipdevpoll 1minstats job against any SNMP-enabled Cisco switch:

E.g.:

$ ipdevpolld -J 1minstats -n cisco-sw.example.org

Expected behavior

No crashing. Collection should continue in the face of most collection issues.

Tracebacks

In ipdevpoll.log, lots of these messages appear:

2022-11-09 13:22:03,025 [2789339] [ERROR zen.pynetsnmp.netsnmp] b'snmp_build: unknown failure\n'
2022-11-09 13:22:03,025 [2789339] [ERROR zen.pynetsnmp.netsnmp] b'g: Error building ASN.1 representation\n'
Unhandled error in Deferred:

Traceback (most recent call last):
  File "/opt/venvs/nav/lib/python3.9/site-packages/pynetsnmp/tableretriever.py", line 30, in v2v3how
    return proxy._getbulk(0, min(maxRepetitions, limit), [oids])
  File "/opt/venvs/nav/lib/python3.9/site-packages/nav/ipdevpoll/snmp/common.py", line 65, in _wrapper
    return func(*args, **kwargs)
  File "/opt/venvs/nav/lib/python3.9/site-packages/nav/ipdevpoll/snmp/common.py", line 134, in _getbulk
    return super(AgentProxyMixIn, self)._getbulk(*args, **kwargs)
  File "/opt/venvs/nav/lib/python3.9/site-packages/pynetsnmp/twistedsnmp.py", line 385, in _getbulk
    return defer.fail(ex)
--- <exception caught here> ---
  File "/opt/venvs/nav/lib/python3.9/site-packages/pynetsnmp/twistedsnmp.py", line 381, in _getbulk
    self.defers[self.session.getbulk(nonrepeaters,
  File "/opt/venvs/nav/lib/python3.9/site-packages/pynetsnmp/netsnmp.py", line 759, in getbulk
    self._handle_send_status(req, send_status, 'get')
  File "/opt/venvs/nav/lib/python3.9/site-packages/pynetsnmp/netsnmp.py", line 717, in _handle_send_status
    raise SnmpError(msg_fmt % msg_args)
pynetsnmp.netsnmp.SnmpError: get: snmp_send cliberr=0, snmperr=-11, errstring=b"Error building ASN.1 representation (Can't build OID for variable)"

Unhandled Error
Traceback (most recent call last):
  File "/opt/venvs/nav/lib/python3.9/site-packages/nav/ipdevpoll/daemon.py", line 458, in start_ipdevpoll
    process.run()
  File "/opt/venvs/nav/lib/python3.9/site-packages/nav/ipdevpoll/daemon.py", line 101, in run
    reactor.run()
  File "/opt/venvs/nav/lib/python3.9/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/opt/venvs/nav/lib/python3.9/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/opt/venvs/nav/lib/python3.9/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/opt/venvs/nav/lib/python3.9/site-packages/nav/mibs/mibretriever.py", line 479, in _schedule_next
    deferred = self.retrieve_column(column)
  File "/opt/venvs/nav/lib/python3.9/site-packages/nav/mibs/mibretriever.py", line 442, in retrieve_column
    deferred.addCallbacks(_result_formatter, _valueerror_handler)
builtins.AttributeError: 'NoneType' object has no attribute 'addCallbacks'

Environment (please complete the following information):

NAV 5.5.1 from source or Debian package.

Additional context

Through judicious use of log level settings, the problem has been tracked down to the statsystem being run to collect memory information from Cisco devices.

The MIB dump used to implement the new support for CISCO-ENHANCED-MEMPOOL-MIB is full of invalid OIDs, which the NET-SNMP library refuses to construct ASN.1 representations of. ipdevpoll in turn does not seem equipped to handle this low-level error.

Example of one of the invalid objects:

"cempMemPoolName" : {
    "nodetype" : "column",
    "moduleName" : "CISCO-ENHANCED-MEMPOOL-MIB",
    "oid" : "0.221.1.1.1.1.3",
    "status" : "current",
    "syntax" : {
        "type" : { "module" :"SNMP-FRAMEWORK-MIB", "name" : "SnmpAdminString"},
    },
    "access" : "readonly",
    "description" :
        """A textual name assigned to the memory pool. This
object is suitable for output to a human operator,
and may also be used to distinguish among the various
pool types.""",
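
For context on why an OID like this cannot be sent on the wire: under the BER rules of X.690, the first two arcs of an object identifier are packed into a single value, 40 * first + second. The first arc must therefore be 0, 1 or 2, and the second arc must be below 40 when the first is 0 or 1. An OID starting with 0.221 breaks that rule, which is most likely what the "Can't build OID for variable" message is about. The following is a minimal sketch of that encoding. `encode_oid` is a hypothetical helper written for illustration, not part of NAV, pynetsnmp or NET-SNMP, and NET-SNMP's own check may differ in detail.

```python
def _base128(value: int) -> bytes:
    """Encode a non-negative integer as base-128, high bit set on all but the last octet."""
    chunks = [value & 0x7F]
    value >>= 7
    while value:
        chunks.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(chunks))


def encode_oid(oid: str) -> bytes:
    """Return the BER content octets for a dotted OID string.

    Hypothetical illustration helper. Raises ValueError for OIDs that
    have no valid BER representation.
    """
    arcs = [int(part) for part in oid.strip(".").split(".")]
    if len(arcs) < 2:
        raise ValueError("an OID needs at least two arcs")
    if any(arc < 0 for arc in arcs):
        raise ValueError("arcs must be non-negative")
    first, second = arcs[0], arcs[1]
    if first > 2:
        raise ValueError(f"first arc must be 0, 1 or 2, got {first}")
    if first < 2 and second > 39:
        raise ValueError(
            f"second arc must be below 40 when the first arc is {first}, got {second}"
        )
    # The first two arcs share one subidentifier; the rest are encoded one by one.
    subidentifiers = [40 * first + second] + arcs[2:]
    return b"".join(_base128(value) for value in subidentifiers)


if __name__ == "__main__":
    try:
        encode_oid("0.221.1.1.1.1.3")  # the OID from the broken dump
    except ValueError as error:
        print("cannot encode:", error)
```

The leading `0.221` suggests that the dump lost the parent prefix of the module: CISCO-ENHANCED-MEMPOOL-MIB is, as far as we can tell, registered as number 221 under Cisco's management subtree, so a correct dump would presumably place this object under 1.3.6.1.4.1.9.9.221.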

lunkwill42 added the bug label Nov 9, 2022
lunkwill42 added this to the 5.5.2 milestone Nov 9, 2022

lunkwill42 commented

This functionality was introduced in #2439, but it is unclear how manual tests of those changes could have been successful.

The problem itself is easily remedied by dumping a new representation of CISCO-ENHANCED-MEMPOOL-MIB using the smidump command line program, but in addition to that hotfix, we really need:

  1. Automated tests to find these types of problems
  2. Improved error handling in ipdevpoll so that this type of error doesn't take down the entire collector process.
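
As a rough idea of what the first point could look like, the sketch below walks the nodes of a dumped MIB module and reports every OID that cannot be BER-encoded. It assumes the dump is available as a dict with a "nodes" mapping whose entries carry an "oid" string, as in the fragment quoted in this issue; `is_encodable`, `find_unencodable_oids` and `SAMPLE_MIB` are hypothetical names made up for this example, not existing NAV code.

```python
def is_encodable(oid: str) -> bool:
    """Return True if the dotted OID string obeys the BER rule for the first two arcs."""
    parts = oid.strip(".").split(".")
    if len(parts) < 2 or not all(part.isdigit() for part in parts):
        return False
    first, second = int(parts[0]), int(parts[1])
    # First arc is 0, 1 or 2; the second arc is limited to 0..39 unless the first is 2.
    return first <= 2 and (first == 2 or second < 40)


def find_unencodable_oids(mib: dict) -> dict:
    """Map node name to OID for every node whose OID cannot be encoded."""
    bad = {}
    for name, node in mib.get("nodes", {}).items():
        oid = node.get("oid", "")
        if not is_encodable(oid):
            bad[name] = oid
    return bad


# Made-up sample data: one node copied from the broken dump, one known-good OID.
SAMPLE_MIB = {
    "nodes": {
        "cempMemPoolName": {"nodetype": "column", "oid": "0.221.1.1.1.1.3"},
        "sysDescr": {"nodetype": "scalar", "oid": "1.3.6.1.2.1.1.1"},
    }
}

if __name__ == "__main__":
    for name, oid in find_unencodable_oids(SAMPLE_MIB).items():
        print(f"{name}: invalid OID {oid}")
```

A test suite could run such a check over every dumped MIB module and fail when the result is non-empty, which would have caught the broken dump before release.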

lunkwill42 added a commit to lunkwill42/nav that referenced this issue Nov 9, 2022
Not sure what went wrong with the first smidump, since it was full of
broken OIDs.

Fixes Uninett#2494

lunkwill42 commented

Closed by #2495
