-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Juniper CHASSIS and SYSTEM alerts #2388
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2388 +/- ##
==========================================
+ Coverage 54.52% 54.60% +0.08%
==========================================
Files 558 560 +2
Lines 40644 40709 +65
==========================================
+ Hits 22160 22231 +71
+ Misses 18484 18478 -6
... and 2 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Currently, if a netbox has a non-zero count of red or yellow alarms, a start-event is sent. If there is a zero-count an end-event is sent. There is no checking whether a state is already open and there should be, and there is no checking of whether the specific netbox has the mib in question. Also, tests needed. |
e37ed9d
to
ecccb6d
Compare
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
The actual count could possibly be stored together the event with the help of EventQueueVar. Any good examples where this is done? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about as simple as it gets, and I like the stateless approach to posting events (although there will be a lot of logging from ipdevpoll and eventengine, eventengine will by-design ignore the end-events that appear without a corresponding start-event having been posted first).
A few minor inline comments, and of course, the bigger issue:
- A SQL change script is required to actually add the new event- and alert-types to the database (once we know what to call them)
Cool, how convenient! |
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, now on to round two (and it's not soggy-brain late afternoon this time)...
Whether this example is good could be debatable, but here is once instance of setting arbitrary event variables through the "varmap" (line 160 should be highlighted): nav/python/nav/ipdevpoll/shadows/gwpeers.py Lines 147 to 171 in e6634e5
There are two issues that would make it difficult to come to an ideal solution:
Usually, if you want to carry arbitrary variables over to the permanent record of alerthist/AlertHistory, you need to write an event handler plugin that does so explicitly. Currently, I think perhaps the only plugin that does so is the event plugin for maintenance events. which does it here:
So, presently, you can generate an alert when the "red count" transitions from 0 to 1, and this alert can say "there's 1 red alert". However, when the counter subsequently transitions from 1 to 2, there is no way to notify the NAV user that "there are now 2 red alerts". Again, this is analysis is from memory. Unless it is already possible, we could jig event engine to be able to override handling of "duplicate" events in a custom plugin. |
The maintenanceState plugin already suggests that we could work around the "duplicate" handling: nav/python/nav/eventengine/plugins/maintenancestate.py Lines 43 to 46 in e6634e5
This means that we could potentially detect a change in the red/green alert count, update the existing alert history state and send an extra notification. However, there still is no good way to maintain a history/log of the changing red/green alert count over time. Maybe storing a current value and a maximum value as alerthistvars? It might be time for a fuller design discussion with the CNaaS team who wanted this feature :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks better, but one important piece is missing:
I would argue that this plugin belongs in the statuscheck
ipdevpoll job, which would ensure it runs every 5 minutes. The example ipdevpoll.conf
should be updated accordingly.
I'm also a bit worried that eventengine might go berserk with logging: If you have 100 Juniper devices with no alarms, it might log 100 messages every five minutes about rejecting an end-event from this plugin for having no corresponding start event. This should be verified...
I can confirm that every five minutes we get two logging messages about ignoring an end event for each netbox. |
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent, but a few questions about the logical flow of this remain for me :)
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nearly there, but still a couple of issues with error handling and non-Juniper devices.
The new plugin runs for all devices, regardless of vendor. You cannot expect to get an alarm count for non-Juniper devices. You should also not expect to always get a number from Juniper devices.
Running the new plugin on a test installation that contained an HP switch, I immediately got it to crash with this traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 501, in errback
self._startRunCallbacks(fail)
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
self._runCallbacks()
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
_inlineCallbacks(r, g, status)
--- <exception caught here> ---
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python3.9/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/source/python/nav/ipdevpoll/plugins/juniperalarm.py", line 65, in handle
current_yellow_count = yield mib.get_yellow_alarm_count()
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/source/python/nav/mibs/juniper_alarm_mib.py", line 41, in _get_alarm_count
count = int(count) or 0
builtins.TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
I suggest:
- The code needs to handle result values of
None
, which typically mean that no alarm count could be fetched from this device. - You should consider adding a vendor id restriction to the plugin. I can only find this single example atm:
RESTRICT_TO_VENDORS = VENDOR_MIBS.keys()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me now (after lots of fiddling with non-related issues in my dev environments).
However, as I got the alerts, I noticed they all use the term "netbox". This is something we generally don't do in alert messages. The "netbox" term is at best an internal term in NAV, while the outward facing term is usually "IP Device". However, we generally only refer to the sysname of a netbox in alert messages (see all the other alert message templates for examples)
python/nav/etc/alertmsg/juniperRedAlarmState/juniperRedAlarmOff-email.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperRedAlarmState/juniperRedAlarmOff-sms.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperRedAlarmState/juniperRedAlarmOn-email.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperRedAlarmState/juniperRedAlarmOn-sms.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperYellowAlarmState/juniperYellowAlarmOff-email.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperYellowAlarmState/juniperYellowAlarmOn-email.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperYellowAlarmState/juniperYellowAlarmOff-sms.txt
Outdated
Show resolved
Hide resolved
python/nav/etc/alertmsg/juniperYellowAlarmState/juniperYellowAlarmOn-sms.txt
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bah, didn't catch all the netboxes on the first try, apparently...
python/nav/etc/alertmsg/juniperRedAlarmState/juniperRedAlarmOff-email.txt
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! 🎉
204d406
to
dcff55b
Compare
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
Closes #2358