Add support for Juniper CHASSIS and SYSTEM alerts #2388

Merged: johannaengland merged 8 commits into Uninett:master from juniper-chassis-alarms on Jun 1, 2023

Conversation

Contributor

@hmpf hmpf commented Apr 6, 2022

Closes #2358

@codecov

codecov bot commented Apr 6, 2022

Codecov Report

Merging #2388 (204d406) into master (33b5913) will increase coverage by 0.08%.
The diff coverage is 100.00%.

❗ Current head 204d406 differs from pull request most recent head dcff55b. Consider uploading reports for the commit dcff55b to get more accurate results

@@            Coverage Diff             @@
##           master    #2388      +/-   ##
==========================================
+ Coverage   54.52%   54.60%   +0.08%     
==========================================
  Files         558      560       +2     
  Lines       40644    40709      +65     
==========================================
+ Hits        22160    22231      +71     
+ Misses      18484    18478       -6     
Impacted Files Coverage Δ
python/nav/ipdevpoll/plugins/juniperalarm.py 100.00% <100.00%> (ø)
python/nav/mibs/juniper_alarm_mib.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

@github-actions

github-actions bot commented Apr 6, 2022

Test results

     12 files       12 suites   11m 21s ⏱️
3 256 tests 3 160 ✔️   96 💤 0
9 243 runs  8 955 ✔️ 288 💤 0

Results for commit dcff55b.

♻️ This comment has been updated with latest results.

@hmpf hmpf requested a review from lunkwill42 April 7, 2022 10:38
@hmpf
Contributor Author

hmpf commented Apr 7, 2022

Currently, if a netbox has a non-zero count of red or yellow alarms, a start-event is sent; if the count is zero, an end-event is sent. There is no check for whether a state is already open (there should be), and no check for whether the specific netbox supports the MIB in question.
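
The stateless rule described above can be sketched in a few lines. This is only an illustration: `decide_event` and `events_for_netbox` are hypothetical names, not functions from the plugin, and the sketch deliberately leaves out the missing checks mentioned in the comment.

```python
from typing import Dict


def decide_event(alarm_count: int) -> str:
    """Return "start" for a non-zero alarm count and "end" for a zero count."""
    return "start" if alarm_count > 0 else "end"


def events_for_netbox(red_count: int, yellow_count: int) -> Dict[str, str]:
    """Decide one event per alarm colour, without looking at any existing state."""
    return {
        "red": decide_event(red_count),
        "yellow": decide_event(yellow_count),
    }


if __name__ == "__main__":
    # A netbox with two red alarms and no yellow alarms
    print(events_for_netbox(red_count=2, yellow_count=0))
```

Because nothing here consults existing state, every poll of a healthy netbox produces end-events, which is the behaviour discussed later in this thread.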

Also, tests needed.

@hmpf hmpf force-pushed the juniper-chassis-alarms branch from e37ed9d to ecccb6d Compare April 8, 2022 10:32
@sonarqubecloud

sonarqubecloud bot commented Apr 8, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@hmpf
Contributor Author

hmpf commented Apr 8, 2022

The actual count could possibly be stored together with the event with the help of EventQueueVar. Any good examples where this is done?

Member

@lunkwill42 lunkwill42 left a comment

This is about as simple as it gets, and I like the stateless approach to posting events (although there will be a lot of logging from ipdevpoll and eventengine, eventengine will by-design ignore the end-events that appear without a corresponding start-event having been posted first).

A few minor inline comments, and of course, the bigger issue:

  • A SQL change script is required to actually add the new event- and alert-types to the database (once we know what to call them)

python/nav/mibs/juniper_alarm_mib.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
@hmpf
Contributor Author

hmpf commented Jun 16, 2022

> eventengine will by-design ignore the end-events that appear without a corresponding start-event having been posted first

Cool, how convenient!

Member

@lunkwill42 lunkwill42 left a comment

Right, now on to round two (and it's not soggy-brain late afternoon this time)...

python/nav/mibs/juniper_alarm_mib.py (resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
@lunkwill42
Member

> The actual count could possibly be stored together with the event with the help of EventQueueVar. Any good examples where this is done?

Whether this example is good could be debatable, but here is one instance of setting arbitrary event variables through the "varmap" (line 160 should be highlighted):

    def _make_bgpstate_event(self, start=True, is_adminstatus=False):
        model = self.get_existing_model()
        peername = self._get_peer_name() or str(model.peer)
        peerid = self._get_peer_id()
        varmap = {
            'peer': str(model.peer),
            'peername': peername,
            'state': self.state,
            'adminstatus': self.adminstatus,
        }
        event = EVENT.start if start else EVENT.end
        if start and is_adminstatus:
            event = partial(event, alert_type='bgpAdmDown')
        event = event(netbox=self.netbox.id, subid=model.id, varmap=varmap)
        proto = self._protocol_map.get(self.protocol, None)
        self._logger.info(
            "dispatching event (%s) for %s %s state change" " from %s to %s",
            event.varmap['alerttype'],
            proto,
            peerid,
            model.state,
            self.state,
        )
        event.save()

There are two issues that would make it difficult to come to an ideal solution:

  1. EventQueueVars aren't automatically carried over to the corresponding alerthist entries that eventengine generates (though I think they are copied into the alert queue - however, alert queue entries represent notifications and are removed once notifications are sent).

     Usually, if you want to carry arbitrary variables over to the permanent record of alerthist/AlertHistory, you need to write an event handler plugin that does so explicitly. Currently, I think perhaps the only plugin that does so is the event plugin for maintenance events, which does it here:

         alert.history_vars = dict(alert)

  2. Secondly, a state is a state in NAV; there isn't really a mechanism to add more events or information to an existing alerthist state. So, if the alert count changes over time (but remains non-zero), there isn't really an effective way to update an existing "juniper red alert non-zero" state. The new event will just be treated as "oh, here's a duplicate start-event, I'll throw it away". You might, however, be able to add some magic by implementing an eventengine plugin for your new event type.

So, presently, you can generate an alert when the "red count" transitions from 0 to 1, and this alert can say "there's 1 red alert". However, when the counter subsequently transitions from 1 to 2, there is no way to notify the NAV user that "there are now 2 red alerts". Again, this analysis is from memory. Unless it is already possible, we could jig eventengine to be able to override handling of "duplicate" events in a custom plugin.
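
To make the state semantics concrete, here is a toy model of the behaviour described in this thread: a duplicate start-event is discarded, and an end-event without a matching start-event is ignored. `AlarmStateTracker` is a hypothetical class written for illustration only; it is not eventengine code and does not use any NAV API.

```python
class AlarmStateTracker:
    """Toy model of stateful event handling (an assumption, NOT eventengine's real code)."""

    def __init__(self):
        # One open state per (netbox id, alarm colour) pair
        self._open_states = set()

    def post(self, netbox_id, alarm_color, event_kind):
        """Apply a "start" or "end" event and report what happened to it."""
        key = (netbox_id, alarm_color)
        if event_kind == "start":
            if key in self._open_states:
                # The state is already open, so a changed count never gets through
                return "ignored duplicate start"
            self._open_states.add(key)
            return "opened"
        if event_kind == "end":
            if key in self._open_states:
                self._open_states.remove(key)
                return "closed"
            return "ignored end without start"
        raise ValueError("event_kind must be 'start' or 'end'")


if __name__ == "__main__":
    tracker = AlarmStateTracker()
    for kind in ("end", "start", "start", "end"):
        print(kind, "->", tracker.post(1, "red", kind))
```

In this model a count going from 1 to 2 arrives as a second start-event and is dropped, which is exactly the limitation described above.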

@lunkwill42
Member

lunkwill42 commented Jun 17, 2022

> So, presently, you can generate an alert when the "red count" transitions from 0 to 1, and this alert can say "there's 1 red alert". However, when the counter subsequently transitions from 1 to 2, there is no way to notify the NAV user that "there are now 2 red alerts". Again, this analysis is from memory. Unless it is already possible, we could jig eventengine to be able to override handling of "duplicate" events in a custom plugin.

The maintenanceState plugin already suggests that we could work around the "duplicate" handling:

    if alert.is_event_duplicate():
        self._logger.info('Ignoring duplicate event')
    else:
        alert.post()

This means that we could potentially detect a change in the red/yellow alert count, update the existing alert history state and send an extra notification. However, there still is no good way to maintain a history/log of the changing red/yellow alert count over time. Maybe storing a current value and a maximum value as alerthistvars? It might be time for a fuller design discussion with the CNaaS team who wanted this feature :)
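
The "current value plus maximum value" idea could look roughly like the sketch below. It is a plain-dictionary illustration under stated assumptions: `update_alarm_vars` and the keys `current_count` and `max_count` are hypothetical names, and nothing here touches NAV's actual alert history models.

```python
from typing import Dict, Tuple


def update_alarm_vars(
    history_vars: Dict[str, int], new_count: int
) -> Tuple[Dict[str, int], bool]:
    """Return updated variables and whether the count changed since the last poll.

    The input dictionary is left untouched; a changed count is the signal
    that an extra notification might be warranted.
    """
    previous = history_vars.get("current_count", 0)
    updated = dict(history_vars)
    updated["current_count"] = new_count
    updated["max_count"] = max(history_vars.get("max_count", 0), new_count)
    return updated, new_count != previous


if __name__ == "__main__":
    state = {"current_count": 1, "max_count": 1}
    state, changed = update_alarm_vars(state, 2)
    print(state, changed)
```

This keeps only two numbers per state, so it records the peak but not when each change happened; a full history would need a different storage design.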

@lunkwill42
Member

> It might be time for a fuller design discussion with the CNaaS team who wanted this feature :)

So I did have a short discussion with @knutvi on this. I'm adding our conclusion to the original issue #2358.

Member

@lunkwill42 lunkwill42 left a comment

Looks better, but one important piece is missing:

I would argue that this plugin belongs in the statuscheck ipdevpoll job, which would ensure it runs every 5 minutes. The example ipdevpoll.conf should be updated accordingly.

I'm also a bit worried that eventengine might go berserk with logging: If you have 100 Juniper devices with no alarms, it might log 100 messages every five minutes about rejecting an end-event from this plugin for having no corresponding start event. This should be verified...

tests/integration/ipdevpoll/plugins/juniper_alarm_test.py (outdated, resolved)
tests/integration/ipdevpoll/plugins/juniper_alarm_test.py (outdated, resolved)
tests/integration/ipdevpoll/plugins/juniper_alarm_test.py (outdated, resolved)
tests/integration/ipdevpoll/plugins/juniper_alarm_test.py (outdated, resolved)
@johannaengland
Contributor

I can confirm that every five minutes we get two logging messages about ignoring an end event for each netbox.
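
For a rough sense of scale, the log volume can be estimated with simple arithmetic. The sketch assumes the figures mentioned in this thread (two messages per netbox per five-minute run) and assumes every netbox is alarm-free; `ignored_end_event_messages` is a hypothetical helper, not part of NAV.

```python
def ignored_end_event_messages(
    netboxes: int,
    messages_per_netbox: int = 2,
    interval_minutes: int = 5,
    period_minutes: int = 24 * 60,
) -> int:
    """Estimate how many "ignored end event" log messages appear in a period.

    Assumes every netbox is alarm-free, so each poll logs one message per
    alarm colour (two per netbox, as observed above).
    """
    polls = period_minutes // interval_minutes
    return netboxes * messages_per_netbox * polls


if __name__ == "__main__":
    print("per poll cycle:", ignored_end_event_messages(100, period_minutes=5))
    print("per day:", ignored_end_event_messages(100))
```

With 100 alarm-free Juniper devices this works out to 200 messages per cycle and 57,600 per day under those assumptions.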

Member

@lunkwill42 lunkwill42 left a comment

Excellent, but a few questions about the logical flow of this remain for me :)

python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
python/nav/ipdevpoll/plugins/juniperalarm.py (outdated, resolved)
Member

@lunkwill42 lunkwill42 left a comment

Very nearly there, but still a couple of issues with error handling and non-Juniper devices.

The new plugin runs for all devices, regardless of vendor. You cannot expect to get an alarm count for non-Juniper devices. You should also not expect to always get a number from Juniper devices.

Running the new plugin on a test installation that contained an HP switch, I immediately got it to crash with this traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
--- <exception caught here> ---
  File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.9/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/source/python/nav/ipdevpoll/plugins/juniperalarm.py", line 65, in handle
    current_yellow_count = yield mib.get_yellow_alarm_count()
  File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/source/python/nav/mibs/juniper_alarm_mib.py", line 41, in _get_alarm_count
    count = int(count) or 0
builtins.TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

I suggest:

  1. The code needs to handle result values of None, which typically mean that no alarm count could be fetched from this device.
  2. You should consider adding a vendor id restriction to the plugin. I can only find this single example atm:
    RESTRICT_TO_VENDORS = VENDOR_MIBS.keys()
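
The first suggestion can be illustrated with a small conversion helper. The traceback shows `int(count) or 0` raising `TypeError` because `int(None)` fails before the `or 0` fallback is ever evaluated. The sketch below is one possible way to handle that, not the fix that was actually merged; `parse_alarm_count` is a hypothetical name.

```python
from typing import Optional, Union


def parse_alarm_count(raw: Optional[Union[int, str, bytes]]) -> Optional[int]:
    """Convert a raw SNMP result to an int, or None if no count is available.

    None is kept distinct from 0: None means "this device gave no answer",
    while 0 means "the device reports no alarms".
    """
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Unexpected value type or non-numeric content
        return None


if __name__ == "__main__":
    for value in (None, 0, "3", "n/a"):
        print(repr(value), "->", parse_alarm_count(value))
```

A caller could then skip posting any event when the result is `None`, so that devices without the MIB neither crash the plugin nor generate spurious end-events.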

python/nav/models/sql/changes/sc.05.06.0001.sql (outdated, resolved)
@johannaengland johannaengland requested a review from lunkwill42 May 30, 2023 12:39
Member

@lunkwill42 lunkwill42 left a comment

Works for me now (after lots of fiddling with non-related issues in my dev environments).

However, as I got the alerts, I noticed they all use the term "netbox". This is something we generally don't do in alert messages. The "netbox" term is at best an internal term in NAV, while the outward-facing term is usually "IP Device". However, we generally only refer to the sysname of a netbox in alert messages (see all the other alert message templates for examples).

@johannaengland johannaengland requested a review from lunkwill42 May 31, 2023 13:55
Member

@lunkwill42 lunkwill42 left a comment

bah, didn't catch all the netboxes on the first try, apparently...

@johannaengland johannaengland requested a review from lunkwill42 June 1, 2023 07:14
Member

@lunkwill42 lunkwill42 left a comment

Yes! 🎉

@johannaengland johannaengland force-pushed the juniper-chassis-alarms branch from 204d406 to dcff55b Compare June 1, 2023 10:42
@johannaengland johannaengland merged commit 2d2ef70 into Uninett:master Jun 1, 2023
@hmpf hmpf deleted the juniper-chassis-alarms branch November 21, 2023 07:57