Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay ARP record closures for devices that have been down for a while #2913

Merged
merged 11 commits into from
May 22, 2024

Conversation

lunkwill42
Copy link
Member

Instead of immediately closing the ARP records read from a netbox as soon as its pping up/down status flaps, we should wait for the netbox to stay down for longer. At this point, we cannot accomplish this usefully as a PostgreSQL rule or trigger, so this suggests removing the old database rule and replacing it with a new argument to the already-existing navclean command. In this way, the expiry interval is even configurable for end users (or can be disabled entirely).

The terminology used in the navclean code was heavily centered around the concept on deletion, so this PR also introduces a renaming of this concept to just the generic "cleaning", since the new cleaner type doesn't actually delete records, it just sets a timestamp in them.

Fixes #2910

We're about to introduce cleaning operations that do not entail actual
record deletion, just record updating.  For that reason, it's best to
change the "deleter" concept in the code to a more generic "cleaner"
concept.
This new command can be used to explicitly close ARP records from
netboxes that have been down for too long.
This rule has unintended and bad side-effects.  It will cause ARP
records to be immediately closed when NAV experience intermittent ICMP
packet losses from a router, specifically the kind that are so
short-lived that NAV never actually declares the device as down through
a new `alerthist` record.  `pping` will, however, use `netbox.up` as its
persistent internal up/down status, which can cause ARP records to be
closed as a result of non-serious flapping.
@lunkwill42 lunkwill42 self-assigned this May 14, 2024
@lunkwill42
Copy link
Member Author

lunkwill42 commented May 14, 2024

Lacks tests, and possibly documentation. These will be added.

Copy link

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 97.67442% with 1 line in your changes missing coverage. Please review.

Project coverage is 60.27%. Comparing base (90ead68) to head (d12fcbc).
Report is 76 commits behind head on 5.10.x.

Files Patch % Lines
python/nav/bin/navclean.py 97.67% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           5.10.x    #2913      +/-   ##
==========================================
+ Coverage   60.21%   60.27%   +0.06%     
==========================================
  Files         601      602       +1     
  Lines       43981    44102     +121     
==========================================
+ Hits        26481    26581     +100     
- Misses      17500    17521      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

github-actions bot commented May 14, 2024

Test results

     12 files       12 suites   11m 48s ⏱️
3 326 tests 3 326 ✔️ 0 💤 0
9 453 runs  9 453 ✔️ 0 💤 0

Results for commit d12fcbc.

♻️ This comment has been updated with latest results.

@lunkwill42 lunkwill42 marked this pull request as ready for review May 15, 2024 07:24
@lunkwill42 lunkwill42 added the bug label May 15, 2024
Copy link
Contributor

@johannaengland johannaengland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this, only one question:
Why did you chose the variable name expiry_type instead of table? I agree that table isn't ideal, but don't understand the new name.

And please have another look at the changelog fragment. I can't make a suggestion for improvement because I do not get what it's supposed to say 😁

@lunkwill42
Copy link
Member Author

I'm happy with this, only one question: Why did you chose the variable name expiry_type instead of table? I agree that table isn't ideal, but don't understand the new name.

Because naming things is hard, and table obviously became useless as a name once it no longer necessarily contains an actual table name. I might have preferred type as a nicely generic name, but I don't like shadowing Python built-ins. Any better suggestions?

And please have another look at the changelog fragment. I can't make a suggestion for improvement because I do not get what it's supposed to say 😁

Gargh... comes down to usage of backticks as markdown markup inside "" quotes in my shell. Some command got executed by the shell, produced a weird error and left that part of the news fragment empty - and I didn't even notice it (too busy trying to figure out why I got the weird error message!) 😃

@lunkwill42
Copy link
Member Author

Test coverage mostly achieved by 410fd80. News fragment fixed and existing docs about the dbclean cron job were updated.

@johannaengland
Copy link
Contributor

I'm happy with this, only one question: Why did you chose the variable name expiry_type instead of table? I agree that table isn't ideal, but don't understand the new name.

Because naming things is hard, and table obviously became useless as a name once it no longer necessarily contains an actual table name. I might have preferred type as a nicely generic name, but I don't like shadowing Python built-ins. Any better suggestions?

Fair, I cannot think of a better name myself. Then it might be nice to have a comment next to the definition in the base class to clarify what it should be used for.

johannaengland
johannaengland previously approved these changes May 16, 2024
Copy link
Contributor

@johannaengland johannaengland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, only my comment about documenting expiry_type a bit better.

Copy link
Contributor

@stveit stveit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya seems fine, definitely was easier to go through commit by commit with all the renaming cluttering things and making it difficult to discern the real changes

python/nav/bin/navclean.py Outdated Show resolved Hide resolved
python/nav/bin/navclean.py Outdated Show resolved Hide resolved
@lunkwill42 lunkwill42 changed the base branch from master to 5.10.x May 16, 2024 14:07
@lunkwill42 lunkwill42 dismissed johannaengland’s stale review May 16, 2024 14:07

The base branch was changed.

Copy link
Contributor

@stveit stveit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now

This adds a default navclean job every five minutes to close ARP
records of devices that have been down for more than 30 minutes.
Needs to be a separate job line, since it doesn't use the built-in
default of 6 month expiry interval.
Use the expiry type name without necessarily referring
to it as a table.  As discussed in code review.
As mentioned in code review, the double downsince name is slightly
confusing, so this renames the intermediate lookup table to something
hopefully more understandable.
@lunkwill42 lunkwill42 force-pushed the bugfix/arp-closure-on-netbox-packet-loss branch from fa115b4 to d12fcbc Compare May 21, 2024 13:21
@lunkwill42 lunkwill42 merged commit 0f81276 into 5.10.x May 22, 2024
19 checks passed
@lunkwill42 lunkwill42 deleted the bugfix/arp-closure-on-netbox-packet-loss branch May 22, 2024 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants