[BUG] All ARP records associated with a router are immediately closed when there is a short ICMP echo packet loss #2910
Comments
Although the current behavior is undesirable, it is likely still desirable for NAV to close ARP records after a netbox has been down for a while, since we can no longer verify those records by polling the netbox. The question is: When is it acceptable to close ARP records associated with a "dead" netbox? Some suggestions:
I would have to say that I'm leaning towards the latter solution, with some default value provided by NAV. ARP collection runs every 30 minutes by default, so a sensible default could be to close ARP records for devices that have been down for longer than this.
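A minimal sketch of the threshold idea, assuming a hypothetical helper should_close_arp_records and a hypothetical down_since timestamp (neither name comes from NAV). The 30-minute default mirrors the ARP collection interval mentioned above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed default: matches the 30-minute ARP collection interval.
DEFAULT_GRACE = timedelta(minutes=30)


def should_close_arp_records(
    down_since: Optional[datetime],
    now: datetime,
    grace: timedelta = DEFAULT_GRACE,
) -> bool:
    """Return True only if the netbox has been down for longer than `grace`.

    `down_since` is None while the netbox is considered up (hypothetical field).
    """
    if down_since is None:
        return False
    return now - down_since > grace


now = datetime(2024, 1, 1, 12, 0, 0)
# A 20-second blip should leave ARP records alone; a 45-minute outage should not.
short_blip = should_close_arp_records(now - timedelta(seconds=20), now)
long_outage = should_close_arp_records(now - timedelta(minutes=45), now)
print(short_blip, long_outage)
```

With a check like this, a flapping up flag alone would never be enough to close records; only a sustained outage would.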
This was supposed to be fixed by #2913, but this PR managed to delete the incorrect database rule. The target rule to delete was
Fixed by #2928, expected in a 5.10.2 release
Describe the bug
When the pping daemon detects a box down event (i.e. a number of ICMP echo replies are missing), it both dispatches a boxDown event and immediately sets netbox.up to n (a value that indicates the device is down).
However, the state machinery of NAV (through eventengine) will not actually give the netbox a state of down until it has been unresponsive for more than 4 minutes (default value), and no alerts are sent until it has been unresponsive for at least 1 minute.
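To make the mismatch concrete, here is a rough sketch of the two timelines, assuming the default delays quoted above (1 minute before an alert, more than 4 minutes before the down state). The function names and the "warning" label are illustrative and not part of NAV; unresponsive_for is measured from the moment pping declares the box down.

```python
from datetime import timedelta

WARNING_DELAY = timedelta(minutes=1)  # no alert before this (default per the report)
DOWN_DELAY = timedelta(minutes=4)     # state becomes "down" only after this (default)


def eventengine_view(unresponsive_for: timedelta) -> str:
    """State as the delayed state machinery would see it (illustrative)."""
    if unresponsive_for > DOWN_DELAY:
        return "down"
    if unresponsive_for >= WARNING_DELAY:
        return "warning"
    return "up"


def netbox_up_flag(unresponsive_for: timedelta) -> str:
    """Value of the netbox.up attribute, which is flipped immediately."""
    return "n" if unresponsive_for > timedelta(0) else "y"


for seconds in (0, 20, 90, 300):
    elapsed = timedelta(seconds=seconds)
    print(seconds, netbox_up_flag(elapsed), eventengine_view(elapsed))
```

For a 20-second loss the flag already reads n while the state machinery still considers the netbox up, which is the window in which the flag can flap unnoticed.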
The net effect is that a short-term packet loss will cause the netbox.up database attribute to flip back and forth before anyone notices.
However, there is a database rule that will forcibly close all ARP records associated with this netbox as soon as netbox.up is set to the down-state. This rule was introduced in 3e6f2df as a result of #596 (i.e. the rule is about 13 years old by now).
The rule may have been well-intentioned. It was likely intended to close ARP records for a device that went "permanently" offline (since NAV cannot collect from the device while it is offline, it cannot reliably decide if ARP records should remain open or closed). However, using netbox.up for this is unreliable, since this flag may flap without signifying any kind of "permanence" of the down-state.
To Reproduce
Do not attempt to reproduce in a production environment.
Steps to reproduce the behavior:
[...] arp table, e.g. netbox with id=42:
[...] 42 have now been closed.
Expected behavior
A netbox's ARP records should not be closed as a consequence of a short-lived ICMP packet loss.
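The failure mode in this report can be imitated outside NAV. The sketch below uses an in-memory SQLite database and a trigger as a stand-in for the database rule; the table and column names are simplified assumptions, not NAV's actual schema, and the trigger is only an approximation of how such a rule behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, assumed schema (not NAV's real tables).
    CREATE TABLE netbox (netboxid INTEGER PRIMARY KEY, up TEXT NOT NULL DEFAULT 'y');
    CREATE TABLE arp (
        arpid INTEGER PRIMARY KEY,
        netboxid INTEGER NOT NULL REFERENCES netbox(netboxid),
        ip TEXT NOT NULL,
        end_time TEXT  -- NULL means the record is still open
    );

    -- Stand-in for the problematic rule: close every open ARP record
    -- as soon as the up flag is set to 'n'.
    CREATE TRIGGER close_arp_on_down
    AFTER UPDATE OF up ON netbox
    WHEN NEW.up = 'n'
    BEGIN
        UPDATE arp SET end_time = datetime('now')
        WHERE netboxid = NEW.netboxid AND end_time IS NULL;
    END;

    INSERT INTO netbox (netboxid) VALUES (42);
    INSERT INTO arp (netboxid, ip) VALUES (42, '10.0.0.1'), (42, '10.0.0.2');
    """
)


def open_arp_count(netboxid: int) -> int:
    """Count ARP records for the netbox that have no end time yet."""
    row = conn.execute(
        "SELECT COUNT(*) FROM arp WHERE netboxid = ? AND end_time IS NULL",
        (netboxid,),
    ).fetchone()
    return row[0]


before = open_arp_count(42)
# Simulate a short packet loss: the flag flaps down and straight back up.
conn.execute("UPDATE netbox SET up = 'n' WHERE netboxid = 42")
conn.execute("UPDATE netbox SET up = 'y' WHERE netboxid = 42")
after = open_arp_count(42)
print(before, after)
```

Both records end up closed even though the flag is back to y within the same run; under the expected behavior the open count would be unchanged by such a flap.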
Environment (please complete the following information):