[BUG] All ARP records associated with a router are immediately closed when there is a short ICMP echo packet loss #2910

lunkwill42 · 2024-05-14T08:58:30Z

Describe the bug

When the pping daemon detects a box down event (i.e. a number of ICMP echo replies are missing), it both dispatches a boxDown event and immediately sets netbox.up to 'n' (the value that marks the device as down).

However, the state machinery of NAV (through eventengine) will not actually give the netbox a state of down until it has been unresponsive for more than 4 minutes (default value) - and no alerts are sent until it has been unresponsive for at least 1 minute.

The net effect is that a short-term packet loss will cause the netbox.up database attribute to flip back and forth before anyone notices.
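To make the timing mismatch concrete, here is a rough sketch of which effects an outage of a given length would have, based only on the thresholds quoted above. The function and constant names are made up for illustration and are not part of NAV. The outage is assumed to be measured from the moment pping reports the box as down.

```python
from datetime import timedelta

# Thresholds as described in this issue (defaults); names are hypothetical.
ALERT_DELAY = timedelta(minutes=1)  # no alerts are sent before this
DOWN_DELAY = timedelta(minutes=4)   # netbox is only declared down after this


def visible_effects(outage: timedelta) -> dict:
    """Summarize what happens for an outage of the given length.

    Assumes the outage is measured from the point where pping has
    already dispatched boxDown and flipped netbox.up.
    """
    return {
        "netbox_up_flag_flipped": outage > timedelta(0),
        "alert_sent": outage >= ALERT_DELAY,
        "declared_down": outage > DOWN_DELAY,
    }
```

For a 30-second outage, only the netbox.up flag changes: no alert is sent and the netbox is never declared down, yet anything that reacts to netbox.up has already fired.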

However, there is a database rule that will forcibly close all ARP records associated with this netbox as soon as netbox.up is set to the down-state. This rule was introduced in 3e6f2df as a result of #596 (i.e. the rule is about 13 years old by now).

The rule may have been well-intentioned. It was likely intended to close ARP records for a device that went "permanently" offline (since NAV cannot collect from the device while it is offline, it cannot reliably decide if ARP records should remain open or closed). However, using netbox.up for this is unreliable, since this flag may flap without signifying any kind of "permanence" of the down-state.
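The effect can be illustrated without touching a NAV database. The sketch below uses an in-memory SQLite database with a trigger standing in for the PostgreSQL rule. The table layout is heavily simplified, the trigger name is invented, and modeling open records as end_time IS NULL is a convention of this sketch only, not a description of NAV's actual schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy netbox/arp schema with a close-on-down trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE netbox (
            netboxid INTEGER PRIMARY KEY,
            up TEXT NOT NULL DEFAULT 'y'
        );
        CREATE TABLE arp (
            arpid INTEGER PRIMARY KEY,
            netboxid INTEGER REFERENCES netbox(netboxid),
            ip TEXT,
            end_time TEXT  -- NULL means "still open" in this sketch
        );
        -- Stand-in for the database rule discussed in this issue
        CREATE TRIGGER close_arp_on_down
        AFTER UPDATE OF up ON netbox
        WHEN NEW.up = 'n'
        BEGIN
            UPDATE arp SET end_time = datetime('now')
            WHERE netboxid = OLD.netboxid AND end_time IS NULL;
        END;
        """
    )
    conn.execute("INSERT INTO netbox (netboxid, up) VALUES (42, 'y')")
    conn.executemany(
        "INSERT INTO arp (netboxid, ip) VALUES (?, ?)",
        [(42, "10.0.0.1"), (42, "10.0.0.2")],
    )
    conn.commit()
    return conn


def open_arp_count(conn: sqlite3.Connection, netboxid: int) -> int:
    """Count ARP records for the netbox that are still open."""
    row = conn.execute(
        "SELECT COUNT(*) FROM arp WHERE netboxid = ? AND end_time IS NULL",
        (netboxid,),
    ).fetchone()
    return row[0]


def simulate_flap(conn: sqlite3.Connection, netboxid: int) -> None:
    """Flip netbox.up to 'n' and straight back to 'y'."""
    conn.execute("UPDATE netbox SET up = 'n' WHERE netboxid = ?", (netboxid,))
    conn.execute("UPDATE netbox SET up = 'y' WHERE netboxid = ?", (netboxid,))
    conn.commit()
```

After simulate_flap(), the netbox is marked up again, but its ARP records stay closed. Nothing reopens them until the next ARP collection run creates new records.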

To Reproduce

Do not attempt to reproduce in a production environment.

Steps to reproduce the behavior:

1. Find any netbox (router) that has any number of open ARP records in the arp table, e.g. netbox with id=42.

2. Issue the following SQL:

   UPDATE netbox SET up='n' WHERE netboxid=42;

3. Observe that all ARP records for netbox 42 have now been closed.

Expected behavior

A netbox's ARP records should not be closed as a consequence of a short-lived ICMP packet loss.

Environment (please complete the following information):

NAV version installed: 5.9.1
Method of installation: Any


lunkwill42 · 2024-05-14T10:46:59Z

Although the current behavior is undesirable, it is likely still desirable for NAV to close ARP records after a netbox has been down for a while, since we can no longer verify those records by polling the netbox.

The question is: When is it acceptable to close ARP records associated with a "dead" netbox? Some suggestions:

- It could be acceptable to close them when the netbox is actually declared down by a new boxState entry in alerthist. This happens after 4 minutes of unresponsiveness (by default, configured in eventengine.conf).
- We might want to wait even longer, in which case it would not be achievable through a database rule. We might instead want to add a new subcommand to the navclean program that closes open ARP records for netboxes that have been down for a configurable number of minutes. The limit could then be set per NAV site (the dbclean cron job runs every 5 minutes by default).

lunkwill42 · 2024-05-14T10:48:19Z

I would have to say that I'm leaning towards the latter solution, with some default value provided by NAV. ARP collection runs every 30 minutes by default, so a sensible default could be to close ARP records for devices that have been down for longer than this.
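As a rough sketch of the selection logic such a cleanup job might use (this is not the actual navclean implementation; the function name, the constant, and the down_since mapping are all hypothetical):

```python
from datetime import datetime, timedelta

# Assumption: default threshold mirrors the 30-minute ARP collection
# interval mentioned above.
DEFAULT_MIN_DOWNTIME = timedelta(minutes=30)


def netboxes_due_for_arp_closure(
    down_since: dict,
    now: datetime,
    min_downtime: timedelta = DEFAULT_MIN_DOWNTIME,
) -> list:
    """Return ids of netboxes that have been down longer than min_downtime.

    down_since is a hypothetical mapping of netbox id to the time the
    netbox was declared down (e.g. the start of an unresolved boxState
    alert).
    """
    return sorted(
        netboxid
        for netboxid, since in down_since.items()
        if now - since > min_downtime
    )
```

A short flap never shows up in the result, since a netbox only qualifies once its down-state has outlasted the threshold. The job would then close open ARP records for the returned ids only.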

lunkwill42 · 2024-05-30T05:33:58Z

This was supposed to be fixed by #2913, but that PR deleted the wrong database rule. The rule that should have been deleted was netbox_status_close_arp, but the schema changes instead deleted netbox_close_arp, the rule that closes ARP records when a router is deleted from NAV, which is something entirely different.

lunkwill42 · 2024-05-30T07:27:48Z

Fixed by #2928 - expected in a 5.10.2 release

lunkwill42 changed the title ~~[BUG] All ARP records associated with a router are immediately closed when there is a short ICMP echo packet losd~~ [BUG] All ARP records associated with a router are immediately closed when there is a short ICMP echo packet loss May 14, 2024

lunkwill42 added 802.1X bug and removed 802.1X labels May 14, 2024

lunkwill42 mentioned this issue May 14, 2024

Delay ARP record closures for devices that have been down for a while #2913

Merged

lunkwill42 self-assigned this May 30, 2024

lunkwill42 mentioned this issue May 30, 2024

Reinstate database rule netbox_close_arp and delete netbox_status_close_arp instead #2928

Merged

lunkwill42 closed this as completed May 30, 2024

lunkwill42 mentioned this issue Nov 24, 2024

ARP records are closed in bulk, seemingly at random #2885

Closed

lunkwill42 mentioned this issue Dec 6, 2024

[BUG] ARP records from Palo Alto firewalls keep getting closed and re-opened #3252

Closed
