Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset expiry of entries in the neighbor cache on packet reception #966

Merged
merged 3 commits into from
Aug 5, 2024

Conversation

korken89
Copy link
Contributor

@korken89 korken89 commented Aug 2, 2024

Tentative fix for #965

This does solve the issue in my testing, but I'm not sure if it's the "right thing to do".

@korken89 korken89 force-pushed the neighbor-cache-expiry-fix branch 3 times, most recently from 5aee679 to 11941a5 Compare August 2, 2024 09:15
@thvdveld
Copy link
Contributor

thvdveld commented Aug 2, 2024

I'm not sure either if this is the correct thing to do, I'll have to look into it.

I don't think you need to pass the timestamp every time, you can just call self.now() I think.

@korken89
Copy link
Contributor Author

korken89 commented Aug 2, 2024

Thanks for the pointer, I'll fix!

@korken89
Copy link
Contributor Author

korken89 commented Aug 2, 2024

@thvdveld Another way to solve this is to instead mark the entry as "in use" by a connection to not only evict on "the one with the shortest timeout".
So it would instead be "evict shortest timeout that is not used for a connection" and if all are used by connections then go by the original rule.

This would be even more robust.
If I really push it with this fix applied I can still make the connection unstable.

@thvdveld
Copy link
Contributor

thvdveld commented Aug 2, 2024

From what I understand, Linux [1] does not remove the one with the shortest timeout, they just return an error and drop the packet. However, they have a neighbour cache size of 1024 neighbours. We can also increase the size by using the iface-neighbor-cache-count-1024 feature flag instead of the default iface-neighbor-cache-count-4. (Have you tried this btw?)

What Linux also does is updating the used flag [2], which this PR handles.

Linux keeps track of just more than reachable, etc. [3], which is based on Neighbour Unreachability Detection I think, specified in RFC 4861 Section 7.3. In the end, we should update our implementation to follow this. I know this is for IPv6, but I think they do something similar for IPv4.

I think that for now, updating the expiration timestamp is good enough. What do you think?

[1] https://github.com/torvalds/linux/blob/master/net/core/neighbour.c#L469-L524
[2] https://github.com/torvalds/linux/blob/master/net/core/neighbour.c#L501
[3] https://github.com/torvalds/linux/blob/master/net/core/neighbour.c#L1280-L1490

@Dirbaio
Copy link
Member

Dirbaio commented Aug 2, 2024

I agree we need some kind of "refreshing" MACs to prevent them from expiring while in active use. I'm not sure what's the best way though. There's historically been a few nasty bugs around this, about bad cache entries in broken networks. For example this 6210612 .

I'd be comfortable with this if we made the check stricter. Update expiration if all these hold:

  • Dest IP is unicast and ours (ie packet is explicitly directed at us, we ignore broadcast/multicast. It feels to me like these are more likely to be broken, like the above linked issue)
  • Src IP matches the neighbor cache
  • Src MAC matches the neighbor cache

I think that should fix your issues while keeping the chances of brokenness low.

The ideal fix would be to get "forward progress" confirmation from higher layer (e.g. TCP ACKs coming in for new data) like Linux does, but that's a more involved refactor.

- Check for unicast destination
- Source IP matches cache
- Source hardware address matches cache
@korken89
Copy link
Contributor Author

korken89 commented Aug 2, 2024

Thanks for the feedback and analysis! Great to know what the steps towards a complete fix would be.
I think I have implemented all the requested parts you asked for @Dirbaio, have a look at your leisure.

@korken89
Copy link
Contributor Author

korken89 commented Aug 2, 2024

@thvdveld on the comment of iface-neighbor-cache-count-1024, I have not tried this. However I did increase it now after testing these fixes as a small number can still cause issues.
I set it to 32 in my testing, and could go higher.

1024 seems like a lot though, wouldn't that eat like 20kB of RAM?

@korken89
Copy link
Contributor Author

korken89 commented Aug 5, 2024

I've test-run this over night in the old system that broke about every 30s with the old iface-neighbor-cache-count-4.
Now there was almost no failures at all over the ~15 hour test period.

@thvdveld
Copy link
Contributor

thvdveld commented Aug 5, 2024

@thvdveld on the comment of iface-neighbor-cache-count-1024, I have not tried this. However I did increase it now after testing these fixes as a small number can still cause issues. I set it to 32 in my testing, and could go higher.

1024 seems like a lot though, wouldn't that eat like 20kB of RAM?

1024 is indeed a lot when used on embedded. I looked at Contiki-NG and Zephyr, and they do 16 and 8 respectively. Maybe we should make it 8 by default in smoltcp as well.

@korken89
Copy link
Contributor Author

korken89 commented Aug 5, 2024

Yeah, the 4 is a bit too conservative for a good default. Do you have any preference @Dirbaio ?

@korken89
Copy link
Contributor Author

korken89 commented Aug 5, 2024

I tentatively updated the default to 8.

EDIT: Sorry misread the defaults.

Copy link
Member

@Dirbaio Dirbaio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!

@Dirbaio Dirbaio added this pull request to the merge queue Aug 5, 2024
Merged via the queue into smoltcp-rs:main with commit 8be46b9 Aug 5, 2024
9 checks passed
@korken89 korken89 deleted the neighbor-cache-expiry-fix branch August 5, 2024 11:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants