fix: out connections leak #3077
This PR may contain changes to database schema of one of the drivers. If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues. Please make sure the label |
3f0c865 to 2656819
You can find the image built from this PR at
Built from 3deaacf |
LGTM
Do we need to revisit how missed pings are handled?
If only one side pings maybe we should be more lenient before disconnecting.
It may not be a problem in practice, IDK.
LGTM! Thanks for it! 💯
Great point! I see that the connection should time out after 4-5 missed pings (~10 minutes without being reachable), see line 1261 in e406673.
I think it looks reasonable? Don't think it should give issues, lmk what you think :)
Insightful! Thank you!
Description
Once we started promptly disconnecting from excess `in` connections, we began seeing our nodes significantly exceed their `out` connection targets.

The root cause was a race condition in our keep alive loop (nwaku/waku/node/waku_node.nim, lines 1241 to 1258 in 643ab20).

The case is the following:
1. `nim-libp2p` accepts the connection until our peer manager notices that it's beyond our `in` target and disconnects
2. While we still have the `in` connection, we start running the keep alive loop and have that peer in the list of connected peers that we should ping
3. We close the `in` connection as we noticed it's beyond our target
4. Pinging that peer then opens an `out` connection towards the node

The proposed change to avoid this race condition is to delegate the responsibility of the periodic ping to the node that originally initiated the connection. In other words, whoever initiated a connection is responsible for pinging periodically to keep it open; there's no need to have both nodes pinging each other.
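To make the race easier to follow, here is a minimal toy model in Python. It is not the nwaku or nim-libp2p code: `ToyNode`, `run_race`, and the peer name are hypothetical, and the sketch assumes that pinging a peer we are no longer connected to dials it and so creates a new `out` connection.

```python
from dataclasses import dataclass, field
from enum import Enum


class Direction(Enum):
    IN = "in"
    OUT = "out"


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node; not the nwaku or nim-libp2p API."""

    conns: dict = field(default_factory=dict)  # peer id -> Direction

    def accept(self, peer: str) -> None:
        self.conns[peer] = Direction.IN

    def dial(self, peer: str) -> None:
        self.conns[peer] = Direction.OUT

    def disconnect(self, peer: str) -> None:
        self.conns.pop(peer, None)

    def connected_peers(self, direction=None) -> list:
        # direction=None returns peers in both directions.
        return [p for p, d in self.conns.items()
                if direction is None or d == direction]

    def ping(self, peer: str) -> None:
        # Assumption: a ping to a peer without a live connection dials it,
        # which creates an out connection.
        if peer not in self.conns:
            self.dial(peer)


def run_race(ping_only_out: bool) -> int:
    """Replay the interleaving and return the resulting out connection count."""
    node = ToyNode()
    node.accept("peer-a")                    # inbound connection is accepted
    direction = Direction.OUT if ping_only_out else None
    to_ping = node.connected_peers(direction)  # keep alive loop snapshots peers
    node.disconnect("peer-a")                # peer manager prunes the excess in connection
    for peer in to_ping:                     # keep alive loop pings its snapshot
        node.ping(peer)
    return len(node.connected_peers(Direction.OUT))


if __name__ == "__main__":
    print("ping all peers:", run_race(ping_only_out=False))
    print("ping only out peers:", run_race(ping_only_out=True))
```

In this model, pinging every connected peer leaves one unintended `out` connection behind, while restricting the ping list to `out` peers leaves none, because the pruned `in` peer never enters the snapshot.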
Changes
- `connectedPeers()`: allow getting connected peers from all protocols
- `keepaliveLoop`: only ping nodes in our `out` connections list
Issue
closes #3063