-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
peer: fix competing connections to the same peer #6082
Conversation
971302a
to
cafa2a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀 definitely removing a clear race condition 👍
Just left some nits and also a question (unrelated to this pr) about your persistentPeers
TODO.
server.go
Outdated
// TODO(yy): the bool value seems to be unused, we ask the connmgr to | ||
// make a permanent connection regardless of this value. Needs further | ||
// check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in prunePersistentPeerConnection
to ensure that we dont prune a peer we have marked as permanent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the bool represents if we should/shouldnt keep trying to reconnect with the peer even if we have no active channels to them. But i think you might be right in that it isnt actually used properly...
Here we prune the connection if num active channels is 0
which then will call this which remove this peer from the peristenetPeers
map and cancels connection reqs to it. but then back in the Brontide Start method, it just continues as per usual even though we may now have canceled the connReq to the peer.... hmmm... do you think this is a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it's a bug without a trivial fix...Worst case tho is an error will be returned from that Start method. Thb it's not easy to follow the code. Speaking of which, just curious, any thoughts on how to refactor this peer conn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any thoughts on how to refactor this peer conn?
I guess just continuing with #5700 & then handling the returned value from prunePersistentPeerConnection
properly. Or do you mean something else?
@Roasbeef: review reminder |
cafa2a2
to
8e2ebef
Compare
8e2ebef
to
87fd0a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix, LGTM 🎉
87fd0a1
to
4df1e2e
Compare
This commit fixes the issue where duplicate peers are used both in making persistent connections and bootstrap connections. When we init bootstrapping, we need to ignore peers that have connections already made so far plus peers which we are attempting to make connections with, hence the persistent peers.
4df1e2e
to
edf5926
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
edf5926
to
c9aa034
Compare
When
lnd
starts, two types of connections are made, the persistent connection and the bootstrapping connection. In short, the persistent connection is made in three steps,lnd
's server caches the connection request and sends it toconnMgr
connMgr
makes the connection and sends it backlnd
caches the connection so it remembers the peer being connected.In bootstrapping, we don't rely on
connMgr
(maybe we should?), instead, we directly usebrontide
to make a new connection. Plus, we only attempt to connect with peers that are not connected yet, which is decided by looking into the server's cache.The above two types are both run inside goroutines. Now suppose the persistent connection finishes to step 2 and is waiting on step 3, before the server remembers the peer being connected, the bootstrapping might also take place and attempt a connection to the same peer, which will end in canceling all previous connections from that peer.
Now when the persistent connection moves to step 3, which involves sending an init message to the remote peer, an error will be returned, as shown in the logs,
I suspect there are other places where a connection to a peer is being competed. It's a bit challenging to detect tho as the relevant code needs to be improved for maintainability. Plus the bootstrapping logic is skipped from itest, which will be put back once the itest is fixed.
Mitigate #6000.