peer: fix competing connections to the same peer #6082

Merged on Jan 3, 2022 (3 commits)

yyforyongyu
Member

When lnd starts, it makes two types of connections: persistent connections and bootstrapping connections. In short, a persistent connection is made in three steps:

  1. lnd's server caches the connection request and sends it to connMgr
  2. connMgr makes the connection and sends it back
  3. lnd caches the connection so it remembers the peer being connected.

In bootstrapping, we don't rely on connMgr (maybe we should?). Instead, we use brontide directly to make a new connection. We also only attempt to connect to peers that are not yet connected, which is decided by looking at the server's cache.

Both types of connection run inside goroutines. Now suppose a persistent connection has finished step 2 and is waiting on step 3. Before the server records the peer as connected, bootstrapping might also kick in and attempt a connection to the same peer, which ends up canceling all previous connections from that peer.
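
To illustrate the window described above, here is a minimal, deterministic sketch. It is not lnd's actual code: the `server` type, its fields, and `bootstrapTargets` are hypothetical simplifications, and peers are identified by plain strings instead of public keys.

```go
package main

import "fmt"

// server is a toy stand-in for lnd's server; every name here is hypothetical.
type server struct {
	// connected models the cache that is only updated in step 3.
	connected map[string]bool
	// pendingPersistent models persistent peers whose connection request
	// has already been handed to the connection manager (steps 1 and 2).
	pendingPersistent map[string]bool
}

// bootstrapTargets returns the candidates the bootstrapper would dial when it
// consults only the connected-peer cache.
func (s *server) bootstrapTargets(candidates []string) []string {
	var targets []string
	for _, pub := range candidates {
		if s.connected[pub] {
			continue
		}
		targets = append(targets, pub)
	}
	return targets
}

func main() {
	s := &server{
		connected:         map[string]bool{},
		pendingPersistent: map[string]bool{"alice": true},
	}

	// alice sits between step 2 and step 3: dialed, but not yet cached.
	// The bootstrapper therefore selects her again, which is the
	// competing connection.
	fmt.Println(s.bootstrapTargets([]string{"alice", "bob"})) // prints [alice bob]
}
```

Because the check looks only at `connected`, anything still in flight is invisible to the bootstrapper.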

Then, when the persistent connection moves on to step 3, which involves sending an init message to the remote peer, an error is returned, as shown in the logs:

Dec 09 15:05:03 fullnode lnd[40485]: 2021-12-09 15:05:03.137 [INF] SRVR: Finalizing connection to 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772, inbound=true
Dec 09 15:05:03 fullnode lnd[40485]: 2021-12-09 15:05:03.224 [INF] DISC: Creating new GossipSyncer for peer=03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c
Dec 09 15:05:11 fullnode lnd[40485]: 2021-12-09 15:05:11.321 [INF] PEER: disconnecting 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772, reason: server: disconnecting peer 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772
Dec 09 15:05:11 fullnode lnd[40485]: 2021-12-09 15:05:11.321 [INF] PEER: unable to read message from 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772: read tcp 10.40.0.2:9735->3.95.117.200:49772: use of closed network connection

I suspect there are other places where connections to the same peer compete with each other. They are a bit challenging to detect, though, as the relevant code needs to be improved for maintainability. Also, the bootstrapping logic is currently skipped in the itest; it will be put back once the itest is fixed.

Mitigates #6000.

@yyforyongyu yyforyongyu force-pushed the 6000-peer-conn branch 2 times, most recently from 971302a to cafa2a2 Compare December 13, 2021 00:46
@Roasbeef Roasbeef requested review from ellemouton and removed request for bhandras December 13, 2021 19:23
@Roasbeef Roasbeef added the bug fix, networking, and p2p (code related to the peer-to-peer behaviour) labels Dec 13, 2021
Collaborator

@ellemouton ellemouton left a comment

LGTM 🚀 definitely removing a clear race condition 👍
Just left some nits and also a question (unrelated to this pr) about your persistentPeers TODO.

server.go Outdated
Comment on lines 202 to 204
// TODO(yy): the bool value seems to be unused, we ask the connmgr to
// make a permanent connection regardless of this value. Needs further
// check.
Collaborator

It is used in prunePersistentPeerConnection to ensure that we don't prune a peer we have marked as permanent.

Collaborator

The bool represents whether we should keep trying to reconnect to the peer even if we have no active channels with them. But I think you might be right that it isn't actually used properly...
Here we prune the connection if the number of active channels is 0,
which then calls this, which removes the peer from the persistentPeers map and cancels connection requests to it. But then, back in the Brontide Start method, it just continues as usual even though we may now have canceled the connReq to the peer... hmm... do you think this is a bug?
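
To make the concern concrete, here is a rough sketch of the pruning rule as I understand it from this thread. It is a simplified model, not lnd's implementation: `peerTracker` and `prune` are hypothetical names, and only the meaning of the bool (permanent or not) is taken from the discussion.

```go
package main

import "fmt"

// peerTracker is a hypothetical, simplified model of the bookkeeping
// discussed in this thread; it is not lnd's actual server type.
type peerTracker struct {
	// persistentPeers maps a peer identifier to a flag saying whether the
	// peer is permanent, i.e. kept even when it has no active channels.
	persistentPeers map[string]bool
}

// prune removes a tracked, non-permanent peer that has no active channels and
// reports whether it did so. A caller that ignores the result could carry on
// with a connection request that has effectively been cancelled.
func (t *peerTracker) prune(pub string, activeChans int) bool {
	perm, ok := t.persistentPeers[pub]
	if !ok || perm || activeChans > 0 {
		return false
	}
	delete(t.persistentPeers, pub)
	return true
}

func main() {
	tr := &peerTracker{
		persistentPeers: map[string]bool{"alice": true, "bob": false},
	}

	// alice is permanent, so she survives; bob is pruned exactly once.
	fmt.Println(tr.prune("alice", 0), tr.prune("bob", 0), tr.prune("bob", 0))
	// prints: false true false
}
```

If the Start path checked a result like this, it could bail out early instead of continuing with a peer that was just pruned.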

Member Author

Yeah, I think it's a bug without a trivial fix... The worst case, though, is that an error will be returned from that Start method. Tbh, it's not easy to follow the code. Speaking of which, just curious: any thoughts on how to refactor this peer conn?

Collaborator

> any thoughts on how to refactor this peer conn?

I guess just continuing with #5700 & then handling the returned value from prunePersistentPeerConnection properly. Or do you mean something else?

@lightninglabs-deploy

@Roasbeef: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready

@yyforyongyu yyforyongyu requested review from ellemouton and guggero and removed request for Roasbeef December 21, 2021 17:09
Collaborator

@guggero guggero left a comment

Nice fix, LGTM 🎉

This commit fixes the issue where the same peers are used for both
persistent connections and bootstrap connections. When we init
bootstrapping, we need to ignore peers to which connections have already
been made, plus peers to which we are still attempting to connect, i.e.,
the persistent peers.
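
The idea in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `ignoreSet` and `filterCandidates` are hypothetical helpers, peers are plain strings, and it is assumed that key presence in the persistent map (not the bool value) marks a persistent peer.

```go
package main

import "fmt"

// ignoreSet builds the set of peers that bootstrapping should skip: peers
// that are already connected plus persistent peers that may still be in the
// middle of connecting. Key presence, not the bool value, is what counts
// (an assumption made for this sketch).
func ignoreSet(connected, persistent map[string]bool) map[string]struct{} {
	ignore := make(map[string]struct{}, len(connected)+len(persistent))
	for pub := range connected {
		ignore[pub] = struct{}{}
	}
	for pub := range persistent {
		ignore[pub] = struct{}{}
	}
	return ignore
}

// filterCandidates drops every candidate found in the ignore set and keeps
// the remaining ones in their original order.
func filterCandidates(candidates []string, ignore map[string]struct{}) []string {
	var out []string
	for _, pub := range candidates {
		if _, skip := ignore[pub]; skip {
			continue
		}
		out = append(out, pub)
	}
	return out
}

func main() {
	connected := map[string]bool{"carol": true}
	persistent := map[string]bool{"alice": false}

	ignore := ignoreSet(connected, persistent)
	fmt.Println(filterCandidates([]string{"alice", "bob", "carol"}, ignore))
	// prints: [bob]
}
```

Including the persistent peers closes the gap between dialing a peer and recording it as connected, at the cost of never bootstrapping to a persistent peer even if its connection attempt ultimately fails.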
Collaborator

@ellemouton ellemouton left a comment

LGTM 🚀

@guggero guggero added this to the v0.14.2 milestone Dec 23, 2021
@guggero guggero merged commit 9d6701b into lightningnetwork:master Jan 3, 2022
@yyforyongyu yyforyongyu deleted the 6000-peer-conn branch January 3, 2022 19:31