peer: fix competing connections to the same peer #6082

yyforyongyu · 2021-12-12T23:09:14Z

When lnd starts, two types of connections are made: the persistent connection and the bootstrapping connection. In short, the persistent connection is made in three steps:

1. lnd's server caches the connection request and sends it to connMgr.
2. connMgr makes the connection and sends it back.
3. lnd caches the connection so it remembers the peer being connected.

In bootstrapping, we don't rely on connMgr (maybe we should?). Instead, we directly use brontide to make a new connection. Plus, we only attempt to connect to peers that are not connected yet, which is decided by looking into the server's cache.

Both connection types run inside goroutines. Now suppose the persistent connection has finished step 2 and is waiting on step 3. Before the server remembers the peer as connected, bootstrapping might also kick in and attempt a connection to the same peer, which ends up canceling all previous connections from that peer.

Now when the persistent connection moves to step 3, which involves sending an init message to the remote peer, an error will be returned, as shown in the logs:

Dec 09 15:05:03 fullnode lnd[40485]: 2021-12-09 15:05:03.137 [INF] SRVR: Finalizing connection to 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772, inbound=true
Dec 09 15:05:03 fullnode lnd[40485]: 2021-12-09 15:05:03.224 [INF] DISC: Creating new GossipSyncer for peer=03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c
Dec 09 15:05:11 fullnode lnd[40485]: 2021-12-09 15:05:11.321 [INF] PEER: disconnecting 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772, reason: server: disconnecting peer 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772
Dec 09 15:05:11 fullnode lnd[40485]: 2021-12-09 15:05:11.321 [INF] PEER: unable to read message from 03271338633d2d37b285dae4df40b413d8c6c791fbee7797bc5dc70812196d7d5c@3.95.117.200:49772: read tcp 10.40.0.2:9735->3.95.117.200:49772: use of closed network connection
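
The window described above can be modeled with a small, deterministic sketch. This is not lnd's code: the `server` type, its `connected` and `persistent` fields, and `bootstrapWouldDial` are hypothetical names that only stand in for the server's cache and the pre-fix bootstrap check.

```go
package main

import "fmt"

// server is a toy stand-in for lnd's server. The field names are
// hypothetical and only model the two pieces of state discussed above.
type server struct {
	// connected models the cache that is filled in step 3.
	connected map[string]bool
	// persistent models the connection requests cached in step 1.
	persistent map[string]bool
}

// bootstrapWouldDial models the pre-fix check: bootstrapping only looks
// at the connected cache to decide whether a peer still needs a dial.
func (s *server) bootstrapWouldDial(peer string) bool {
	return !s.connected[peer]
}

func main() {
	s := &server{
		connected:  make(map[string]bool),
		persistent: make(map[string]bool),
	}

	// Step 1: the connection request is cached and handed off.
	s.persistent["peerA"] = true

	// Step 2 has finished elsewhere, but step 3 has not run yet, so the
	// peer is missing from the connected cache and bootstrapping would
	// dial it a second time.
	fmt.Println("during window:", s.bootstrapWouldDial("peerA"))

	// Step 3: the server remembers the peer as connected.
	s.connected["peerA"] = true
	fmt.Println("after step 3:", s.bootstrapWouldDial("peerA"))
}
```

In the real server the two paths run in separate goroutines, so whether the bootstrap check lands inside this window depends on scheduling.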

I suspect there are other places where connections to the same peer compete. They're a bit challenging to detect, though, as the relevant code needs to be improved for maintainability. Plus, the bootstrapping logic is currently skipped in the itest; it will be put back once the itest is fixed.

Mitigate #6000.

ellemouton

LGTM 🚀 definitely removing a clear race condition 👍
Just left some nits and also a question (unrelated to this pr) about your persistentPeers TODO.

ellemouton · 2021-12-14T07:01:40Z

server.go

+	// TODO(yy): the bool value seems to be unused, we ask the connmgr to
+	// make a permanent connection regardless of this value. Needs further
+	// check.


It is used in prunePersistentPeerConnection to ensure that we don't prune a peer we have marked as permanent.

The bool represents whether we should keep trying to reconnect with the peer even if we have no active channels with them. But I think you might be right that it isn't actually used properly...
Here we prune the connection if the number of active channels is 0,
which will then call this, which removes the peer from the persistentPeers map and cancels the connection requests to it. But then, back in the Brontide Start method, it just continues as usual even though we may now have canceled the connReq to the peer... hmm... do you think this is a bug?

Yeah, I think it's a bug without a trivial fix... Worst case, though, is that an error will be returned from that Start method. Tbh it's not easy to follow the code. Speaking of which, just curious, any thoughts on how to refactor this peer conn?

any thoughts on how to refactor this peer conn?

I guess just continuing with #5700 & then handling the returned value from prunePersistentPeerConnection properly. Or do you mean something else?

lightninglabs-deploy · 2021-12-21T07:48:06Z

@Roasbeef: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready

guggero

Nice fix, LGTM 🎉

server.go

This commit fixes an issue where the same peer could be used for both a persistent connection and a bootstrap connection. When we init bootstrapping, we need to ignore both the peers that are already connected and the peers that we are still attempting to connect to, namely the persistent peers.
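
A minimal sketch of the idea in this commit message, assuming peers are identified by plain strings: the ignore set is the union of the connected peers and the persistent peers, and bootstrap candidates are filtered against it. The names `bootstrapIgnoreSet` and `filterCandidates` are hypothetical and not taken from the lnd code.

```go
package main

import "fmt"

// bootstrapIgnoreSet returns the peers bootstrapping should skip: those
// already connected plus those with a persistent connection attempt
// still in flight.
func bootstrapIgnoreSet(connected, persistent []string) map[string]struct{} {
	ignore := make(map[string]struct{}, len(connected)+len(persistent))
	for _, p := range connected {
		ignore[p] = struct{}{}
	}
	for _, p := range persistent {
		ignore[p] = struct{}{}
	}
	return ignore
}

// filterCandidates drops every candidate found in the ignore set and
// keeps the remaining candidates in their original order.
func filterCandidates(candidates []string, ignore map[string]struct{}) []string {
	var out []string
	for _, c := range candidates {
		if _, skip := ignore[c]; !skip {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// peerA is connected, peerB is a persistent peer still being dialed.
	ignore := bootstrapIgnoreSet([]string{"peerA"}, []string{"peerB"})

	// Only peerC remains eligible for a bootstrap connection.
	fmt.Println(filterCandidates([]string{"peerA", "peerB", "peerC"}, ignore))
}
```

Because a persistent peer is in the ignore set from the moment its request is cached, the bootstrapper no longer depends on step 3 having completed.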

ellemouton

LGTM 🚀

yyforyongyu requested review from Roasbeef and bhandras December 12, 2021 23:11

yyforyongyu force-pushed the 6000-peer-conn branch 2 times, most recently from 971302a to cafa2a2 Compare December 13, 2021 00:46

Roasbeef requested review from ellemouton and removed request for bhandras December 13, 2021 19:23

Roasbeef added bug fix networking p2p Code related to the peer-to-peer behaviour labels Dec 13, 2021

ellemouton reviewed Dec 14, 2021

yyforyongyu force-pushed the 6000-peer-conn branch from cafa2a2 to 8e2ebef Compare December 21, 2021 17:09

yyforyongyu requested review from ellemouton and guggero and removed request for Roasbeef December 21, 2021 17:09

yyforyongyu force-pushed the 6000-peer-conn branch from 8e2ebef to 87fd0a1 Compare December 21, 2021 17:10

guggero approved these changes Dec 22, 2021

yyforyongyu force-pushed the 6000-peer-conn branch from 87fd0a1 to 4df1e2e Compare December 23, 2021 07:11

yyforyongyu added 2 commits December 23, 2021 15:14

multi: enhance logging for debugging peer connection

46050fc

yyforyongyu force-pushed the 6000-peer-conn branch from 4df1e2e to edf5926 Compare December 23, 2021 07:15

ellemouton approved these changes Dec 23, 2021

guggero reviewed Dec 23, 2021

guggero added this to the v0.14.2 milestone Dec 23, 2021

docs: add release note for peer conn fix

c9aa034

yyforyongyu force-pushed the 6000-peer-conn branch from edf5926 to c9aa034 Compare December 28, 2021 11:32

guggero merged commit 9d6701b into lightningnetwork:master Jan 3, 2022

yyforyongyu deleted the 6000-peer-conn branch January 3, 2022 19:31

Roasbeef mentioned this pull request Jan 19, 2022

ACINQ channel switches to inactive, can't re-activate #6079

Closed

yyforyongyu mentioned this pull request Jan 23, 2022

itest: fix previously known test flakes #5940

Closed

guggero mentioned this pull request Jan 26, 2022

multi: create v0.14.2-beta-rc1 branch #6201

Merged
