Don't add unresponsive DHT servers to the RT #811
Comments
We're mostly interested in large-scale incidents here and not the random node that happens to not be responsive. From this point of view, statistics can go a long way :) I'd say starting with a sample of 10%-20% of nodes in the routing table would do the trick, IMO.
Actually, wouldn't it suffice to do a query similar to what we added in #810 inside the |
@dennis-tra in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true. See lines 113 to 114 in 2b85cfc.
I think that the easiest way to fix it is to add a flag to the |
Sure 👍 I just thought that this would have still caught the unresponsive peers. The peers were unresponsive because they couldn't open a new stream for the DHT protocol. If we received a query this would have proven that they still can open a stream. However, as you said, this doesn't prove that they respond to queries. Maybe there is a solution without passing an additional boolean flag through the stack? I'm asking because this sounds like the function could be split. Anyways, I'm sure you'll see how it's done better :)
FYI @guillaumemichel that @Jorropo will be your Kubo-maintainer point of contact for supporting you to get these landed.
@yiannisbot : this isn't making it in for 0.20 :(. It will happen after IPFS Thing for 0.21 (release in late May).
#820 has been merged
ETA: 2023Q1
TL;DR
What we have today
There is currently a function verifying that the peer that is about to be added to the Routing Table (RT) advertises the right DHT protocol. However, it doesn't verify whether the peer actually replies to DHT requests before adding it to the RT.
What is the problem
Misconfigured or misbehaving peers could advertise that they speak the right DHT protocol but not answer DHT requests as expected. These nodes may end up in the RT of well-behaving peers, and when a key close to a misbehaving peer is requested, the well-behaving peer will return the `peerid` of the misbehaving peer. The initial requester will then query the unresponsive peer and time out. The propagation of unresponsive peers slows down the lookup process.
How to fix it
Before adding a remote peer to its RT, a node should verify that this remote peer is able to answer DHT queries correctly. It can verify this by making a DHT query to this peer (if not done before).
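The fix described above can be sketched as a probe that gates admission into the routing table. This is a minimal illustration, not the go-libp2p-kad-dht implementation: `PeerID`, `Prober`, `GatedTable`, `Admit` and `ErrUnresponsive` are hypothetical names introduced here, and the probe is a plain function standing in for a real DHT query over the network.

```go
package main

import (
	"errors"
	"fmt"
)

// PeerID stands in for libp2p's peer ID type (simplified for this sketch).
type PeerID string

// ErrUnresponsive is a hypothetical error for peers that fail the probe.
var ErrUnresponsive = errors.New("peer did not answer the DHT query")

// Prober sends one DHT query to a peer and returns an error if the peer
// does not answer. A real node would open a DHT stream and wait for a reply.
type Prober func(p PeerID) error

// GatedTable is a toy routing table that only admits peers that were probed.
type GatedTable struct {
	probe Prober
	peers map[PeerID]struct{}
}

func NewGatedTable(probe Prober) *GatedTable {
	return &GatedTable{probe: probe, peers: make(map[PeerID]struct{})}
}

// Admit adds p to the table. alreadyQueried is true when p has just answered
// a DHT query, in which case no additional probe is sent.
func (t *GatedTable) Admit(p PeerID, alreadyQueried bool) error {
	if _, ok := t.peers[p]; ok {
		return nil // already in the table
	}
	if !alreadyQueried {
		if err := t.probe(p); err != nil {
			return fmt.Errorf("not adding %s to the RT: %w", p, err)
		}
	}
	t.peers[p] = struct{}{}
	return nil
}

// Contains reports whether p is currently in the table.
func (t *GatedTable) Contains(p PeerID) bool {
	_, ok := t.peers[p]
	return ok
}

func main() {
	// Assumed probe: the peer named "silent" never answers.
	probe := func(p PeerID) error {
		if p == "silent" {
			return ErrUnresponsive
		}
		return nil
	}
	rt := NewGatedTable(probe)
	fmt.Println(rt.Admit("responsive", false))
	fmt.Println(rt.Admit("silent", false))
	fmt.Println(rt.Contains("responsive"), rt.Contains("silent"))
}
```

The `alreadyQueried` argument mirrors the "if not done before" condition: a peer that has just answered a query has already proven itself, so probing it again would only add load.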
What is the expected impact
Nodes that have applied the patch no longer have unresponsive DHT servers in their RT, and they don't spread them anymore. However, they can still fall victim to unresponsive DHT servers propagated by outdated peers. The DHT is expected to get faster as more peers apply the patch.
Checklist
Adding peers to the RT
New peers are added to the RT using the `TryAddPeer` function only. Note that the `queryPeer` `bool` variable is only useful for setting `LastUsefulAt`, not for knowing whether the peer is a valid DHT server.
The `TryAddPeer` function is only called from the `rtPeerLoop` function.
go-libp2p-kad-dht/dht.go
Lines 590 to 597 in 2b85cfc
The only writer to the `dht.addPeerToRTChan` `chan` is the `peerFound` function.
go-libp2p-kad-dht/dht.go
Lines 625 to 659 in 2b85cfc
The only check performed by `peerFound` before adding the peer to the RT is calling `validRTPeer`. This function only verifies that the remote peer advertises that it speaks the DHT protocol. It doesn't verify that the remote peer is actually responsive to DHT requests, so this check isn't enough.
go-libp2p-kad-dht/subscriber_notifee.go
Lines 145 to 155 in 2b85cfc
We will list the callers of `peerFound` and explain how they add peers to the RT.
New(...) (*IpfsDHT, error)
go-libp2p-kad-dht/dht.go
Lines 234 to 239 in 2b85cfc
When creating the DHT, the node tries to add all connected peers to its RT. Because of `validRTPeer`, only the peers advertising the DHT protocol will get added.
In this case, it is important to make a DHT query (for any key) to all remote peers that are about to be added to the RT, and only add them to the RT if they answer the DHT query without error.
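Since many peers may be connected when the DHT is created, the probes could be sent concurrently. The sketch below shows one possible shape for this, with a bounded number of probes in flight. `probeAll`, `PeerID`, `errNoAnswer` and the `maxParallel` parameter are hypothetical names for illustration, and the probe function is an assumed stand-in for a real DHT query.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// PeerID stands in for libp2p's peer ID type (simplified for this sketch).
type PeerID string

// errNoAnswer is a hypothetical error returned by a failed probe.
var errNoAnswer = errors.New("no answer to DHT query")

// probeAll probes every candidate, running at most maxParallel probes at a
// time, and returns (in input order) the peers whose probe succeeded.
func probeAll(candidates []PeerID, probe func(PeerID) error, maxParallel int) []PeerID {
	if maxParallel < 1 {
		maxParallel = 1
	}
	ok := make([]bool, len(candidates))
	sem := make(chan struct{}, maxParallel) // limits in-flight probes
	var wg sync.WaitGroup
	for i, p := range candidates {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, p PeerID) {
			defer wg.Done()
			defer func() { <-sem }()
			ok[i] = probe(p) == nil // each goroutine writes its own index
		}(i, p)
	}
	wg.Wait()

	var responsive []PeerID
	for i, p := range candidates {
		if ok[i] {
			responsive = append(responsive, p)
		}
	}
	return responsive
}

func main() {
	connected := []PeerID{"a", "b", "c"}
	// Assumed probe: peer "b" advertises the protocol but never answers.
	probe := func(p PeerID) error {
		if p == "b" {
			return errNoAnswer
		}
		return nil
	}
	fmt.Println(probeAll(connected, probe, 2))
}
```

Only the peers returned by `probeAll` would then be handed to the routing table. The bound on parallel probes is a design choice of this sketch, not something the issue specifies.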
fixLowPeers
go-libp2p-kad-dht/dht.go
Lines 475 to 480 in 2b85cfc
Similar to the case above.
handleNewMessage
go-libp2p-kad-dht/dht_net.go
Lines 113 to 114 in 2b85cfc
If a peer sent us a DHT request (and advertises being a DHT server), it is added to the RT.
Before adding these peers to the RT, make sure that they answer DHT queries correctly.
queryPeer
go-libp2p-kad-dht/query.go
Lines 407 to 420 in 2b85cfc
Upon a successful DHT request, the remote peer is added to the RT.
No further action is required, as the node has proven that it answers DHT queries.
DHT requests upon RTRefresh
In addition to the proposed fix, it is possible to periodically make DHT requests to all nodes in the RT to make sure they are still responsive to DHT queries. This operation is more expensive, however, as 1 additional query is sent every 10 minutes for each RT member.
#810
A probabilistic approach can decrease the network load of additional DHT requests. At every refresh, the node only sends DHT requests to a fraction of the nodes inside a bucket, for all buckets. If it detects some unresponsive nodes, the fraction of peers queried in each bucket is increased for the next refresh. If it doesn't detect any unresponsive nodes for a while, the fraction of peers queried decreases at the next refresh, but never goes below a magic threshold.
This technique keeps the overhead low when the network is behaving correctly. If a significant share of the network acts in an adversarial way, the unresponsive DHT nodes are detected and removed from the RT, at the price of a higher network load.
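The adaptive fraction described above could look like the following sketch. The issue does not specify how the fraction grows or shrinks, so the doubling and halving, the number of quiet refreshes required before shrinking, and all names (`AdaptiveSampler`, `Observe`, `SampleSize`) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// AdaptiveSampler tracks which fraction of each bucket to probe per refresh.
type AdaptiveSampler struct {
	fraction    float64 // current fraction of peers probed in each bucket
	min, max    float64 // bounds; min is the threshold the fraction never goes below
	factor      float64 // assumed multiplicative step for growing and shrinking
	quietNeeded int     // quiet refreshes required before shrinking
	quiet       int     // consecutive refreshes without unresponsive peers
}

// NewAdaptiveSampler starts at the minimum fraction.
func NewAdaptiveSampler(min, max, factor float64, quietNeeded int) *AdaptiveSampler {
	return &AdaptiveSampler{fraction: min, min: min, max: max, factor: factor, quietNeeded: quietNeeded}
}

// Observe records how many unresponsive peers the last refresh found and
// adjusts the fraction used for the next refresh.
func (s *AdaptiveSampler) Observe(unresponsive int) {
	if unresponsive > 0 {
		s.quiet = 0
		s.fraction = math.Min(s.max, s.fraction*s.factor)
		return
	}
	s.quiet++
	if s.quiet >= s.quietNeeded {
		s.quiet = 0
		s.fraction = math.Max(s.min, s.fraction/s.factor)
	}
}

// SampleSize returns how many peers of a bucket of the given size to probe.
func (s *AdaptiveSampler) SampleSize(bucketLen int) int {
	if bucketLen <= 0 {
		return 0
	}
	n := int(math.Ceil(s.fraction * float64(bucketLen)))
	if n > bucketLen {
		n = bucketLen
	}
	return n
}

func main() {
	// Assumed parameters: probe 12.5% to 100% of a bucket, doubling on
	// failures and halving after 3 quiet refreshes.
	s := NewAdaptiveSampler(0.125, 1.0, 2, 3)
	for _, unresponsive := range []int{0, 2, 1, 0, 0, 0} {
		fmt.Printf("probing %d of 20 peers per bucket\n", s.SampleSize(20))
		s.Observe(unresponsive)
	}
}
```

Growing quickly and shrinking slowly is one way to react fast to an incident while keeping the steady-state load close to the minimum. Other update rules would fit the description equally well.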
IMO periodically verifying if nodes in the RT still answer to DHT queries isn't needed (at least for now). Preventing unresponsive nodes from being added to the RT in the first place should be sufficient.