Don't add unresponsive DHT servers to the RT #811

Closed
4 tasks done
guillaumemichel opened this issue Feb 6, 2023 · 8 comments
guillaumemichel commented Feb 6, 2023

ETA: 2023Q1

TL;DR

What we have today

There is currently a function verifying that a peer about to be added to the Routing Table (RT) advertises the right DHT protocol. However, it doesn't verify whether the peer actually replies to DHT requests before adding it to the RT.

What is the problem

Misconfigured or misbehaving peers can advertise that they speak the right DHT protocol but fail to answer DHT requests as expected. These nodes may end up in the RT of well-behaving peers, and upon a request for a key that is close to a misbehaving peer, the well-behaving peer will return the peer ID of the misbehaving peer. The initial requester will then query the unresponsive peer and time out. This propagation of unresponsive peers slows down the lookup process.

How to fix it

Before adding a remote peer to its RT, a node should verify that the remote peer is able to answer DHT queries correctly. It can verify this by making a DHT query to the peer (if it hasn't already done so).

What is the expected impact

Nodes that have applied the patch no longer keep unresponsive DHT servers in their RT, and they no longer spread them. However, they can still be victims of unresponsive-DHT-server propagation from outdated peers. The DHT is expected to get faster as more peers apply the patch.

Checklist

  • Consensus on the changes to perform
  • Implement changes
  • Document the change at the relevant location
  • Merge the change

Adding peers to the RT

New peers are added to the DHT using the TryAddPeer function only. Note that the queryPeer bool variable is only used to set LastUsefulAt, not to determine whether the peer is a valid DHT server.

The TryAddPeer function is only called from the rtPeerLoop function.

go-libp2p-kad-dht/dht.go

Lines 590 to 597 in 2b85cfc

case addReq := <-dht.addPeerToRTChan:
    prevSize := dht.routingTable.Size()
    if prevSize == 0 {
        isBootsrapping = true
        bootstrapCount = 0
        timerCh = nil
    }
    newlyAdded, err := dht.routingTable.TryAddPeer(addReq.p, addReq.queryPeer, isBootsrapping)

The only writer to the dht.addPeerToRTChan channel is the peerFound function.

go-libp2p-kad-dht/dht.go

Lines 625 to 659 in 2b85cfc

// peerFound signals the routingTable that we've found a peer that
// might support the DHT protocol.
// If we have a connection to a peer but no exchange of a query RPC ->
//
// LastQueriedAt=time.Now (so we don't ping it for some time for a liveliness check)
// LastUsefulAt=0
//
// If we connect to a peer and then exchange a query RPC ->
//
// LastQueriedAt=time.Now (same reason as above)
// LastUsefulAt=time.Now (so we give it some life in the RT without immediately evicting it)
//
// If we query a peer we already have in our Routing Table ->
//
// LastQueriedAt=time.Now()
// LastUsefulAt remains unchanged
//
// If we connect to a peer we already have in the RT but do not exchange a query (rare)
//
// Do Nothing.
func (dht *IpfsDHT) peerFound(ctx context.Context, p peer.ID, queryPeer bool) {
    if c := baseLogger.Check(zap.DebugLevel, "peer found"); c != nil {
        c.Write(zap.String("peer", p.String()))
    }
    b, err := dht.validRTPeer(p)
    if err != nil {
        logger.Errorw("failed to validate if peer is a DHT peer", "peer", p, "error", err)
    } else if b {
        select {
        case dht.addPeerToRTChan <- addPeerRTReq{p, queryPeer}:
        case <-dht.ctx.Done():
            return
        }
    }
}

The only check peerFound performs before adding a peer to the RT is a call to validRTPeer. That function only verifies that the remote peer advertises the DHT protocol; it doesn't verify that the peer actually responds to DHT requests. This check isn't enough.

// validRTPeer returns true if the peer supports the DHT protocol and false otherwise. Supporting the DHT protocol means
// supporting the primary protocols, we do not want to add peers that are speaking obsolete secondary protocols to our
// routing table
func (dht *IpfsDHT) validRTPeer(p peer.ID) (bool, error) {
    b, err := dht.peerstore.FirstSupportedProtocol(p, dht.protocolsStrs...)
    if len(b) == 0 || err != nil {
        return false, err
    }
    return dht.routingTablePeerFilter == nil || dht.routingTablePeerFilter(dht, p), nil
}

We will list the callers of peerFound and explain how they add peers to the RT.

New(...) (*IpfsDHT, error)

go-libp2p-kad-dht/dht.go

Lines 234 to 239 in 2b85cfc

// Fill routing table with currently connected peers that are DHT servers
dht.plk.Lock()
for _, p := range dht.host.Network().Peers() {
    dht.peerFound(dht.ctx, p, false)
}
dht.plk.Unlock()

When creating the DHT, the node tries to add all connected peers to its RT. Because of validRTPeer, only the peers advertising the DHT protocol will get added.

In this case, it is important to make a DHT query (for any key) to all remote peers that are about to be added to the RT, and only add them if they answer the DHT query without error.

fixLowPeers

go-libp2p-kad-dht/dht.go

Lines 475 to 480 in 2b85cfc

// we try to add all peers we are connected to to the Routing Table
// in case they aren't already there.
for _, p := range dht.host.Network().Peers() {
    dht.peerFound(ctx, p, false)
}

Similar to the case above.

handleNewMessage

// a peer has queried us, let's add it to RT
dht.peerFound(dht.ctx, mPeer, true)

If a peer sent us a DHT request (and advertises being a DHT server), add it to the RT.

Before adding these peers to the RT, we should make sure that they answer DHT queries correctly.

queryPeer

go-libp2p-kad-dht/query.go

Lines 407 to 420 in 2b85cfc

// send query RPC to the remote peer
newPeers, err := q.queryFn(queryCtx, p)
if err != nil {
    if queryCtx.Err() == nil {
        q.dht.peerStoppedDHT(q.dht.ctx, p)
    }
    ch <- &queryUpdate{cause: p, unreachable: []peer.ID{p}}
    return
}
queryDuration := time.Since(startQuery)
// query successful, try to add to RT
q.dht.peerFound(q.dht.ctx, p, true)

Upon successful DHT request, the remote peer is added to the RT.

No further action is required, as the remote peer has already proven that it answers DHT queries.

DHT requests upon RTRefresh

In addition to the proposed fix, it is possible to periodically make DHT requests to all nodes in the RT to make sure they are still responsive to DHT queries. This operation is more expensive, however, as one additional query is sent every 10 minutes for each RT member.

#810

A probabilistic approach can decrease the network load of the additional DHT requests. At every refresh, the node only sends DHT requests to a fraction of the nodes in each bucket. If it detects unresponsive nodes, the fraction of peers queried in each bucket increases for the next refresh. If it doesn't detect any unresponsive nodes for a while, the fraction decreases at the next refresh, but never drops below a magic threshold.
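The adaptive fraction described above could look like the sketch below. The step size, the floor (the "magic threshold"), and the function name are illustrative assumptions, not values from any implementation.

```go
package main

import "fmt"

// Illustrative tuning constants; not taken from the actual implementation.
const (
	minFraction = 0.1 // never probe fewer than 10% of a bucket
	maxFraction = 1.0
	step        = 0.1
)

// nextFraction adjusts the share of bucket members probed at each refresh:
// grow it when unresponsive peers were found, shrink it (down to the floor)
// after a clean refresh.
func nextFraction(cur float64, foundUnresponsive bool) float64 {
	if foundUnresponsive {
		if cur+step > maxFraction {
			return maxFraction
		}
		return cur + step
	}
	if cur-step < minFraction {
		return minFraction
	}
	return cur - step
}

func main() {
	f := minFraction
	f = nextFraction(f, true) // unresponsive peers seen: probe more
	fmt.Println(f)            // 0.2
	f = nextFraction(f, false) // clean refresh: back off
	fmt.Println(f)            // 0.1
	f = nextFraction(f, false) // never drops below the floor
	fmt.Println(f)            // 0.1
}
```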

This technique keeps overhead low when the network is behaving correctly. And if a significant share of the network acts adversarially, the unresponsive DHT nodes are detected and removed from the RT, at the price of higher network load.

IMO, periodically verifying whether nodes in the RT still answer DHT queries isn't needed (at least for now). Preventing unresponsive nodes from being added to the RT in the first place should be sufficient.

@guillaumemichel guillaumemichel changed the title Make sure peers answer to DHT requests before adding them to the RT Don't add unresponsive DHT servers to the RT Feb 6, 2023
@yiannisbot

We're mostly interested in large-scale incidents here and not the random node that happens to not be responsive. From this point of view, statistics can go a long way :) I'd say starting with a sample of 10%-20% of nodes in the routing table would do the trick, IMO.

@dennis-tra

Actually, wouldn't it suffice to do a query similar to what we added in #810 inside the peerFound function if the queryPeer flag is false? This would definitely have caught the unresponsive nodes we observed recently. I'm not sure if this would break any separation of concerns inside the code base - I'd need another look through the code. Just wanted to leave it here as food for thought.

@guillaumemichel

@dennis-tra in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.

// a peer has queried us, let's add it to RT
dht.peerFound(dht.ctx, mPeer, true)

I think the easiest way to fix it is to add a flag to the peerFound function and propagate it to validRTPeer, where the same check as in #810 is performed. I'll work on it :)
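The flag-threading idea could be sketched roughly as below. All names and types here are illustrative stand-ins, not the actual go-libp2p-kad-dht symbols or the eventual #820 implementation.

```go
package main

import "fmt"

// dht bundles the two checks as injectable functions so the sketch is
// self-contained: protocol advertisement and query responsiveness.
type dht struct {
	advertisesProtocol func(p string) bool
	answersQuery       func(p string) bool // stands in for the probe from #810
}

// validRTPeer gains an extra flag: when checkResponsiveness is set, an
// advertised protocol alone is no longer enough.
func (d *dht) validRTPeer(p string, checkResponsiveness bool) bool {
	if !d.advertisesProtocol(p) {
		return false
	}
	if checkResponsiveness && !d.answersQuery(p) {
		return false
	}
	return true
}

// peerFound requests a responsiveness check whenever the peer has not
// already proven itself by answering one of our queries (queryPeer=false).
func (d *dht) peerFound(p string, queryPeer bool, addToRT func(string)) {
	if d.validRTPeer(p, !queryPeer) {
		addToRT(p)
	}
}

func main() {
	d := &dht{
		advertisesProtocol: func(p string) bool { return true },
		answersQuery:       func(p string) bool { return p != "silent" },
	}
	rt := map[string]bool{}
	add := func(p string) { rt[p] = true }
	d.peerFound("responsive", false, add)
	d.peerFound("silent", false, add)
	fmt.Println(rt["responsive"], rt["silent"]) // true false
}
```

Note that, as discussed above, queryPeer=true is also set when a peer merely queried us, which proves it can open a stream but not that it answers queries; whether that case should also trigger the probe is part of the open design question.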

@dennis-tra

in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.

Sure 👍 I just thought that this would have still caught the unresponsive peers. The peers were unresponsive because they couldn't open a new stream for the DHT protocol. If we received a query this would have proven that they still can open a stream. However, as you said, this doesn't prove that they respond to queries.


Maybe there is a solution without passing an additional boolean flag through the stack? I'm asking because this sounds like the function could be split. Anyways, I'm sure you'll see how it's done better :)

BigLep commented Mar 8, 2023

FYI @guillaumemichel that @Jorropo will be your Kubo-maintainer point of contact for supporting you to get these landed.

@yiannisbot

@Jorropo @BigLep do we have an ETA for reviewing this and getting it merged? It's one of the few remaining action items from the incidents earlier in the year and one of the milestones we've committed to ship by end of 2023Q1.

BigLep commented Apr 12, 2023

@yiannisbot : this isn't making it in for 0.20 :(. It will happen after IPFS Thing for 0.21 (release in late May).

@guillaumemichel

#820 has been merged
