Don't add unresponsive DHT servers to the RT #811

Closed
4 tasks done
guillaumemichel opened this issue Feb 6, 2023 · 8 comments
guillaumemichel commented Feb 6, 2023

ETA: 2023Q1

TL;DR

What we have today

There is currently a function verifying that a peer about to be added to the Routing Table (RT) advertises the right DHT protocol. However, it doesn't verify whether the peer actually replies to DHT requests before adding it to the RT.

What is the problem

Misconfigured or misbehaving peers can advertise that they speak the right DHT protocol but fail to answer DHT requests as expected. These nodes may end up in the RT of well-behaving peers, and upon a request for a key that is close to a misbehaving peer, the well-behaving peer will return the peer ID of the misbehaving peer. The initial requester will then query the unresponsive peer and time out. This propagation of unresponsive peers slows down the lookup process.

How to fix it

Before adding a remote peer to its RT, a node should verify that the remote peer is able to answer DHT queries correctly. It can verify this by making a DHT query to the peer (if it hasn't already done so).

What is the expected impact

Nodes that have applied the patch no longer keep unresponsive DHT servers in their RT, and they no longer spread them. However, they can still be victims of unresponsive-DHT-server propagation from outdated peers. The DHT is expected to get faster as more peers apply the patch.

Checklist

  • Consensus on the changes to perform
  • Implement changes
  • Document the change at the relevant location
  • Merge the change

Adding peers to the RT

New peers are added to the DHT using the TryAddPeer function only. Note that the queryPeer bool variable is only used to set LastUsefulAt, not to determine whether the peer is a valid DHT server.

The TryAddPeer function is only called from the rtPeerLoop function.

go-libp2p-kad-dht/dht.go

Lines 590 to 597 in 2b85cfc

case addReq := <-dht.addPeerToRTChan:
    prevSize := dht.routingTable.Size()
    if prevSize == 0 {
        isBootsrapping = true
        bootstrapCount = 0
        timerCh = nil
    }
    newlyAdded, err := dht.routingTable.TryAddPeer(addReq.p, addReq.queryPeer, isBootsrapping)

The only writer to the dht.addPeerToRTChan channel is the peerFound function.

go-libp2p-kad-dht/dht.go

Lines 625 to 659 in 2b85cfc

// peerFound signals the routingTable that we've found a peer that
// might support the DHT protocol.
// If we have a connection to a peer but no exchange of a query RPC ->
//
// LastQueriedAt=time.Now (so we don't ping it for some time for a liveliness check)
// LastUsefulAt=0
//
// If we connect to a peer and then exchange a query RPC ->
//
// LastQueriedAt=time.Now (same reason as above)
// LastUsefulAt=time.Now (so we give it some life in the RT without immediately evicting it)
//
// If we query a peer we already have in our Routing Table ->
//
// LastQueriedAt=time.Now()
// LastUsefulAt remains unchanged
//
// If we connect to a peer we already have in the RT but do not exchange a query (rare)
//
// Do Nothing.
func (dht *IpfsDHT) peerFound(ctx context.Context, p peer.ID, queryPeer bool) {
    if c := baseLogger.Check(zap.DebugLevel, "peer found"); c != nil {
        c.Write(zap.String("peer", p.String()))
    }
    b, err := dht.validRTPeer(p)
    if err != nil {
        logger.Errorw("failed to validate if peer is a DHT peer", "peer", p, "error", err)
    } else if b {
        select {
        case dht.addPeerToRTChan <- addPeerRTReq{p, queryPeer}:
        case <-dht.ctx.Done():
            return
        }
    }
}

The only check peerFound performs before adding a peer to the RT is a call to validRTPeer. That function only verifies that the remote peer advertises the DHT protocol; it doesn't verify that the peer actually responds to DHT requests. This check isn't enough.

// validRTPeer returns true if the peer supports the DHT protocol and false otherwise. Supporting the DHT protocol means
// supporting the primary protocols, we do not want to add peers that are speaking obsolete secondary protocols to our
// routing table
func (dht *IpfsDHT) validRTPeer(p peer.ID) (bool, error) {
    b, err := dht.peerstore.FirstSupportedProtocol(p, dht.protocolsStrs...)
    if len(b) == 0 || err != nil {
        return false, err
    }
    return dht.routingTablePeerFilter == nil || dht.routingTablePeerFilter(dht, p), nil
}

We will list the callers of peerFound and explain how they add peers to the RT.

New(...) (*IpfsDHT, error)

go-libp2p-kad-dht/dht.go

Lines 234 to 239 in 2b85cfc

// Fill routing table with currently connected peers that are DHT servers
dht.plk.Lock()
for _, p := range dht.host.Network().Peers() {
    dht.peerFound(dht.ctx, p, false)
}
dht.plk.Unlock()

When creating the DHT, the node tries to add all connected peers to its RT. Because of validRTPeer, only the peers advertising the DHT protocol will get added.

In this case, it is important to make a DHT query (for any key) to all remote peers that are about to be added to the RT, and only add them if they answer the DHT query without error.

fixLowPeers

go-libp2p-kad-dht/dht.go

Lines 475 to 480 in 2b85cfc

// we try to add all peers we are connected to to the Routing Table
// in case they aren't already there.
for _, p := range dht.host.Network().Peers() {
    dht.peerFound(ctx, p, false)
}

Similar to the case above.

handleNewMessage

// a peer has queried us, let's add it to RT
dht.peerFound(dht.ctx, mPeer, true)

If a peer sent us a DHT request (and advertises being a DHT server), add it to the RT.

Before adding these peers to the RT, we should make sure that they answer DHT queries correctly.

queryPeer

go-libp2p-kad-dht/query.go

Lines 407 to 420 in 2b85cfc

// send query RPC to the remote peer
newPeers, err := q.queryFn(queryCtx, p)
if err != nil {
    if queryCtx.Err() == nil {
        q.dht.peerStoppedDHT(q.dht.ctx, p)
    }
    ch <- &queryUpdate{cause: p, unreachable: []peer.ID{p}}
    return
}
queryDuration := time.Since(startQuery)
// query successful, try to add to RT
q.dht.peerFound(q.dht.ctx, p, true)

Upon successful DHT request, the remote peer is added to the RT.

No further action is required, as the remote peer has already proven that it answers DHT queries.

DHT requests upon RTRefresh

In addition to the proposed fix, it is possible to periodically make DHT requests to all nodes in the RT to make sure they are still responsive to DHT queries. This operation is more expensive, however, as one additional query is sent every 10 minutes for each RT member.

#810

A probabilistic approach can decrease the network load of the additional DHT requests. At every refresh, the node only sends DHT requests to a fraction of the nodes in each bucket. If it detects unresponsive nodes, the fraction of peers queried in each bucket increases for the next refresh. If it doesn't detect any unresponsive nodes for a while, the fraction decreases at the next refresh, but never drops below a magic threshold.
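The adaptive fraction described above could look like the sketch below. The step size, the floor (the "magic threshold"), and the function name are illustrative assumptions, not values from any implementation.

```go
package main

import "fmt"

// Illustrative tuning constants; not taken from the actual implementation.
const (
	minFraction = 0.1 // never probe fewer than 10% of a bucket
	maxFraction = 1.0
	step        = 0.1
)

// nextFraction adjusts the share of bucket members probed at each refresh:
// grow it when unresponsive peers were found, shrink it (down to the floor)
// after a clean refresh.
func nextFraction(cur float64, foundUnresponsive bool) float64 {
	if foundUnresponsive {
		if cur+step > maxFraction {
			return maxFraction
		}
		return cur + step
	}
	if cur-step < minFraction {
		return minFraction
	}
	return cur - step
}

func main() {
	f := minFraction
	f = nextFraction(f, true) // unresponsive peers seen: probe more
	fmt.Println(f)            // 0.2
	f = nextFraction(f, false) // clean refresh: back off
	fmt.Println(f)            // 0.1
	f = nextFraction(f, false) // never drops below the floor
	fmt.Println(f)            // 0.1
}
```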

This technique keeps overhead low when the network is behaving correctly. And if a significant share of the network acts adversarially, the unresponsive DHT nodes are detected and removed from the RT, at the price of higher network load.

IMO, periodically verifying whether nodes in the RT still answer DHT queries isn't needed (at least for now). Preventing unresponsive nodes from being added to the RT in the first place should be sufficient.

@guillaumemichel guillaumemichel changed the title Make sure peers answer to DHT requests before adding them to the RT Don't add unresponsive DHT servers to the RT Feb 6, 2023
@yiannisbot

We're mostly interested in large-scale incidents here and not the random node that happens to not be responsive. From this point of view, statistics can go a long way :) I'd say starting with a sample of 10%-20% of nodes in the routing table would do the trick, IMO.

@dennis-tra

Actually, wouldn't it suffice to do a query similar to what we added in #810 inside the peerFound function if the queryPeer flag is false? This would definitely have caught the unresponsive nodes we observed recently. I'm not sure if this would break any separation of concerns inside the code base - I'd need another look through the code. Just wanted to leave it here as food for thought.

@guillaumemichel

@dennis-tra in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.

// a peer has queried us, let's add it to RT
dht.peerFound(dht.ctx, mPeer, true)

I think the easiest way to fix it is to add a flag to the peerFound function and propagate it to validRTPeer, where the same check as in #810 is performed. I'll work on it :)
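The flag-threading idea could be sketched roughly as below. All names and types here are illustrative stand-ins, not the actual go-libp2p-kad-dht symbols or the eventual #820 implementation.

```go
package main

import "fmt"

// dht bundles the two checks as injectable functions so the sketch is
// self-contained: protocol advertisement and query responsiveness.
type dht struct {
	advertisesProtocol func(p string) bool
	answersQuery       func(p string) bool // stands in for the probe from #810
}

// validRTPeer gains an extra flag: when checkResponsiveness is set, an
// advertised protocol alone is no longer enough.
func (d *dht) validRTPeer(p string, checkResponsiveness bool) bool {
	if !d.advertisesProtocol(p) {
		return false
	}
	if checkResponsiveness && !d.answersQuery(p) {
		return false
	}
	return true
}

// peerFound requests a responsiveness check whenever the peer has not
// already proven itself by answering one of our queries (queryPeer=false).
func (d *dht) peerFound(p string, queryPeer bool, addToRT func(string)) {
	if d.validRTPeer(p, !queryPeer) {
		addToRT(p)
	}
}

func main() {
	d := &dht{
		advertisesProtocol: func(p string) bool { return true },
		answersQuery:       func(p string) bool { return p != "silent" },
	}
	rt := map[string]bool{}
	add := func(p string) { rt[p] = true }
	d.peerFound("responsive", false, add)
	d.peerFound("silent", false, add)
	fmt.Println(rt["responsive"], rt["silent"]) // true false
}
```

Note that, as discussed above, queryPeer=true is also set when a peer merely queried us, which proves it can open a stream but not that it answers queries; whether that case should also trigger the probe is part of the open design question.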

@dennis-tra

in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.

Sure 👍 I just thought that this would have still caught the unresponsive peers. The peers were unresponsive because they couldn't open a new stream for the DHT protocol. If we received a query this would have proven that they still can open a stream. However, as you said, this doesn't prove that they respond to queries.


Maybe there is a solution without passing an additional boolean flag through the stack? I'm asking because this sounds like the function could be split. Anyways, I'm sure you'll see how it's done better :)

BigLep commented Mar 8, 2023

FYI @guillaumemichel that @Jorropo will be your Kubo-maintainer point of contact for supporting you to get these landed.

@yiannisbot

@Jorropo @BigLep do we have an ETA for reviewing this and getting it merged? It's one of the few remaining action items from the incidents earlier in the year and one of the milestones we've committed to ship by end of 2023Q1.

BigLep commented Apr 12, 2023

@yiannisbot : this isn't making it in for 0.20 :(. It will happen after IPFS Thing for 0.21 (release in late May).

@guillaumemichel

#820 has been merged
