Project Flare aka. decentralised hole punching #21

Open: wants to merge 5 commits into base: main

146 changes: 146 additions & 0 deletions proposals/Hole Punching.md
@@ -0,0 +1,146 @@
# NAT traversal in libp2p via Hole Punching with Limited Relays

### Authors:

- @vyzo
- @aarshkshah1992
- @raulk

# Purpose & Impact

### Background

Given the pervasiveness of IPV4 peers that are behind NATs on the internet, NAT traversal is an essential requirement for a peer to peer application. The inability to traverse around NATs means that such NATT’d peers are NOT reachable on the network and are thus unable to provide any meaningful service to the network, nor interact with network participants under protocol patterns that require inbound connections (e.g. dialbacks).

Suggested change
Given the pervasiveness of IPV4 peers that are behind NATs on the internet, NAT traversal is an essential requirement for a peer to peer application. The inability to traverse around NATs means that such NATT’d peers are NOT reachable on the network and are thus unable to provide any meaningful service to the network, nor interact with network participants under protocol patterns that require inbound connections (e.g. dialbacks).
Given the pervasiveness of IPv4 peers that are behind NATs on the internet, NAT traversal is an essential requirement for a peer to peer application. The inability to traverse NATs means that such NATT’d peers are NOT reachable on the network and are thus unable to provide any meaningful service to the network, nor interact with network participants under protocol patterns that require inbound connections (e.g. dialbacks).


Libp2p currently executes NAT traversal using [Circuit Relays](https://docs.libp2p.io/concepts/circuit-relay/) wherein publicly dialable Relay servers relay the entirety of user traffic to peers that are NATT’d. This approach does NOT scale because:

1. It costs bandwidth on the Relay server.
2. There is NO real incentive to be a Relay server.
3. It introduces communication latency between the two peers that are interfacing via the Relay server.

A more scalable approach to NAT traversal is to enable direct communication between the peers via a technique called _[Hole Punching](https://en.wikipedia.org/wiki/Hole_punching_(networking))_. Hole punching removes the need to relay _all_ traffic between two peers via a Relay server. Instead, relay servers are used merely at the connection bootstrapping phase, to convey signalling between two peers intending to connect to each other, sufficient to facilitate NAT hole punching. In most cases, such traffic represents a minimal fraction of the full user payload traffic.

Hole punching **has been shown to have ~60% success for TCP & ~80% success for UDP, the transport underlying QUIC** ([MIT paper on Hole Punching](https://pdos.csail.mit.edu/papers/p2pnat.pdf)). It has been widely studied in academia and has also been widely adopted in real world large scale p2p applications ([Tailscale blog on NAT traversal](https://tailscale.com/blog/how-nat-traversal-works/)). Pervasive web protocols like WebRTC rely on ICE, which uses signalling (STUN) to facilitate hole punching.

Based on [metrics that we collected](https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&refresh=5m&from=1613523541216&to=1613534341216&var-head=All&viewPanel=50), we have discovered that:

* **~80% peers in the current DHT network are NOT dialable/reachable**.
* Relay servers funded and operated by PL aren’t enough given the size of our network, and are also expensive to run.

We posit that implementing hole punching will enable direct connectivity for a significant portion of the network, and will improve overall QoS of the network by increasing the connectivity. In combination with future efforts to strengthen WebRTC support, it will also enable browser-centric use cases to be reliably built on libp2p.
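As a rough, back-of-envelope illustration of what "a significant portion" could mean, the sketch below combines the ~80% unreachable figure with the ~60% and ~80% success rates quoted above. It assumes, purely for illustration, that the quoted success rate applies uniformly to every peer that is not dialable today; `directly_reachable_fraction` is a hypothetical helper and not part of libp2p.

```python
def directly_reachable_fraction(already_dialable: float, punch_success: float) -> float:
    """Estimate the share of peers reachable without a full relay.

    Assumption (illustrative only): every peer that is not already dialable
    can be hole punched with the same probability `punch_success`.
    """
    if not (0.0 <= already_dialable <= 1.0 and 0.0 <= punch_success <= 1.0):
        raise ValueError("fractions must be within [0, 1]")
    not_dialable = 1.0 - already_dialable
    return already_dialable + not_dialable * punch_success


if __name__ == "__main__":
    # ~20% of DHT peers are dialable today (i.e. ~80% are not).
    for transport, success in (("TCP", 0.60), ("QUIC", 0.80)):
        estimate = directly_reachable_fraction(0.20, success)
        print(f"{transport}: ~{estimate:.0%} of peers directly reachable")
```

Under that simplifying assumption the estimate comes out at roughly 68% for TCP and 84% for QUIC, up from ~20% today. Real-world numbers depend on the NAT types on both sides of each connection, so this is an upper-level intuition rather than a prediction.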

### Intent

By implementing a _[Limited Relay](https://docs.google.com/document/d/1bhoVGitiB2rr6i8cVKvwkQONHvsfBxqcK76ePVW_JVs/edit#heading=h.s248pnlqjmsm)_ protocol that ONLY provides the resources and bandwidth needed to coordinate a hole punch, instead of acting as a full-fledged data transfer Relay, we can get public DHT servers to run the Limited Relay protocol without costing them too much bandwidth or other resources. This in turn gives us a pervasive, large-scale Hole Punching coordination infrastructure that can scale hole punching to the size of the network.

Note that Hole Punching fails if either peer is behind what is known as a _[Symmetric NAT](https://dh2i.com/kbs/kbs-2961448-understanding-different-nat-types-and-hole-punching/)_ as opposed to a _[Cone NAT](https://dh2i.com/kbs/kbs-2961448-understanding-different-nat-types-and-hole-punching/)_, which is why Hole Punching does not deliver 100% success rates, but it’s much better than the alternative which is NO _direct_ connectivity for ALL NATT’d peers.

Also, based on anecdotal evidence in the wild and engineering war stories (see [Tailscale’s blog](https://tailscale.com/blog/how-nat-traversal-works/)), **Cone NATs are much more pervasive in Home ISPs than Symmetric NATs, which justifies the ~60-80% success for Hole Punching**.
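The Cone versus Symmetric distinction can be illustrated with a toy model. The sketch below is not libp2p code: `ToyNat` and `can_hole_punch` are hypothetical names, the addresses come from documentation ranges, and the model assumes port-restricted filtering on both sides. It only tracks how each NAT picks an external port, which is enough to show why an address learned via a relay is useless when the NAT allocates a new port per destination.

```python
from itertools import count


class ToyNat:
    """Toy NAT that models only how external ports are allocated."""

    def __init__(self, public_ip: str, symmetric: bool):
        self.public_ip = public_ip
        self.symmetric = symmetric
        self._next_port = count(40000)
        self._mappings = {}

    def external_addr(self, internal, destination):
        # Cone NAT: one external port per internal socket, reused for every
        # destination. Symmetric NAT: a fresh external port per destination.
        key = (internal, destination) if self.symmetric else internal
        if key not in self._mappings:
            self._mappings[key] = next(self._next_port)
        return (self.public_ip, self._mappings[key])


def can_hole_punch(nat_a: ToyNat, nat_b: ToyNat) -> bool:
    relay = ("198.51.100.1", 4001)   # hypothetical relay address
    peer_a = ("10.0.0.2", 5000)      # hypothetical internal sockets
    peer_b = ("192.168.1.2", 5000)

    # Step 1: each peer talks to the relay, which observes its external address.
    a_seen = nat_a.external_addr(peer_a, relay)
    b_seen = nat_b.external_addr(peer_b, relay)

    # Step 2: each peer sends to the address the relay reported for the other.
    a_used = nat_a.external_addr(peer_a, b_seen)
    b_used = nat_b.external_addr(peer_b, a_seen)

    # Assumed port-restricted filtering: a packet gets through only if the
    # addresses actually in use match the ones exchanged via the relay.
    return a_used == a_seen and b_used == b_seen


if __name__ == "__main__":
    for a_sym, b_sym in ((False, False), (True, False), (True, True)):
        ok = can_hole_punch(ToyNat("203.0.113.1", a_sym), ToyNat("203.0.113.2", b_sym))
        print(f"A symmetric={a_sym}, B symmetric={b_sym}: punch succeeds={ok}")
```

In this model the punch succeeds only when both NATs are Cone NATs. Real NATs vary in both mapping and filtering behaviour, so actual outcomes can differ from this simplification.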

Given that libp2p is a library to build peer to peer applications, **implementing Hole Punching in libp2p will allow any application/network that builds on top of libp2p to also benefit from Hole Punching**. This will be a significant step forward in the ease of building well connected p2p networks and will be a major added motivation for developers to build on top of our stack. Based on our research, **no such library that comes with out of the box NAT traversal via Hole Punching exists out in the wild today and we have the opportunity to provide the first such library & infra to herald the age of better connectivity in Web3 apps**.

Author

@Gozala Does HyperSwarm use decentralised signalling ? How does it signal/co-ordinate hole punching ?

I'm afraid I do not know how it goes about it. Reading the description of it (quoted below) makes me think that bootstrap nodes facilitate this.

> If your IP and port is consistent across the bootstrap nodes holepunching usually works.

We could try to ask @mafintosh

If I recall correctly, it uses webrtc ICE/STUN.

### Assumptions & hypotheses

* Applications that build on top of our stack want peers to be _directly_ reachable from the network even though they are behind a NAT (~80% peers in the current DHT network).
* PL is not willing to keep funding expensive bandwidth-unrestricted Relay servers as the network keeps growing to enable data transfer to/from NATT’d peers.
* Users would love to use our p2p stack if doing so means the applications they build get NAT traversal via Hole Punching out of the box.

Hope we are not too late to the party, but can definitely agree from the perspective of downstream consumers of libp2p-rust, that not only is hole-punchy NAT traversal the ultimate goal, but the point above about the cost of running full relays also means that, since there is no incentive to do so, such holistic systems are likely to suffer from centralisation concerns.

iotaledger/stronghold.rs#210

* Enabling better connectivity in IPFS, Filecoin and other applications that build on top of Libp2p in a world suffering from the tyranny of NATs is a success metric/priority for us.

### User workflow example

The User builds an application such as IPFS, Filecoin, Ethereum etc. on top of Libp2p and gets _direct_ connectivity between NATT’d peers (albeit ~60% for TCP & ~80% for QUIC) out of the box without bringing up additional infrastructure and with minimal configuration.

### Impact
🔥🔥🔥

* Libp2p will be one of the first libraries in the web3 ecosystem that provides Hole Punching and hence better connectivity out of the box. This will increase the functionality of our stack, and will encourage more developers to build on top of it.
* PL’s applications such as Filecoin & IPFS will get turbocharged with better connectivity as they build on top of libp2p.
* Important projects such as Eth2, 0x etc that build on top of libp2p will ALSO get this huge benefit should they choose to use it. Eth2 is likely to need this feature in Phase 2, which introduces browser-based light clients.
* New browser-centric use cases will be possible when this functionality is implemented in js-libp2p.

Could this be elaborated a bit more, I am not sure I understand what are those use cases or how hole punching would make browser nodes dialable.

Author

@aarshkshah1992 Feb 19, 2021

@Gozala

Given that JS peers can't connect to go peers directly right now, shipping the Limited Relay based decentralised signalling infra is the first step in implementing a WebRTC transport in go & js libp2p where signalling does NOT rely on centralised STAR servers.

Contributor

"JS peers can't connect to go peers directly right now"

The root reason for this is that go-ipfs does not have the websockets transport by default. Also these nodes are not usually reachable with a SSL+DNS multiaddr, which is essential for the browser.

Unless we plan to have webRTC used by default in go-ipfs, the browser limitations will not be solved.

I remember that Chrome had speed limitations for webRTC, which made websockets a faster option. @Gozala do you know if this is still the case? If this is the case, I think it would be better to focus on a solution for generating certificates instead of webRTC in go. (and of course enable websockets by default in go)

> I remember that Chrome had speed limitations for webRTC, which made websockets a faster option. @Gozala do you know if this is still the case? If this is the case, I think it would be better to focus on a solution for generating certificates instead of webRTC in go. (and of course enable websockets by default in go)

I do not know what speed limitations chrome or other browsers put on WebRTC, but there are problems with WebRTC beyond that:

  1. WebRTC is not implemented in worker threads.
     - This makes building responsive apps a lot more challenging given all the other work that is happening on the main thread.
     - It prevents sharing an IPFS node across contexts like tabs or iframes.
  2. I have anecdotal reports that in practice WebRTC is impractical without TURN servers, suggesting that oftentimes data is relayed anyway.
  3. Again anecdotal: evidence suggests that WebRTC seems to cause significant CPU load.

All the above combined often leads teams to pivot towards WebSocket based solutions. I also heard teams reporting reduced bills when operating WebSocket based relay as opposed to TURN servers.

Please take all this with a grain of salt, because I have not seen any comprehensive studies to support anecdotal evidence.


**Summary: The idea of a peer to peer library that makes NAT traversal via Hole Punching easy and pervasive is a very important & exciting development in the world of peer to peer applications.**

I would go as far as to claim that there is a general assumption that a p2p networking library would come with NAT traversal & Hole Punching built-in.

True. We were surprised when it didn't come delivered "out of the box". In fact, initially we built a publicly reachable "mailbox" solution that leveraged polling. Obviously terrible at scale for both transit and storage.


### Leverage

_How much would nailing this project improve our knowledge and ability to execute future projects?_

🎯🎯

I think higher score is warranted here

Suggested change
🎯🎯
🎯🎯🎯


Lack of dialability between peers has been a repeated protocol-level blocker to achieving seamless interaction, in libp2p DHT, pubsub, bitswap, Filecoin deals, and more.

### Confidence

**Medium.**

* The [MIT paper](https://pdos.csail.mit.edu/papers/p2pnat.pdf) that studied Hole Punching reports ~60% success for TCP and ~80% success for UDP, the transport QUIC runs on. [Tailscale](https://tailscale.com/), a large-scale VPN that [implemented NAT traversal](https://tailscale.com/blog/how-nat-traversal-works/) via Hole Punching, also achieved similar results. This is because of the pervasiveness of Cone NATs (the good NATs) among Home ISPs.
* Both IPFS and Filecoin currently suffer in terms of connectivity as they do NOT have Hole Punching and based on the metrics we collected, ~80% peers on the DHT network are NOT reachable/dialable.
* We already have a working PoC for hole punching and based on the limited testing we’ve done so far, we are able to successfully hole punch and connect Cone NATT’d peers that would otherwise be unreachable.

**Confidence can be reassessed after the dog-fooding and alpha testing phases are complete within a community of Labbers that use Home ISPs and have NATT’d machines. Refer to the Project Definition section for details.**

# Project Definition

### Brief plan of attack

* Implement the _[Limited Relay](https://docs.google.com/document/d/1bhoVGitiB2rr6i8cVKvwkQONHvsfBxqcK76ePVW_JVs/edit#heading=h.s248pnlqjmsm)_ protocol in Libp2p.
* Implement the _[Hole Punching protocol](https://github.com/libp2p/specs/pull/173)_ to achieve direct connectivity with a NATT’d peer after coordinating a hole punch.
* Dogfood an alpha version of Hole Punching to Labbers and collect metrics regarding success rates and resource consumption.
* Based on the results of the dog-fooding endeavour, fix/optimize the hole punching process and release it to the world in a phased manner.
* In the initial phase, ONLY use statically configured Limited Relay servers hosted by PL.
* Once we conclude that PL hosted Limited Relays are stable, ship a release that turns on the Limited Relay protocol in public DHT servers but continues to use the statically configured Limited Relays.
* Once ~30% of public DHT servers upgrade to support the Limited Relay protocol (measure using Hydra Boosters), ship automated discovery & use of Limited Relays to coordinate a hole-punch rather than using statically configured Limited Relay servers.
* Achieve ~90% hole punching success if both peers are behind a Cone NAT in the second dog-fooding phase that uses AutoRelay to discover and connect to Limited Relays in the wild.
* Massive focus on usability, user education and evangelizing the feature: blog posts covering the “how we got here” story, internals and engineering journey of Hole Punching in libp2p (with metrics), plus blog posts and demo videos on how to configure, use and debug Hole Punching.
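The ~30% adoption gate in the plan above could be checked along the following lines. This is a minimal sketch under stated assumptions: the crawl data shape (one set of supported protocol IDs per public DHT server), the helper names and the `LIMITED_RELAY_PROTOCOL` string are all hypothetical placeholders, not the actual Hydra Booster output or the real Limited Relay protocol ID.

```python
from typing import Iterable, Set

# Hypothetical protocol ID used only for this sketch; the real Limited Relay
# protocol ID is defined by its spec, not here.
LIMITED_RELAY_PROTOCOL = "/example/limited-relay/1.0.0"
ROLLOUT_THRESHOLD = 0.30  # "~30% of public DHT servers" from the plan above


def limited_relay_adoption(server_protocols: Iterable[Set[str]]) -> float:
    """Return the fraction of crawled DHT servers advertising Limited Relay."""
    servers = list(server_protocols)
    if not servers:
        return 0.0
    supporting = sum(1 for protocols in servers if LIMITED_RELAY_PROTOCOL in protocols)
    return supporting / len(servers)


def ready_for_automated_discovery(server_protocols: Iterable[Set[str]]) -> bool:
    """True once adoption reaches the threshold that gates automated discovery."""
    return limited_relay_adoption(server_protocols) >= ROLLOUT_THRESHOLD


if __name__ == "__main__":
    # Made-up crawl snapshot: 4 of 10 servers advertise the placeholder protocol.
    crawl = [{"/ipfs/kad/1.0.0", LIMITED_RELAY_PROTOCOL}] * 4 + [{"/ipfs/kad/1.0.0"}] * 6
    print(f"adoption: {limited_relay_adoption(crawl):.0%}")
    print(f"ship automated discovery: {ready_for_automated_discovery(crawl)}")
```

Keeping the threshold as a single named constant makes it easy to revisit the ~30% figure after the dog-fooding phases.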

### Technical deliverables

* **Phase 1**
  - ~90% hole punching success if both peers are behind a Cone NAT in the first dog-fooding phase using PL hosted Limited Relays.
* **Phase 2**
  - Optimise based on Dogfooding results and metrics and ship the feature using statically configured PL hosted Limited Relays.
* **Phase 3**
  - Once we conclude that PL hosted Limited Relays are stable, ship a release that turns on the Limited Relay protocol in public DHT servers.
* **Phase 4**
  - Once ~30% of public DHT servers upgrade to support the Limited Relay protocol (measure using Hydra Boosters), ship automated discovery & use of Limited Relays to coordinate a hole-punch rather than using statically configured Limited Relay servers.
* **Phase 5**
  - ~90% hole punching success if both peers are behind a Cone NAT in the second dog-fooding phase that uses AutoRelay to discover and connect to Limited Relays in the wild rather than using statically configured Limited Relays.

Comment on lines +95 to +106

Contributor

Provided some private feedback on this wording, but just to summarise, this roadmap reads like a mixture of goals and tasks. It's not entirely clear what needs to be done for each milestone to be done. Consider wording these as task lists, or definition of done (deliverables), but not as a mixture of abstract goals and high-level tasks.


### Success criteria

* Dog-fooding phases deliver ~90% success for labbers using Home ISPs with Cone NATs (TCP & QUIC).
* No bugs related to Hole Punching failures if both peers involved in a Hole Punch are behind a Cone NAT (we have good PRs for, and will ship, code/tools for users to detect their NAT type).
* Users do not file bug reports about their public DHT peers getting DDoSed or consuming too much bandwidth/resources because of acting as Limited Relays.
* We receive great traction and feedback on the ease of use and robustness of Hole Punching on channels such as Twitter, user surveys and from our community of users/partners.

Comment on lines +108 to +113

Contributor

We should fold the success criteria into each milestone.


I would suggest adding a criterion such as "IPFS node is able to tell if it can be dialed (hole punched to), combined with some troubleshooting interface that can guide a node operator in terms of what to do to make node reachable"

### Long Term Success criteria (~ 6 months in as network will take time to upgrade)

* We are able to remove static configuration of PL hosted Limited Relays in favour of discovering Limited Relays. Better still, we can completely shut them down without causing disruptions to the Hole Punching abilities of the network (disruptions can be measured using Github issues, bug reports and PL hosted Cone NATT’d nodes that scrape the network for unreachable peers with Cone NATs, attempt Hole Punching with them and report corresponding metrics).

### Counterpoints & pre-mortem

_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

### Alternatives

Maybe implementing a WebRTC transport in go-libp2p that performs signalling/co-ordination via Limited Relay servers can help solve the connectivity problems that hole punching seeks to address but that means that we get tied to using WebRTC as a transport. Compared to that, implementing hole punching as a first class feature in Libp2p makes the whole feature transport agnostic.

Just to raise another point here, webrtc is not available in all webviews, and even though there is promising work being done on the rust front with eg https://webrtc.rs - projects that build atop of Tauri are very likely to prefer to use a wss connection.

Similarly, there are privacy concerns since not everyone will run their own STUN services and then probably use something like a public google service.

See: tauri-apps/wry#85


### Dependencies/prerequisites

None.

### Future opportunities

Better connectivity via Hole Punching in the IPFS & Filecoin networks.

# Required Resources

### Estimated Scope

* 4-5 weeks i.e. **Medium for a team of 2 people** AND with some help from the Dev Onboarding Team for landing some kickass documentation/user education.
* The main uncertainty is reviews NOT getting completed on time, as we would ideally like the important aspects of our work (Limited Relays, Hole Punching Protocol, AutoRelay changes and Hydra Booster changes) to be reviewed by the Project Captain and/or someone from the Stewards team.

### Roles / skills needed

* libp2p engineers.
* Infrastructure engineers to deploy Limited Relays, metrics collection using Hydra Boosters, and deploy Cone NATT’d crawler nodes with real-time error reporting.
* Docs and User onboarding team to help ship blog posts, user education documentation and demo videos explaining the feature in depth and explaining how to configure, use and debug Hole Punching.