Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(iroh-net): Improve direct connectivity establishment speed and reliablity #1984

Merged
merged 49 commits into from
Feb 15, 2024

Conversation

flub
Copy link
Contributor

@flub flub commented Jan 31, 2024

Description

This includes various fixes and improvements in holepunching behaviour,
improving both how fast a direct connection is established and how
reliably it is established.

  • Improve sending of call-me-maybe: this sometimes wasn't sent if
    there were no known addresses for an endpoint because no pings
    were sent and there was an active connection on via the derp
    path. Instead we now send it when needed, regardless of pings
    or active connections. To avoid sending it all the time when
    no direct connection can be established we only send it once
    every heartbeat cycle, i.e. every 5s. In the worst case (like
    when UDP pings get lost) this means a call-me-maybe will be sent
    after just under 2 heartbeat intervals, and should be regular
    every heartbeat interval afterwards. So in the face of packet
    loss this will still establish a connection eventually.

  • Not every scenario will result in a direct connection in both
    directions after each endpoint has sent one call-me-maybe
    request, because direct connections are asymmetric and each
    side has to receive their own disco pong before it will use
    the direct connection. To alleviate this and speed up the
    direct connection on the return path in cases were only one
    side manages to establish the direct path, this now also sends
    a ping message upon receipt of a ping. This opens the direct
    path much quicker in both directions.

  • When queuing a call-me-maybe and the last netcheck report is
    deemed too old, the call-me-maybe is not immediately sent.
    Instead a new netcheck is kicked off and the queued
    call-me-maybes should be sent once the report is in, so that
    they are sent with accurate local addresses. However they
    were never sent after the report finished. This is now fixed
    and they are sent after the report is ready.

  • When a ping timeout occurs, we were clearing the best_addr, i.e.
    the selected direct path, if the ping was for the same path as
    currently used. However that could have been for a timeout of
    an earlier ping that was still stopped by a firewall or NAT. It
    happens that a later ping is sent and receives a pong, before the
    timeout of that earlier ping. In that case we should not clear
    the direct path in use.

  • We used to prune addresses even if they were only just received from a
    call-me-maybe. This did not allow us to ping all the addresses from a
    call-me-maybe breaking hole punching.

    We take two measures:

    • Do not prune messages that were recently "alive", which means they
      received a ping, pong or call-me-maybe recently.

    • We increase the number of allowed inactive addresses. They used to
      be 100. If someone switches between several networks regularly, we
      would like to still know those previous networks as they might start
      working again.

  • When looking for a destination address for an endpoint, the
    addr_for_send() function uses un-confirmed UDP address if available.
    However if we were given an IPv6 address (probably via a ticket,
    effectively via add_node_addr()) we should only select the IPv6
    address if netcheck repored we have working IPv6 ourself.

  • The netcheck report was wrongly overwriting whether an IP
    family was usable on dual stack hosts. Resulting in the magic
    socket being confused about which connectivity it had.

This also begins to clean up some terminology: "endpoint" is used for
many different things. Here the term "path" is consistently used
when referring to a specific destination address of a node, whenter
over the derper or a direct UDP socket. This naming cleanup however
is not complete as that would obscure the fixes too much. More to
be done later.

Closes #1955

Notes & open questions

This also changes the first address of our QuicMappedAddresses to
start at 1. Not that 0 is reserved, but it is rather odd for people
who are used to IPv4 to see 0 being a real node. And it's not like
you're short of addresses in IPv6.

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

dignifiedquire and others added 10 commits January 29, 2024 13:20
endpoint.close() is too agressive to actually get the behaviour being checked for
There is a bug that an accepting node does not send a
call-me-maybe to the connecting node for a long-long time.  Then the
connecting node never knows which direct addresses to try.

The accepting node should send call-me-maybe as soon as it is trying
to send something.  However the exact condition in this WIP code might
not be correct yet.
@ramfox ramfox added bug Something isn't working c-iroh-net labels Feb 6, 2024
@ramfox ramfox added this to the v0.13.0 milestone Feb 6, 2024
the logic is the same, the writing expressed simpler which helps
understanding what is going on.
The decision on whether to send pings was taken in two places which
were both replicating the same logic and had to agree in the right
way.  This logic could be taken entirely in .want_full_ping() which
leaves the address selection logic just concerned about address
selection.  Much simpler to reason about without duplicate logic.
@flub
Copy link
Contributor Author

flub commented Feb 7, 2024

Currently there are two main things left:

  • Call-me-maybe messages are sent too often. Resulting in too many pings. The issue is that they need to be sent whenever we have no info about the endpoint, because otherwise we can never learn about the other endpoint. I'm thinking of adding an exponential backoff for them: initially send call-me-maybe every 0.5-1 second, then as connections do not get established slow it down up to every 30ish seconds.

  • Old ping timeouts reset the best-addr state. This is kind of papered over with the assign_best_addr_from_candidates_if_empty function, but I think it should not be cleared in the first place.

@dignifiedquire
Copy link
Contributor

  • Old ping timeouts reset the best-addr state.

they should only clear if no newer information has been attained since they were sent

@dignifiedquire
Copy link
Contributor

  • I'm thinking of adding an exponential backoff for them

sounds good

@flub
Copy link
Contributor Author

flub commented Feb 7, 2024

  • I'm thinking of adding an exponential backoff for them

sounds good

ok, glad you seem to agree because i think this is the main thing to make call-me-maybe interactions work well and the biggest change.

@flub flub marked this pull request as ready for review February 13, 2024 15:01
@flub flub requested a review from ramfox February 13, 2024 16:19
We used to prune addresses even if they were only just received from a
call-me-maybe.  This did not allow us to ping all the addresses from a
call-me-maybe breaking hole punching.

We take two measures:

- Do not prune messages that were recently "alive", which means they
  received a ping, pong or call-me-maybe recently.

- We increase the number of allowed inactive addresses.  They used to
  be 100.  If someone switches between several networks regularly, we
  would like to still know those previous networks as they might start
  working again.
Starting from 0 is just weird, even though it is just fine reading
this address in the logfiles is confusing.
@flub flub changed the title fix(iroh-net) Improve direct connectivity establishment speed and reliablity fix(iroh-net): Improve direct connectivity establishment speed and reliablity Feb 14, 2024
@@ -478,6 +478,7 @@ fn run_cli(

#[cfg(feature = "cli")]
#[test]
#[ignore = "flaky"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( do we have an issue for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1935

All these tests are fundamentally not reliable sadly.

@flub flub enabled auto-merge February 15, 2024 16:17
@flub flub added this pull request to the merge queue Feb 15, 2024
Merged via the queue into main with commit b173520 Feb 15, 2024
19 checks passed
@flub flub deleted the flub/call-me-sooner branch February 15, 2024 16:34
fubuloubu pushed a commit to ApeWorX/iroh that referenced this pull request Feb 21, 2024
## Description

This was essentially re-implementing the Rust Debug trait.  Now that
DerpUrl has a nice Debug log there really is no need to manually
implement this, instead we rely on Debug.


## Notes & open questions

This is splitting off some peripheral stuff from
n0-computer#1984

New flaky test marked: n0-computer#2010
and n0-computer#1935

## Change checklist

- [x] Self-review.
fubuloubu pushed a commit to ApeWorX/iroh that referenced this pull request Feb 21, 2024
## Description

This allows you to write log output to a specific filedescriptor in
the shell: iroh --log-fd 3 ... 3>some/file.log

Thus not interfering with normal stdout and stderr output and getting
the full functionality of this.

## Notes & open questions

This is splitting off some peripheral stuff from
n0-computer#1984

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
fubuloubu pushed a commit to ApeWorX/iroh that referenced this pull request Feb 21, 2024
## Description

This reads a bit more natural.  It does not affect the on-the-wire
format.

## Notes & open questions

Split off from n0-computer#1984

## Change checklist

- [x] Self-review.
fubuloubu pushed a commit to ApeWorX/iroh that referenced this pull request Feb 21, 2024
…liablity (n0-computer#1984)

## Description

This includes various fixes and improvements in holepunching behaviour,
improving both how fast a direct connection is established and how
reliably it is established.

- Improve sending of call-me-maybe: this sometimes wasn't sent if
  there were no known addresses for an endpoint because no pings
  were sent and there was an active connection on via the derp
  path.  Instead we now send it when needed, regardless of pings
  or active connections.  To avoid sending it all the time when
  no direct connection can be established we only send it once
  every heartbeat cycle, i.e. every 5s.  In the worst case (like
  when UDP pings get lost) this means a call-me-maybe will be sent
  after just under 2 heartbeat intervals, and should be regular
  every heartbeat interval afterwards.  So in the face of packet
  loss this will still establish a connection eventually.

- Not every scenario will result in a direct connection in both
  directions after each endpoint has sent one call-me-maybe
  request, because direct connections are asymmetric and each
  side has to receive their own disco pong before it will use
  the direct connection.  To alleviate this and speed up the
  direct connection on the return path in cases were only one
  side manages to establish the direct path, this now also sends
  a ping message upon receipt of a ping.  This opens the direct
  path much quicker in both directions.

- When queuing a call-me-maybe and the last netcheck report is
  deemed too old, the call-me-maybe is not immediately sent.
  Instead a new netcheck is kicked off and the queued
  call-me-maybes should be sent once the report is in, so that
  they are sent with accurate local addresses.  However they
  were never sent after the report finished.  This is now fixed
  and they are sent after the report is ready.

- When a ping timeout occurs, we were clearing the best_addr, i.e.
  the selected direct path, if the ping was for the same path as
  currently used.  However that could have been for a timeout of
  an earlier ping that was still stopped by a firewall or NAT.  It
  happens that a later ping is sent and receives a pong, before the
  timeout of that earlier ping.  In that case we should not clear
  the direct path in use.

- We used to prune addresses even if they were only just received from a
  call-me-maybe.  This did not allow us to ping all the addresses from a
  call-me-maybe breaking hole punching.

  We take two measures:

  - Do not prune messages that were recently "alive", which means they
    received a ping, pong or call-me-maybe recently.

  - We increase the number of allowed inactive addresses.  They used to
    be 100.  If someone switches between several networks regularly, we
    would like to still know those previous networks as they might start
    working again.


- When looking for a destination address for an endpoint, the
  addr_for_send() function uses un-confirmed UDP address if available.
  However if we were given an IPv6 address (probably via a ticket,
  effectively via add_node_addr()) we should only select the IPv6
  address if netcheck repored we have working IPv6 ourself.

- The netcheck report was wrongly overwriting whether an IP
  family was usable on dual stack hosts.  Resulting in the magic
  socket being confused about which connectivity it had.

This also begins to clean up some terminology: "endpoint" is used for
many different things.  Here the term "path" is consistently used
when referring to a specific destination address of a node, whenter
over the derper or a direct UDP socket.  This naming cleanup however
is not complete as that would obscure the fixes too much.  More to
be done later.

Closes n0-computer#1955 

## Notes & open questions

This also changes the first address of our QuicMappedAddresses to
start at 1.  Not that 0 is reserved, but it is rather odd for people
who are used to IPv4 to see 0 being a real node.  And it's not like
you're short of addresses in IPv6.

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

---------

Co-authored-by: dignifiedquire <[email protected]>
rklaehn pushed a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
## Description

This was essentially re-implementing the Rust Debug trait.  Now that
DerpUrl has a nice Debug log there really is no need to manually
implement this, instead we rely on Debug.


## Notes & open questions

This is splitting off some peripheral stuff from
n0-computer/iroh#1984

New flaky test marked: n0-computer/iroh#2010
and #1935

## Change checklist

- [x] Self-review.
rklaehn pushed a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
## Description

This allows you to write log output to a specific filedescriptor in
the shell: iroh --log-fd 3 ... 3>some/file.log

Thus not interfering with normal stdout and stderr output and getting
the full functionality of this.

## Notes & open questions

This is splitting off some peripheral stuff from
n0-computer/iroh#1984

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
dignifiedquire pushed a commit to n0-computer/iroh-gossip that referenced this pull request Oct 23, 2024
## Description

This was essentially re-implementing the Rust Debug trait.  Now that
DerpUrl has a nice Debug log there really is no need to manually
implement this, instead we rely on Debug.


## Notes & open questions

This is splitting off some peripheral stuff from
n0-computer/iroh#1984

New flaky test marked: n0-computer/iroh#2010
and #1935

## Change checklist

- [x] Self-review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c-iroh-net
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

direct connection not chosen for dumbpipe
3 participants