-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(iroh-net): Improve direct connectivity establishment speed and reliablity #1984
Conversation
endpoint.close() is too agressive to actually get the behaviour being checked for
There is a bug that an accepting node does not send a call-me-maybe to the connecting node for a long-long time. Then the connecting node never knows which direct addresses to try. The accepting node should send call-me-maybe as soon as it is trying to send something. However the exact condition in this WIP code might not be correct yet.
the logic is the same, the writing expressed simpler which helps understanding what is going on.
The decision on whether to send pings was taken in two places which were both replicating the same logic and had to agree in the right way. This logic could be taken entirely in .want_full_ping() which leaves the address selection logic just concerned about address selection. Much simpler to reason about without duplicate logic.
Currently there are two main things left:
|
they should only clear if no newer information has been attained since they were sent |
sounds good |
ok, glad you seem to agree because i think this is the main thing to make call-me-maybe interactions work well and the biggest change. |
We used to prune addresses even if they were only just received from a call-me-maybe. This did not allow us to ping all the addresses from a call-me-maybe breaking hole punching. We take two measures: - Do not prune messages that were recently "alive", which means they received a ping, pong or call-me-maybe recently. - We increase the number of allowed inactive addresses. They used to be 100. If someone switches between several networks regularly, we would like to still know those previous networks as they might start working again.
Starting from 0 is just weird, even though it is just fine reading this address in the logfiles is confusing.
@@ -478,6 +478,7 @@ fn run_cli( | |||
|
|||
#[cfg(feature = "cli")] | |||
#[test] | |||
#[ignore = "flaky"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:( do we have an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these tests are fundamentally not reliable sadly.
## Description This was essentially re-implementing the Rust Debug trait. Now that DerpUrl has a nice Debug log there really is no need to manually implement this, instead we rely on Debug. ## Notes & open questions This is splitting off some peripheral stuff from n0-computer#1984 New flaky test marked: n0-computer#2010 and n0-computer#1935 ## Change checklist - [x] Self-review.
## Description This allows you to write log output to a specific filedescriptor in the shell: iroh --log-fd 3 ... 3>some/file.log Thus not interfering with normal stdout and stderr output and getting the full functionality of this. ## Notes & open questions This is splitting off some peripheral stuff from n0-computer#1984 ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant.
## Description This reads a bit more natural. It does not affect the on-the-wire format. ## Notes & open questions Split off from n0-computer#1984 ## Change checklist - [x] Self-review.
…liablity (n0-computer#1984) ## Description This includes various fixes and improvements in holepunching behaviour, improving both how fast a direct connection is established and how reliably it is established. - Improve sending of call-me-maybe: this sometimes wasn't sent if there were no known addresses for an endpoint because no pings were sent and there was an active connection on via the derp path. Instead we now send it when needed, regardless of pings or active connections. To avoid sending it all the time when no direct connection can be established we only send it once every heartbeat cycle, i.e. every 5s. In the worst case (like when UDP pings get lost) this means a call-me-maybe will be sent after just under 2 heartbeat intervals, and should be regular every heartbeat interval afterwards. So in the face of packet loss this will still establish a connection eventually. - Not every scenario will result in a direct connection in both directions after each endpoint has sent one call-me-maybe request, because direct connections are asymmetric and each side has to receive their own disco pong before it will use the direct connection. To alleviate this and speed up the direct connection on the return path in cases were only one side manages to establish the direct path, this now also sends a ping message upon receipt of a ping. This opens the direct path much quicker in both directions. - When queuing a call-me-maybe and the last netcheck report is deemed too old, the call-me-maybe is not immediately sent. Instead a new netcheck is kicked off and the queued call-me-maybes should be sent once the report is in, so that they are sent with accurate local addresses. However they were never sent after the report finished. This is now fixed and they are sent after the report is ready. - When a ping timeout occurs, we were clearing the best_addr, i.e. the selected direct path, if the ping was for the same path as currently used. However that could have been for a timeout of an earlier ping that was still stopped by a firewall or NAT. It happens that a later ping is sent and receives a pong, before the timeout of that earlier ping. In that case we should not clear the direct path in use. - We used to prune addresses even if they were only just received from a call-me-maybe. This did not allow us to ping all the addresses from a call-me-maybe breaking hole punching. We take two measures: - Do not prune messages that were recently "alive", which means they received a ping, pong or call-me-maybe recently. - We increase the number of allowed inactive addresses. They used to be 100. If someone switches between several networks regularly, we would like to still know those previous networks as they might start working again. - When looking for a destination address for an endpoint, the addr_for_send() function uses un-confirmed UDP address if available. However if we were given an IPv6 address (probably via a ticket, effectively via add_node_addr()) we should only select the IPv6 address if netcheck repored we have working IPv6 ourself. - The netcheck report was wrongly overwriting whether an IP family was usable on dual stack hosts. Resulting in the magic socket being confused about which connectivity it had. This also begins to clean up some terminology: "endpoint" is used for many different things. Here the term "path" is consistently used when referring to a specific destination address of a node, whenter over the derper or a direct UDP socket. This naming cleanup however is not complete as that would obscure the fixes too much. More to be done later. Closes n0-computer#1955 ## Notes & open questions This also changes the first address of our QuicMappedAddresses to start at 1. Not that 0 is reserved, but it is rather odd for people who are used to IPv4 to see 0 being a real node. And it's not like you're short of addresses in IPv6. ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. --------- Co-authored-by: dignifiedquire <[email protected]>
## Description This was essentially re-implementing the Rust Debug trait. Now that DerpUrl has a nice Debug log there really is no need to manually implement this, instead we rely on Debug. ## Notes & open questions This is splitting off some peripheral stuff from n0-computer/iroh#1984 New flaky test marked: n0-computer/iroh#2010 and #1935 ## Change checklist - [x] Self-review.
## Description This allows you to write log output to a specific filedescriptor in the shell: iroh --log-fd 3 ... 3>some/file.log Thus not interfering with normal stdout and stderr output and getting the full functionality of this. ## Notes & open questions This is splitting off some peripheral stuff from n0-computer/iroh#1984 ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant.
## Description This was essentially re-implementing the Rust Debug trait. Now that DerpUrl has a nice Debug log there really is no need to manually implement this, instead we rely on Debug. ## Notes & open questions This is splitting off some peripheral stuff from n0-computer/iroh#1984 New flaky test marked: n0-computer/iroh#2010 and #1935 ## Change checklist - [x] Self-review.
Description
This includes various fixes and improvements in holepunching behaviour,
improving both how fast a direct connection is established and how
reliably it is established.
Improve sending of call-me-maybe: this sometimes wasn't sent if
there were no known addresses for an endpoint because no pings
were sent and there was an active connection on via the derp
path. Instead we now send it when needed, regardless of pings
or active connections. To avoid sending it all the time when
no direct connection can be established we only send it once
every heartbeat cycle, i.e. every 5s. In the worst case (like
when UDP pings get lost) this means a call-me-maybe will be sent
after just under 2 heartbeat intervals, and should be regular
every heartbeat interval afterwards. So in the face of packet
loss this will still establish a connection eventually.
Not every scenario will result in a direct connection in both
directions after each endpoint has sent one call-me-maybe
request, because direct connections are asymmetric and each
side has to receive their own disco pong before it will use
the direct connection. To alleviate this and speed up the
direct connection on the return path in cases were only one
side manages to establish the direct path, this now also sends
a ping message upon receipt of a ping. This opens the direct
path much quicker in both directions.
When queuing a call-me-maybe and the last netcheck report is
deemed too old, the call-me-maybe is not immediately sent.
Instead a new netcheck is kicked off and the queued
call-me-maybes should be sent once the report is in, so that
they are sent with accurate local addresses. However they
were never sent after the report finished. This is now fixed
and they are sent after the report is ready.
When a ping timeout occurs, we were clearing the best_addr, i.e.
the selected direct path, if the ping was for the same path as
currently used. However that could have been for a timeout of
an earlier ping that was still stopped by a firewall or NAT. It
happens that a later ping is sent and receives a pong, before the
timeout of that earlier ping. In that case we should not clear
the direct path in use.
We used to prune addresses even if they were only just received from a
call-me-maybe. This did not allow us to ping all the addresses from a
call-me-maybe breaking hole punching.
We take two measures:
Do not prune messages that were recently "alive", which means they
received a ping, pong or call-me-maybe recently.
We increase the number of allowed inactive addresses. They used to
be 100. If someone switches between several networks regularly, we
would like to still know those previous networks as they might start
working again.
When looking for a destination address for an endpoint, the
addr_for_send() function uses un-confirmed UDP address if available.
However if we were given an IPv6 address (probably via a ticket,
effectively via add_node_addr()) we should only select the IPv6
address if netcheck repored we have working IPv6 ourself.
The netcheck report was wrongly overwriting whether an IP
family was usable on dual stack hosts. Resulting in the magic
socket being confused about which connectivity it had.
This also begins to clean up some terminology: "endpoint" is used for
many different things. Here the term "path" is consistently used
when referring to a specific destination address of a node, whenter
over the derper or a direct UDP socket. This naming cleanup however
is not complete as that would obscure the fixes too much. More to
be done later.
Closes #1955
Notes & open questions
This also changes the first address of our QuicMappedAddresses to
start at 1. Not that 0 is reserved, but it is rather odd for people
who are used to IPv4 to see 0 being a real node. And it's not like
you're short of addresses in IPv6.
Change checklist