fix: improve connectivity #1128

Merged: dignifiedquire merged 33 commits into main from fix-odd-issues on Jun 27, 2023

Contributor

@dignifiedquire dignifiedquire commented Jun 22, 2023

  • Make sure to send a full ping (and call me maybe) when doing the initial ping.
  • Handle disco messages when they arrive over derp. Previously they failed to be extracted because of the packet combinations and the prefixing.

Should improve #1098
Fixes #1084

@dignifiedquire
Contributor Author

/netsim

ramfox
ramfox previously approved these changes Jun 22, 2023
Contributor

@ramfox ramfox left a comment

Small doc nits, otherwise 🔥

iroh-net/src/hp/magicsock/endpoint.rs (outdated, resolved)
iroh-net/src/hp/magicsock/udp_actor.rs (outdated, resolved)
@ramfox
Contributor

ramfox commented Jun 22, 2023

question tho, what are these ubuntu cli failures?

@github-actions

fix-odd-issues.6ca98dea9406a46e43b9d47188f6874fb0a827eb
Perf report:

test                 case      throughput_gbps  throughput_transfer
iroh_latency_20ms    1_to_1    1.23             2.59
iroh_latency_20ms    1_to_3    3.98             8.17
iroh_latency_20ms    1_to_5    7.14             13.94
iroh_latency_20ms    1_to_10   12.70            24.64
iroh_latency_20ms    2_to_2    3.11             6.19
iroh_latency_20ms    2_to_4    5.61             11.67
iroh_latency_20ms    2_to_6    8.33             17.23
iroh_latency_20ms    2_to_10   12.62            25.20
iroh                 1_to_1    1.50             2.87
iroh                 1_to_3    4.40             9.32
iroh                 1_to_5    6.69             12.74
iroh                 1_to_10   12.99            25.12
iroh                 2_to_2    2.36             4.72
iroh                 2_to_4    5.64             11.95
iroh                 2_to_6    8.48             17.03
iroh                 2_to_10   13.14            25.47
iroh_latency_200ms   1_to_1    1.29             2.95
iroh_latency_200ms   1_to_3    4.09             7.77
iroh_latency_200ms   1_to_5    6.08             12.86
iroh_latency_200ms   1_to_10   12.69            24.46
iroh_latency_200ms   2_to_2    2.79             5.82
iroh_latency_200ms   2_to_4    5.67             12.00
iroh_latency_200ms   2_to_6    8.37             17.63
iroh_latency_200ms   2_to_10   12.42            23.76

@Arqu
Collaborator

Arqu commented Jun 23, 2023

FWIW this still fails in netsim for the NAT tests.

@Arqu
Collaborator

Arqu commented Jun 23, 2023

You can get tests on your PR running if you modify .github/workflows/ci.yml L334 from

sudo python3 main.py --integration --skip intg_derper__1_to_1_NAT_provide,intg_derper__1_to_1_NAT_both sims/integration

into

sudo python3 main.py --integration sims/integration

It should error out on those tests until they pass.

@Arqu
Collaborator

Arqu commented Jun 23, 2023

@dignifiedquire you did the wrong one: netsim.yml only runs on main for final numbers, while ci.yml runs the PR tests.

@Arqu
Collaborator

Arqu commented Jun 23, 2023

It's fine to keep it though if we're going to fix nat stuff in this PR

@dignifiedquire
Contributor Author

dignifiedquire commented Jun 23, 2023

> It's fine to keep it though if we're going to fix nat stuff in this PR

That's the plan. It is fixed in my home test setup, where both parties are behind different NATs.

@Arqu
Collaborator

Arqu commented Jun 23, 2023

When running manually on my end, it seems to do more now, but it still fails to connect to the provider.

Comment on lines 184 to 185
self.trust_best_addr_until
    .replace(*now + Duration::from_secs(60 * 60));
Contributor

Maybe trust from the last_ping time rather than from now? And if that time is more than an hour old, the address is no longer a best_addr candidate?

Contributor Author

fixed
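
For readers following the thread, the suggestion above could be sketched roughly as below. This is only an illustration, not the code that landed in the PR: `TRUST_DURATION`, `trust_until`, and `is_best_addr_candidate` are hypothetical names, and the one-hour window is taken from the `60 * 60` in the quoted lines.

```rust
use std::time::{Duration, Instant};

/// Trust window of one hour, matching the `60 * 60` in the quoted lines.
/// (Hypothetical constant name.)
const TRUST_DURATION: Duration = Duration::from_secs(60 * 60);

/// Hypothetical helper: the trust deadline is counted from the last ping
/// we saw on this address, not from the current time.
fn trust_until(last_ping: Instant) -> Instant {
    last_ping + TRUST_DURATION
}

/// Hypothetical helper: an address stays a best_addr candidate only while
/// its last ping is less than an hour old. No ping at all means no candidate.
fn is_best_addr_candidate(last_ping: Option<Instant>, now: Instant) -> bool {
    match last_ping {
        Some(t) => now < trust_until(t),
        None => false,
    }
}
```

Anchoring the deadline to the last ping means an address that has been silent for a while cannot have its trust extended just because the check happens to run again.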

@dignifiedquire
Contributor Author

/netsim

@github-actions

fix-odd-issues.e15da277a27c5e1090094323eb9d1bef4263370a
Perf report:

test                 case      throughput_gbps  throughput_transfer
iroh_latency_20ms    1_to_1    1.47             2.71
iroh_latency_20ms    1_to_3    4.34             7.96
iroh_latency_20ms    1_to_5    7.30             13.54
iroh_latency_20ms    1_to_10   13.94            24.93
iroh_latency_20ms    2_to_2    2.86             5.17
iroh_latency_20ms    2_to_4    6.00             11.47
iroh_latency_20ms    2_to_6    8.93             17.00
iroh_latency_20ms    2_to_10   14.68            27.61
iroh                 1_to_1    1.49             2.82
iroh                 1_to_3    4.46             8.47
iroh                 1_to_5    7.38             13.94
iroh                 1_to_10   14.14            25.62
iroh                 2_to_2    3.14             6.40
iroh                 2_to_4    5.97             11.35
iroh                 2_to_6    9.09             17.62
iroh                 2_to_10   14.41            27.11
iroh_latency_200ms   1_to_1    1.53             2.99
iroh_latency_200ms   1_to_3    4.40             8.20
iroh_latency_200ms   1_to_5    7.32             13.63
iroh_latency_200ms   1_to_10   14.63            27.44
iroh_latency_200ms   2_to_2    3.12             6.27
iroh_latency_200ms   2_to_4    5.77             10.55
iroh_latency_200ms   2_to_6    9.05             17.64
iroh_latency_200ms   2_to_10   14.28            26.59

- send full pings on best_addr ping if needed
- only store endpoints with known keys
@dignifiedquire
Contributor Author

/netsim

@github-actions

fix-odd-issues.2c41eed44fda16b2600ed8f8e68d2958ffd7cf93
Perf report:

test                 case      throughput_gbps  throughput_transfer
iroh_latency_20ms    1_to_1    1.53             2.97
iroh_latency_20ms    1_to_3    4.12             7.95
iroh_latency_20ms    1_to_5    6.64             14.42
iroh_latency_20ms    1_to_10   11.85            23.64
iroh_latency_20ms    2_to_2    2.84             6.12
iroh_latency_20ms    2_to_4    5.20             12.04
iroh_latency_20ms    2_to_6    8.22             17.78
iroh_latency_20ms    2_to_10   11.83            24.29
iroh                 1_to_1    1.55             3.08
iroh                 1_to_3    4.08             8.74
iroh                 1_to_5    6.39             14.53
iroh                 1_to_10   12.48            27.28
iroh                 2_to_2    2.62             6.12
iroh                 2_to_4    5.74             12.38
iroh                 2_to_6    8.17             17.58
iroh                 2_to_10   12.22            26.79
iroh_latency_200ms   1_to_1    1.30             3.03
iroh_latency_200ms   1_to_3    4.17             9.27
iroh_latency_200ms   1_to_5    6.80             15.27
iroh_latency_200ms   1_to_10   11.61            21.92
iroh_latency_200ms   2_to_2    2.65             6.33
iroh_latency_200ms   2_to_4    5.44             12.11
iroh_latency_200ms   2_to_6    8.07             18.33
iroh_latency_200ms   2_to_10   12.39            27.83

@Arqu
Collaborator

Arqu commented Jun 26, 2023

Seems like the failures are real connectivity issues (again with a NAT on the provide side).

Comment on lines 195 to 201
if let Some(addr) = udp_addr {
    self.best_addr = Some(AddrLatency {
        addr,
        latency: None,
    });

    self.trust_best_addr_until = Some(*now + Duration::from_secs(15));
Contributor

Okay, so from looking at the logs in the failing cli test, there are two main issues I'm seeing.

This issue is apparent on the get side, since as soon as we connect we try to send something.

Issue One, that needs some more thinking:
The first time we attempt to do a full ping with a callmemaybe over derp is also the first time we ever connect to the derp server. This requires a full TLS handshake, which takes some time to negotiate. But callmemaybe is time sensitive, so this attempt is highly likely to never hole punch.

Issue Two:
This is related to the actual code I'm commenting on 🤣. Once we set the best_addr and trust_best_addr, we are potentially condemned to using a bad address for 15 seconds. Plus, since it's considered valid, we won't back it up with a derp address, even though we have had no proof (yet) that it is a valid address.

One possible solution would be to keep self.trust_best_addr_until as None until a full_ping comes back successfully. But this means that if another pong has come back on a different address during this time, we won't ever attempt to use it, because we still have a bad address as the best_addr.

I'm proposing, instead, that we never set an address as a best_addr unless it has a latency.

So that would mean just removing these lines (this fixes the bugs locally for me, at least).

If we are worried about the candidate address flip-flopping, we can keep track of a candidate_addr, which is the address we have randomly chosen as the default to use until we get any pongs back. We can loop through the addresses, and if none have any latency, we default to using the previous candidate_addr.
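
A rough sketch of this proposal, for illustration only and not the actual magicsock code: `AddrState`, `Endpoint`, and `pick_addr` are hypothetical names, and the sketch takes the first known address as the initial candidate where the comment above says "randomly chosen".

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// Hypothetical per-address state: `latency` is Some only after a pong came back.
struct AddrState {
    addr: SocketAddr,
    latency: Option<Duration>,
}

/// Hypothetical endpoint holding the fields discussed in this thread.
struct Endpoint {
    addrs: Vec<AddrState>,
    best_addr: Option<(SocketAddr, Duration)>,
    candidate_addr: Option<SocketAddr>,
}

impl Endpoint {
    /// Recompute best_addr from addresses that have a measured latency and
    /// return the address to send to: the lowest-latency confirmed address,
    /// or else the sticky candidate.
    fn pick_addr(&mut self) -> Option<SocketAddr> {
        // best_addr is only ever set from an address with a latency.
        self.best_addr = self
            .addrs
            .iter()
            .filter_map(|a| a.latency.map(|l| (a.addr, l)))
            .min_by_key(|&(_, l)| l);
        if let Some((addr, _)) = self.best_addr {
            return Some(addr);
        }
        // No pongs yet: reuse the previous candidate to avoid flip-flopping.
        // Assumption: the first known address stands in for the random choice.
        if self.candidate_addr.is_none() {
            self.candidate_addr = self.addrs.first().map(|a| a.addr);
        }
        self.candidate_addr
    }
}
```

Because an unconfirmed address never becomes best_addr here, it is never treated as valid, so nothing stops the caller from also backing it up with derp.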

Contributor Author

> are potentially condemned to using a bad address for 15 seconds.

Well, only for 5s: the pings get sent out, and if they time out, the address gets removed.

> Plus, since it's considered valid,

This was the big problem: we drop derp because we think the address is valid. I forgot my own logic last night...
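
To make the failure mode concrete, here is a minimal sketch of a send decision that keeps derp as a backup until the UDP address has been proven. It is an assumption-laden illustration rather than the logic in this PR: `PathState`, `SendPlan`, and `plan_send` are hypothetical names, and "proven" is taken to mean that a latency has been measured and the trust deadline has not passed.

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Hypothetical view of the endpoint state discussed in this thread.
struct PathState {
    best_addr: Option<SocketAddr>,
    latency: Option<Duration>,
    trust_best_addr_until: Option<Instant>,
}

/// Where a packet should go. Variant names are illustrative.
#[derive(Debug, PartialEq)]
enum SendPlan {
    UdpOnly(SocketAddr),
    UdpAndDerp(SocketAddr),
    DerpOnly,
}

/// Drop derp only when the UDP address is proven: it has a measured latency
/// and its trust window is still open.
fn plan_send(state: &PathState, now: Instant) -> SendPlan {
    let trusted = state.latency.is_some()
        && state.trust_best_addr_until.map_or(false, |until| now < until);
    match state.best_addr {
        Some(addr) if trusted => SendPlan::UdpOnly(addr),
        // Unproven address: try it, but keep derp as a backup path.
        Some(addr) => SendPlan::UdpAndDerp(addr),
        None => SendPlan::DerpOnly,
    }
}
```

With a rule like this, an address that was set with `latency: None` would still be tried over UDP, but traffic would keep flowing over derp until a pong confirms it.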

.github/workflows/ci.yml (outdated, resolved)
Collaborator

@Arqu Arqu left a comment

⭕ 👊

@dignifiedquire dignifiedquire merged commit 8e2d947 into main Jun 27, 2023
@dignifiedquire dignifiedquire deleted the fix-odd-issues branch June 27, 2023 17:13

Successfully merging this pull request may close these issues.

netsim integration tests with a NAT on the provide side fail
3 participants