-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try ipv6 connections even after udp timeout #1086
Conversation
. |
It looks like I found the problem, but please review carefully to make sure this is the correct solution. |
@GrayHatter could you take a look at this? Could there have been a reason for the previous behaviour? |
What part are you asking me to take a look at? Verify the correctness of the tests? Or explain the time out behavior? |
What are you asking here? What's the point of timing out connections? Or verify the correctness of the tests? Re: The change to netcrypto. Generally, I don't approve. If you want to disable the timeout checks, that's something I'm happy to discuss the merits of, but this still does the checks, then ignores them after the fact. I'd have to look into exactly what calls return_ipp_conn.... But why would you want to keep trying for a connection that is already timed out? I myself have always thought the timeouts were way too lazy. IMO toxcore should be culling connetions after something closer to 10s. (But that would be a change for after some network optimizations, toxcore is too network heavy for that still) Before I can't answer intelligently, about why toxcore does what it does... Can you describe the problem your coming up against a bit better? Additional thought; I don't know the TCP/IP stack that well, does the system hold packets for suspended apps? If the kernel is kind enough to queue UDP packets for suspended apps with an open socket. When you resume the app, the paused client will think it's peer is still online, (because the kernel will then send it all the packets sent by it's peer, thus the suspend app will update the timeout to when it first got the packets, not when they were actually sent). Where as with a network outage both clients will rightfully see the other as disconnected, because the timeouts on both ends will be accurate. |
@GrayHatter: thanks for looking. The original code checks if both ip4
and ip6 have timed out, and if so returns an ip4 address if it exists.
But if there's no ip4 address, it returns 'empty' even if there is an
ip6 address. My question is whether this difference between the handling
of ip4 and ip6 was deliberate, or somehow makes sense.
Maybe @irungentoo could help here.
This comes up because returning 'empty' leads to a connection which
stays alive but along which no udp packets can be sent. So the test in
this PR fails without the patch to net_crypto - the useless net_crypto
connection doesn't get restarted until FRIEND_DHT_TIMEOUT seconds pass.
It does seem to make sense not to simply kill the connection after the
UDP timeout, because it gives TCP a chance.
does the system hold packets for suspended apps?
Based on a quick test with netcat: yes. And yes, this fits with what we
see in tox - it takes the suspended tox another 32
(FRIEND_CONNECTION_TIMEOUT) seconds to notice that the connection has
gone down.
|
auto_tests/reconnect_test.c
Outdated
#include "../toxcore/util.h" | ||
#include "check_compat.h" | ||
|
||
#define NUM_TOXES 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a different name, or needs #undef
at the end: https://travis-ci.org/TokTok/c-toxcore/jobs/416043872#L1251
auto_tests/reconnect_test.c
Outdated
|
||
printf("letting connections settle\n"); | ||
|
||
while ((int)(time(nullptr) - test_start_time) < 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the int-cast here?
uint32_t connected_count = 0; | ||
|
||
for (size_t j = 0; j < friend_count; j++) { | ||
if (tox_friend_get_connection_status(toxes[index], j, nullptr) != TOX_CONNECTION_NONE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd use the state and friend connection callbacks for this, but maybe we need a bit more auto_test framework to make that less repetitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'm leaving it as is for now - quite a few of the tests use this deprecated function, best to fix them all at once.
toxcore/net_crypto.c
Outdated
return conn->ip_portv4; | ||
} | ||
|
||
if (net_family_is_ipv6(conn->ip_portv6.ip.family)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll think about this a bit more tomorrow. Before I think about it and read more code, maybe you can tell me what the purpose of line 638-640 is after this change. I.e. why we have ip_is_lan
between the two ipv6 returns.
* Wednesday, 2018-08-15 at 16:26 -0700 - iphydf <[email protected]>:
I'll think about this a bit more tomorrow. Before I think about it and
read more code, maybe you can tell me what the purpose of line 638-640
is after this change. I.e. why we have `ip_is_lan` between the two ipv6
returns.
In the existing code, we prefer in order: a non-timed-out v4 lan
connection, a non-timed-out v6 connection, then a v4 connection. The
lines you mention are there to preserve these preferences when
considering timed out connections, prefering v4 lan then v6 then v4.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @zugz)
auto_tests/Makefile.inc, line 3 at r3 (raw file):
if BUILD_TESTS TESTS = disc_test bootstrap_test conference_double_invite_test conference_peer_nick_test conference_simple_test conference_test \
reconnect_test (and keep this sorted)
auto_tests/Makefile.inc, line 10 at r3 (raw file):
tox_many_tcp_test tox_many_test tox_one_test tox_strncasecmp_test typing_test version_test check_PROGRAMS = disc_test bootstrap_test conference_double_invite_test conference_peer_nick_test conference_simple_test \
reconnect_test
auto_tests/Makefile.inc, line 60 at r3 (raw file):
conference_test_LDADD = $(AUTOTEST_LDADD) disc_test_SOURCES = ../auto_tests/bootstrap_test.c
Not bootstrap_test, reconnect_test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, 2 of 3 files at r4.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @zugz)
auto_tests/reconnect_test.c, line 1 at r4 (raw file):
/* Auto Tests: Conferences.
Something else here. Perhaps copy/paste some of the PR description here for posterity.
auto_tests/reconnect_test.c, line 71 at r4 (raw file):
printf("disconnecting #%u\n", state[disconnect].index); while (!all_disconnected_from(TOX_COUNT, toxes, state, disconnect)) {
Optional: perhaps use do-while like in all other tests as of recently.
Codecov Report
@@ Coverage Diff @@
## master #1086 +/- ##
========================================
- Coverage 82.7% 82.6% -0.1%
========================================
Files 81 82 +1
Lines 14438 14478 +40
========================================
+ Hits 11945 11965 +20
- Misses 2493 2513 +20
Continue to review full report at Codecov.
|
Done all, and fixed a bug I introduced at some point in the reconnection
test - it wasn't actually disconnecting! Now it once again fails without
the ipv6 change and succeeds with it, as intended.
|
0453144
to
27a22a0
Compare
|
c1624c6
to
c6b0073
Compare
Also adds a test (auto_reconnect_test) which fails without this change.
Currently this just adds a test demonstrating the problem: if a friend
connection is established for at least 2 seconds, then one friend stops
iterating for at least 31 seconds, then the friend connection goes down and
requires around two minutes to come back up. This is clearly excessive, and
poses a problem for tests which want to use this method for testing network
disconnections, as well as for actual clients which are ^Z suspended.
Experience with actual clients suggests suspension of toxcore is necessary to
trigger this bug - disconnections due to actual network failure are typically
re-established much faster. I have no real idea currently what the problem
could be.
See also zugz/c-toxcore:reconnect_slow:timeshifting for a version of the test
which executes without delays, by simulating the sleeps.
This change is