Try ipv6 connections even after udp timeout #1086

zugz · 2018-08-13T19:35:41Z

Currently this just adds a test demonstrating the problem: if a friend
connection is established for at least 2 seconds, then one friend stops
iterating for at least 31 seconds, then the friend connection goes down and
requires around two minutes to come back up. This is clearly excessive, and
poses a problem for tests which want to use this method for testing network
disconnections, as well as for actual clients which are ^Z suspended.

Experience with actual clients suggests suspension of toxcore is necessary to
trigger this bug - disconnections due to actual network failure are typically
re-established much faster. I have no real idea currently what the problem
could be.

See also zugz/c-toxcore:reconnect_slow:timeshifting for a version of the test
which executes without delays, by simulating the sleeps.

This change is

CLAassistant · 2018-08-13T19:35:49Z

.

zugz · 2018-08-14T17:13:30Z

It looks like I found the problem, but please review carefully to make sure this is the correct solution.

zugz · 2018-08-14T17:30:18Z

@GrayHatter could you take a look at this? Could there have been a reason for the previous behaviour?

GrayHatter · 2018-08-15T07:39:38Z

What part are you asking me to take a look at?

Verify the correctness of the tests? Or explain the time out behavior?

GrayHatter · 2018-08-15T07:56:46Z

What are you asking here? What's the point of timing out connections? Or verify the correctness of the tests?

Re: The change to netcrypto. Generally, I don't approve. If you want to disable the timeout checks, that's something I'm happy to discuss the merits of, but this still does the checks, then ignores them after the fact.

I'd have to look into exactly what calls return_ipp_conn.... But why would you want to keep trying for a connection that is already timed out? I myself have always thought the timeouts were way too lazy. IMO toxcore should be culling connetions after something closer to 10s. (But that would be a change for after some network optimizations, toxcore is too network heavy for that still)

Before I can't answer intelligently, about why toxcore does what it does... Can you describe the problem your coming up against a bit better?

Additional thought; I don't know the TCP/IP stack that well, does the system hold packets for suspended apps? If the kernel is kind enough to queue UDP packets for suspended apps with an open socket. When you resume the app, the paused client will think it's peer is still online, (because the kernel will then send it all the packets sent by it's peer, thus the suspend app will update the timeout to when it first got the packets, not when they were actually sent). Where as with a network outage both clients will rightfully see the other as disconnected, because the timeouts on both ends will be accurate.

zugz · 2018-08-15T15:12:15Z

@GrayHatter: thanks for looking. The original code checks if both ip4 and ip6 have timed out, and if so returns an ip4 address if it exists. But if there's no ip4 address, it returns 'empty' even if there is an ip6 address. My question is whether this difference between the handling of ip4 and ip6 was deliberate, or somehow makes sense. Maybe @irungentoo could help here. This comes up because returning 'empty' leads to a connection which stays alive but along which no udp packets can be sent. So the test in this PR fails without the patch to net_crypto - the useless net_crypto connection doesn't get restarted until FRIEND_DHT_TIMEOUT seconds pass. It does seem to make sense not to simply kill the connection after the UDP timeout, because it gives TCP a chance.

does the system hold packets for suspended apps?

Based on a quick test with netcat: yes. And yes, this fits with what we see in tox - it takes the suspended tox another 32 (FRIEND_CONNECTION_TIMEOUT) seconds to notice that the connection has gone down.

iphydf · 2018-08-15T23:18:28Z

auto_tests/reconnect_test.c

+#include "../toxcore/util.h"
+#include "check_compat.h"
+
+#define NUM_TOXES 2


This needs a different name, or needs #undef at the end: https://travis-ci.org/TokTok/c-toxcore/jobs/416043872#L1251

iphydf · 2018-08-15T23:19:27Z

auto_tests/reconnect_test.c

+
+    printf("letting connections settle\n");
+
+    while ((int)(time(nullptr) - test_start_time) < 2) {


Why the int-cast here?

iphydf · 2018-08-15T23:20:56Z

auto_tests/reconnect_test.c

+    uint32_t connected_count = 0;
+
+    for (size_t j = 0; j < friend_count; j++) {
+        if (tox_friend_get_connection_status(toxes[index], j, nullptr) != TOX_CONNECTION_NONE) {


Ideally we'd use the state and friend connection callbacks for this, but maybe we need a bit more auto_test framework to make that less repetitive.

Yes. I'm leaving it as is for now - quite a few of the tests use this deprecated function, best to fix them all at once.

iphydf · 2018-08-15T23:26:52Z

toxcore/net_crypto.c

+        return conn->ip_portv4;
+    }
+
+    if (net_family_is_ipv6(conn->ip_portv6.ip.family)) {


I'll think about this a bit more tomorrow. Before I think about it and read more code, maybe you can tell me what the purpose of line 638-640 is after this change. I.e. why we have ip_is_lan between the two ipv6 returns.

zugz · 2018-08-16T01:55:55Z

* Wednesday, 2018-08-15 at 16:26 -0700 - iphydf <[email protected]>:

I'll think about this a bit more tomorrow. Before I think about it and read more code, maybe you can tell me what the purpose of line 638-640 is after this change. I.e. why we have `ip_is_lan` between the two ipv6 returns.

In the existing code, we prefer in order: a non-timed-out v4 lan connection, a non-timed-out v6 connection, then a v4 connection. The lines you mention are there to preserve these preferences when considering timed out connections, prefering v4 lan then v6 then v4.

iphydf

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @zugz)

auto_tests/Makefile.inc, line 3 at r3 (raw file):

if BUILD_TESTS

TESTS = disc_test bootstrap_test conference_double_invite_test conference_peer_nick_test conference_simple_test conference_test \

reconnect_test (and keep this sorted)

auto_tests/Makefile.inc, line 10 at r3 (raw file):

        tox_many_tcp_test tox_many_test tox_one_test tox_strncasecmp_test typing_test version_test

check_PROGRAMS = disc_test bootstrap_test conference_double_invite_test conference_peer_nick_test conference_simple_test \

reconnect_test

auto_tests/Makefile.inc, line 60 at r3 (raw file):

conference_test_LDADD = $(AUTOTEST_LDADD)

disc_test_SOURCES = ../auto_tests/bootstrap_test.c

Not bootstrap_test, reconnect_test.

iphydf

Reviewed 1 of 1 files at r3, 2 of 3 files at r4.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @zugz)

auto_tests/reconnect_test.c, line 1 at r4 (raw file):

/* Auto Tests: Conferences.

Something else here. Perhaps copy/paste some of the PR description here for posterity.

auto_tests/reconnect_test.c, line 71 at r4 (raw file):

    printf("disconnecting #%u\n", state[disconnect].index);

    while (!all_disconnected_from(TOX_COUNT, toxes, state, disconnect)) {

Optional: perhaps use do-while like in all other tests as of recently.

codecov · 2018-08-26T08:30:32Z

Codecov Report

Merging #1086 into master will decrease coverage by <.1%.
The diff coverage is 95%.

@@           Coverage Diff            @@
##           master   #1086     +/-   ##
========================================
- Coverage    82.7%   82.6%   -0.1%     
========================================
  Files          81      82      +1     
  Lines       14438   14478     +40     
========================================
+ Hits        11945   11965     +20     
- Misses       2493    2513     +20

Impacted Files	Coverage Δ
auto_tests/reconnect_test.c	`100% <100%> (ø)`
toxcore/net_crypto.c	`93.5% <66.6%> (-0.1%)`	⬇️
toxcore/crypto_core_test.cc	`90.3% <0%> (-9.7%)`	⬇️
toxav/audio.c	`65.7% <0%> (-7.8%)`	⬇️
toxav/toxav.c	`68.3% <0%> (-0.4%)`	⬇️
toxcore/Messenger.c	`85.7% <0%> (-0.1%)`	⬇️
toxcore/DHT.c	`76.7% <0%> (ø)`	⬆️
toxav/msi.c	`65% <0%> (+0.5%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0075374...66ab386. Read the comment docs.

zugz · 2018-08-26T09:34:21Z

Done all, and fixed a bug I introduced at some point in the reconnection test - it wasn't actually disconnecting! Now it once again fails without the ipv6 change and succeeds with it, as intended.

iphydf · 2018-08-27T16:14:09Z

<@irungentoo> it looks ok
<@irungentoo> though for consistency it might be better to put the  if (net_family_is_ipv4(conn->ip_portv4.ip.family)) {...} before: if (net_family_is_ipv6(conn->ip_portv6.ip.family)) {...}

Also adds a test (auto_reconnect_test) which fails without this change.

iphydf added this to the v0.2.x milestone Aug 13, 2018

zugz force-pushed the reconnect_slow branch from 9c24cb3 to 2392280 Compare August 14, 2018 17:11

zugz changed the title ~~WIP: make friend connections reconnect faster after suspend~~ try ipv6 connections even after udp timeout Aug 14, 2018

zugz changed the title ~~try ipv6 connections even after udp timeout~~ Try ipv6 connections even after udp timeout Aug 14, 2018

zugz force-pushed the reconnect_slow branch from 2392280 to be77440 Compare August 14, 2018 17:28

zugz force-pushed the reconnect_slow branch from be77440 to 04055f7 Compare August 14, 2018 18:30

iphydf requested changes Aug 15, 2018

View reviewed changes

zugz force-pushed the reconnect_slow branch from 04055f7 to 7611d96 Compare August 18, 2018 14:43

iphydf requested changes Aug 19, 2018

View reviewed changes

zugz force-pushed the reconnect_slow branch from 7611d96 to 15281df Compare August 26, 2018 07:03

iphydf requested changes Aug 26, 2018

View reviewed changes

zugz force-pushed the reconnect_slow branch from 15281df to be1a29f Compare August 26, 2018 09:32

iphydf approved these changes Aug 26, 2018

View reviewed changes

iphydf assigned irungentoo Aug 26, 2018

zugz force-pushed the reconnect_slow branch 5 times, most recently from 0453144 to 27a22a0 Compare August 27, 2018 16:08

iphydf unassigned irungentoo Aug 27, 2018

zugz force-pushed the reconnect_slow branch 2 times, most recently from c1624c6 to c6b0073 Compare August 27, 2018 20:05

try ipv6 connections even after udp timeout

66ab386

Also adds a test (auto_reconnect_test) which fails without this change.

zugz force-pushed the reconnect_slow branch from c6b0073 to 66ab386 Compare August 27, 2018 20:28

zugz merged commit 66ab386 into TokTok:master Aug 27, 2018

zugz deleted the reconnect_slow branch August 27, 2018 21:18

robinlinden modified the milestones: v0.2.x, v0.2.7 Aug 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try ipv6 connections even after udp timeout #1086

Try ipv6 connections even after udp timeout #1086

zugz commented Aug 13, 2018 •

edited by iphydf

Loading

CLAassistant commented Aug 13, 2018 •

edited by iphydf

Loading

zugz commented Aug 14, 2018

zugz commented Aug 14, 2018

GrayHatter commented Aug 15, 2018

GrayHatter commented Aug 15, 2018

zugz commented Aug 15, 2018 via email

iphydf Aug 15, 2018

iphydf Aug 15, 2018

iphydf Aug 15, 2018

zugz Aug 18, 2018

iphydf Aug 15, 2018

zugz commented Aug 16, 2018 via email

iphydf left a comment

iphydf left a comment

codecov bot commented Aug 26, 2018 •

edited

Loading

zugz commented Aug 26, 2018 via email

iphydf commented Aug 27, 2018


		printf("letting connections settle\n");

		while ((int)(time(nullptr) - test_start_time) < 2) {

Try ipv6 connections even after udp timeout #1086

Try ipv6 connections even after udp timeout #1086

Conversation

zugz commented Aug 13, 2018 • edited by iphydf Loading

CLAassistant commented Aug 13, 2018 • edited by iphydf Loading

zugz commented Aug 14, 2018

zugz commented Aug 14, 2018

GrayHatter commented Aug 15, 2018

GrayHatter commented Aug 15, 2018

zugz commented Aug 15, 2018 via email

iphydf Aug 15, 2018

Choose a reason for hiding this comment

iphydf Aug 15, 2018

Choose a reason for hiding this comment

iphydf Aug 15, 2018

Choose a reason for hiding this comment

zugz Aug 18, 2018

Choose a reason for hiding this comment

iphydf Aug 15, 2018

Choose a reason for hiding this comment

zugz commented Aug 16, 2018 via email

iphydf left a comment

Choose a reason for hiding this comment

iphydf left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 26, 2018 • edited Loading

Codecov Report

zugz commented Aug 26, 2018 via email

iphydf commented Aug 27, 2018

zugz commented Aug 13, 2018 •

edited by iphydf

Loading

CLAassistant commented Aug 13, 2018 •

edited by iphydf

Loading

codecov bot commented Aug 26, 2018 •

edited

Loading