fix(iroh-net): Keep the relay connection alive on read errors #2782

flub · 2024-10-03T17:07:08Z

Description

When the connection to the relay server fails the read channel will return a read error. At this point the ActiveRelay actor will passively wait until it has been asked to send something again before it will re-establish a connection.

However, if the local node has no reason to send anything to the relay server, the connection is never re-established. This is problematic when the relay has remote nodes trying to send to this node. It is doubly problematic when the connection is to the home relay: the node just sits there thinking everything is healthy and quiet, but no traffic is reaching it.

In a node with active traffic this doesn't really show up, since a send will be triggered quickly for an active connection and the connection with the relay server would be re-established.

The start of the ActiveRelay run loop is the right place to re-establish the connection. A read error already logs the error and triggers the loop to go round, and the loop then re-establishes the connection.
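The idea of reconnecting at the top of the run loop can be sketched as follows. This is a simplified, synchronous illustration and not the iroh-net code: `MockRelayClient`, `Event` and `run_loop` are hypothetical names, and the real ActiveRelay actor is asynchronous and talks to a separate relay Client actor.

```rust
/// Inputs the sketch's loop reacts to (hypothetical, for illustration only).
enum Event {
    /// A frame arrived from the relay server.
    Frame(&'static str),
    /// The read channel returned an error: the connection is gone.
    ReadError,
}

/// Hypothetical stand-in for the relay client; not the iroh-net type.
struct MockRelayClient {
    connected: bool,
    connects: u32,
}

impl MockRelayClient {
    fn is_connected(&self) -> bool {
        self.connected
    }

    fn connect(&mut self) {
        self.connected = true;
        self.connects += 1;
    }
}

/// Runs the loop over a fixed list of events and returns a log of what happened.
fn run_loop(client: &mut MockRelayClient, events: Vec<Event>) -> Vec<String> {
    let mut log = Vec::new();
    let mut events = events.into_iter();
    loop {
        // Start of the loop: re-establish the connection if it was lost,
        // without waiting for an outgoing send to trigger it.
        if !client.is_connected() {
            client.connect();
            log.push("reconnected".to_string());
        }
        match events.next() {
            Some(Event::Frame(data)) => log.push(format!("recv {}", data)),
            Some(Event::ReadError) => {
                // The error is logged and the loop simply goes round again.
                log.push("read error".to_string());
                client.connected = false;
            }
            None => break,
        }
    }
    log
}
```

In this sketch, feeding the loop a frame, a read error and another frame produces a log in which the reconnect sits directly after the read error, with no send involved.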

This does not keep the relay connection open forever. The mechanism that cleans up unused connections to relay servers will still function correctly, since it only takes into account the time something was last sent to a relay server. As long as a connection with a remote node exists there will be a DISCO ping between the two nodes over the relay path, so the connection is correctly kept alive. The home relay is exempted from the relay connection cleanup, so it is also kept connected, leaving this node available to be contacted via the relay server, which is the entire point of this bugfix.
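As a rough sketch of why the cleanup still works, consider a cleanup pass that looks only at the last send time and skips the home relay. The names `RelayConn`, `stale_relays` and `idle_timeout`, and the timeout value used with them, are assumptions made for this illustration and are not taken from iroh-net.

```rust
use std::time::{Duration, Instant};

/// Hypothetical record of one relay connection.
struct RelayConn {
    url: &'static str,
    /// Time something was last sent to this relay server.
    last_write: Instant,
}

/// Returns the relays that the cleanup would close: idle for longer than
/// `idle_timeout` and not the home relay. Reads and reconnects do not
/// touch `last_write`, so they do not extend a connection's lifetime.
fn stale_relays(
    conns: &[RelayConn],
    home: &str,
    now: Instant,
    idle_timeout: Duration,
) -> Vec<&'static str> {
    conns
        .iter()
        .filter(|c| c.url != home)
        .filter(|c| now.duration_since(c.last_write) > idle_timeout)
        .map(|c| c.url)
        .collect()
}
```

Under these assumptions a reconnect after a read error changes nothing about which connections are eligible for cleanup: only sends, such as the DISCO pings, refresh `last_write`.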

The relay_client.is_connected() call sends a message to the relay Client actor, and relay_client.connect() does that again. Taking the shortcut of only calling .connect() is not better, however, because the logging becomes messier. In the common case there is one round-trip message to the relay Client actor, and the shortcut would not improve on that anyway. The case where a reconnect is needed, which takes two messages, does not occur commonly.
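The message count argued above can be illustrated with a small counter. `CountingClient` and `ensure_connected` are hypothetical names for this sketch; each method call stands in for one round-trip message to the relay Client actor.

```rust
/// Hypothetical client that counts the messages sent to the Client actor.
struct CountingClient {
    connected: bool,
    messages: u32,
}

impl CountingClient {
    fn is_connected(&mut self) -> bool {
        self.messages += 1; // one round trip to the actor
        self.connected
    }

    fn connect(&mut self) {
        self.messages += 1; // a second round trip, only when reconnecting
        self.connected = true;
    }
}

/// Checks first, connects only if needed. Returns true if it reconnected,
/// which is the only case worth logging.
fn ensure_connected(client: &mut CountingClient) -> bool {
    if client.is_connected() {
        return false;
    }
    client.connect();
    true
}
```

The common, already-connected path costs one message either way, and the boolean result keeps the reconnect log line confined to the rare path.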

Breaking Changes

None

Notes & open questions

Fixes fishfolk/bones#428

It is rather difficult to test though.

This targets #2781 as base.

Change checklist

Self-review.
~~[ ] Documentation updates following the style guide, if relevant.~~
~~[ ] Tests if relevant.~~
~~[ ] All breaking changes documented.~~

github-actions · 2024-10-03T17:09:36Z

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2782/docs/iroh/

Last updated: 2024-10-04T07:20:29Z

github-actions · 2024-10-03T17:16:48Z

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 892203f

ramfox

lord the relay stuff is so convoluted. Actors on actors on actors.

Looks good, though!


flub requested review from ramfox and divagant-martian October 3, 2024 17:07

flub mentioned this pull request Oct 3, 2024

Matchmaker eventually stops accepting connections after being deployed fishfolk/bones#428

Closed

flub force-pushed the flub/relay-actor-active-nodes branch from 8c594f5 to 275b399 Compare October 3, 2024 17:24

flub force-pushed the flub/relay-conn-keep-alive branch from 4ba59df to 64750b4 Compare October 3, 2024 17:25

ramfox approved these changes Oct 4, 2024


Base automatically changed from flub/relay-actor-active-nodes to main October 4, 2024 07:08

flub force-pushed the flub/relay-conn-keep-alive branch from 64750b4 to 3eadb89 Compare October 4, 2024 07:18

flub added this pull request to the merge queue Oct 4, 2024

Merged via the queue into main with commit 383f1f9 Oct 4, 2024
27 checks passed

flub deleted the flub/relay-conn-keep-alive branch October 4, 2024 08:30

arilotter mentioned this pull request Nov 19, 2024

iroh-net: regression: blob downloads freeze at a low percentage completed #2951

Closed
