fix(iroh-net): Keep the relay connection alive on read errors #2782
Description
When the connection to the relay server fails, the read channel returns a read error. At this point the ActiveRelay actor passively waits until it is asked to send something again before it re-establishes the connection.
However, if the local node has no reason to send anything to the relay server, the connection is never re-established. This is problematic when the relay has remote nodes trying to send to this node. It is doubly problematic when the connection is to the home relay: the node just sits there thinking everything is healthy and quiet, but no traffic is reaching it.
In a node with active traffic this doesn't really show up, since a send will be triggered quickly for an active connection and the connection with the relay server would be re-established.
The start of the ActiveRelay run loop is the right place to re-establish the connection: a read error already causes the loop to go round and is already logged, so the next iteration reconnects.
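To make the control flow concrete, here is a minimal synchronous sketch of a run loop that checks the connection at the top of every iteration. It is not the iroh-net code: `MockRelayClient`, `run_loop` and the scripted read results are hypothetical stand-ins, and the real actor is asynchronous.

```rust
/// Hypothetical stand-in for the relay client; not iroh-net's real API.
struct MockRelayClient {
    connected: bool,
    connect_calls: u32,
}

impl MockRelayClient {
    fn is_connected(&self) -> bool {
        self.connected
    }

    /// Pretends to (re-)establish the connection; always succeeds here.
    fn connect(&mut self) -> Result<(), String> {
        self.connect_calls += 1;
        self.connected = true;
        Ok(())
    }

    /// Simulated read: an `Err` models the connection dropping.
    fn recv(
        &mut self,
        event: Result<&'static str, &'static str>,
    ) -> Result<&'static str, &'static str> {
        if event.is_err() {
            self.connected = false;
        }
        event
    }
}

/// Runs the loop over a scripted sequence of read results and
/// returns the frames that were received successfully.
fn run_loop(
    client: &mut MockRelayClient,
    script: Vec<Result<&'static str, &'static str>>,
) -> Vec<&'static str> {
    let mut received = Vec::new();
    for event in script {
        // Start of each iteration: reconnect if the last read failed,
        // without waiting for an outgoing send to trigger it.
        if !client.is_connected() {
            if let Err(err) = client.connect() {
                eprintln!("reconnect failed: {}", err);
                continue;
            }
        }
        match client.recv(event) {
            Ok(frame) => received.push(frame),
            // The read error is logged; the loop simply goes round.
            Err(err) => eprintln!("read error: {}", err),
        }
    }
    received
}
```

With a script of a successful read, a read error, and another successful read, the second frame is still received because the loop reconnects before the third read, even though nothing was sent in between.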
This does not keep the relay connection open forever. The mechanism that cleans up
unused connections to relay servers will still function correctly, since it only takes
into account the time something was last sent to a relay server. As long as a connection
with a remote node exists there will be a DISCO ping between the two nodes over the relay
path, so the connection is correctly kept alive. The home relay is exempted from the
relay connection cleanup, so it is also kept connected, leaving this node available to be
contacted via the relay server, which is the entire point of this bugfix.
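The cleanup rule described above can be sketched as a pure function. This is only an illustration under assumptions: `relays_to_close`, the map of last-send times, and the `idle_timeout` parameter are hypothetical names, and the actual timeout used by iroh-net is not stated here.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Returns the relay URLs whose connection may be closed: those where
/// nothing was *sent* for longer than `idle_timeout`, excluding the
/// home relay. Reconnects after read errors do not touch `last_send`,
/// so they do not extend a relay's lifetime.
fn relays_to_close(
    last_send: &HashMap<String, Instant>,
    home_relay: &str,
    now: Instant,
    idle_timeout: Duration,
) -> Vec<String> {
    let mut stale: Vec<String> = last_send
        .iter()
        .filter(|(url, sent)| {
            // The home relay is exempt from cleanup.
            url.as_str() != home_relay && now.duration_since(**sent) > idle_timeout
        })
        .map(|(url, _)| url.clone())
        .collect();
    // Sort for a deterministic result; HashMap iteration order is arbitrary.
    stale.sort();
    stale
}
```

Because the decision depends only on the last send time, a relay that keeps being reconnected but never carries outgoing traffic is still cleaned up, unless it is the home relay.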
The relay_client.is_connected() call sends a message to the relay Client actor, and relay_client.connect() sends another. Taking the shortcut of only calling .connect() is not better, however, because the logging becomes messier. In the common case there is a single round-trip message to the relay Client actor, which the shortcut would not improve on anyway. The two-message case, where a reconnect is needed, does not occur often.
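The message-count trade-off can be illustrated with a toy actor built on standard-library channels. Everything here is hypothetical (`Msg`, `spawn_client_actor`, `ensure_connected`, `messages_handled`); the real relay Client actor is async and has a different interface.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Messages understood by the toy client actor (hypothetical).
enum Msg {
    IsConnected(Sender<bool>),
    Connect(Sender<()>),
    /// Test helper: reports how many requests were handled so far.
    Stats(Sender<u32>),
}

/// Spawns a toy actor that tracks a connected flag.
fn spawn_client_actor(initially_connected: bool) -> Sender<Msg> {
    let (tx, rx) = channel::<Msg>();
    thread::spawn(move || {
        let mut connected = initially_connected;
        let mut handled = 0u32;
        for msg in rx {
            match msg {
                Msg::IsConnected(reply) => {
                    handled += 1;
                    let _ = reply.send(connected);
                }
                Msg::Connect(reply) => {
                    handled += 1;
                    connected = true;
                    let _ = reply.send(());
                }
                Msg::Stats(reply) => {
                    let _ = reply.send(handled);
                }
            }
        }
    });
    tx
}

/// Checks first, connects only if needed. Returns true if a reconnect
/// happened, so the caller logs a reconnect only when one took place.
fn ensure_connected(actor: &Sender<Msg>) -> bool {
    let (tx, rx) = channel();
    actor.send(Msg::IsConnected(tx)).unwrap();
    if rx.recv().unwrap() {
        return false; // common case: one round trip
    }
    eprintln!("relay connection lost, reconnecting");
    let (tx, rx) = channel();
    actor.send(Msg::Connect(tx)).unwrap();
    rx.recv().unwrap();
    true // rare case: two round trips
}

/// Asks the actor how many requests it has handled.
fn messages_handled(actor: &Sender<Msg>) -> u32 {
    let (tx, rx) = channel();
    actor.send(Msg::Stats(tx)).unwrap();
    rx.recv().unwrap()
}
```

In this sketch a healthy connection costs one request per loop iteration, and only the iteration that follows a failure costs two.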
Breaking Changes
None
Notes & open questions
Fixes fishfolk/bones#428
It is rather difficult to test though.
This targets #2781 as base.
Change checklist
[ ] Documentation updates following the style guide, if relevant.
[ ] Tests if relevant.
[ ] All breaking changes documented.