
feat(iroh-net): combine discovery services and add heuristics when to start discovery #2056

Merged · 17 commits into main · Mar 11, 2024

Conversation

@Frando (Member) commented Mar 4, 2024

Description

  • Changes the resolve method of the discovery trait to return a Stream of DiscoveryItems
  • Adds a CombinedDiscovery to combine multiple discovery services
  • Always uses the discovery service when connecting to a node for which we don't have any address yet
  • Starts the quinn connect as soon as the first discovery result comes in, but continues the discovery until we have a successful connection
  • Moves the discovery trait from the magicsock module to a new discovery module (a rough trait sketch follows this list)
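
A rough sketch of the resulting trait shape, approximated from this description rather than copied from the diff; `NodeId`, `AddrInfo` and `DiscoveryItem` are the iroh-net types:

```rust
use anyhow::Result;
use futures::stream::BoxStream;

/// Approximated sketch only; the merged signatures may differ.
pub trait Discovery: std::fmt::Debug + Send + Sync {
    /// Publish the given address info to the discovery mechanism.
    /// Fire-and-forget: implementations spawn and own their own tasks.
    fn publish(&self, _info: &AddrInfo) {}

    /// Resolve the address info for a node, yielding results as they arrive.
    fn resolve(&self, _node_id: NodeId) -> Option<BoxStream<'static, Result<DiscoveryItem>>> {
        None
    }
}
```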

Notes & open questions

  • Should we also attempt a discovery if we have some address already?
  • Add tests

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

///
/// This will be called from a tokio task, so it is safe to spawn new tasks.
/// These tasks will be run on the runtime of the [`super::MagicEndpoint`].
fn publish(&self, _info: &AddrInfo) {}
Contributor:
why are these not async?

Contributor:
I can answer this at least for the original discovery trait.

They are intended to be "fire and forget". At the site where you call these, you do not want to wait for them to complete. E.g. if we publish to an HTTP PUT endpoint, we don't want to wait for a successful publish before continuing, since that would probably create havoc inside the magicsock and in any case not help anybody.

E.g. in the case of the pkarr discovery, this just puts the record somewhere and starts a loop to publish and republish to the DHT. Publishing to the DHT can take seconds; you don't want to wait for that. And what are you going to do if it fails, e.g. because you are offline?

Contributor:
Well, from a task management perspective I would have expected handling more like this:

// in the magicsock
let publish_task = tokio::task::spawn(discovery.publish());
// track task or drop depending on the management

because this way you can e.g. cancel things on shutdown

Contributor:
The task is owned by the discovery service, so it's the job of the discovery service to cancel the task on drop. I think there is not necessarily a 1:1 relation between tasks that need to be spawned and publish calls.

E.g. in the pkarr discovery there is just a single long lived republish task owned by the discovery, and calling publish just replaces the value to be published with the new one.
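
A minimal sketch of that pattern (hypothetical names, record simplified to a String, using tokio's watch channel): publish just swaps the value; one long-lived task owned by the service does the actual republishing.

```rust
use tokio::{sync::watch, task::JoinHandle};

/// Hypothetical sketch: one long-lived republish task owned by the service.
struct RepublishService {
    record: watch::Sender<Option<String>>,
    task: JoinHandle<()>,
}

impl RepublishService {
    fn new() -> Self {
        let (record, mut rx) = watch::channel::<Option<String>>(None);
        let task = tokio::spawn(async move {
            while rx.changed().await.is_ok() {
                let value = rx.borrow_and_update().clone();
                if let Some(value) = value {
                    // (Re)publish `value` to the DHT here; this may take
                    // seconds, and callers of `publish` never wait on it.
                    let _ = value;
                }
            }
        });
        Self { record, task }
    }

    /// Fire-and-forget: replace the value to be published and return immediately.
    fn publish(&self, value: String) {
        self.record.send_replace(Some(value));
    }
}

impl Drop for RepublishService {
    fn drop(&mut self) {
        // The service owns the task, so it also cancels it on drop.
        self.task.abort();
    }
}
```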

Contributor:
How would I communicate that I want the publish to end, then? There is no shutdown method or anything like that if this service is long-running.

Member Author:
Should we maybe pass CancellationTokens into both resolve and publish? This would give full flexibility to both. The downside is that implementors would likely have to use tokio::select!. Still, even for the resolve case it could make implementations simpler. Depending on how a resolver works, detecting that the returned stream was dropped may not be straightforward or instant. E.g. with channels, you might only realize once you try to send, so you might have been doing unneeded work up to that point.
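
For illustration, the token-passing variant could look roughly like this (hypothetical shape, not the merged API; result type simplified to a String):

```rust
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

// The endpoint hands a token to the discovery service, and the service
// selects on it inside its work loop instead of detecting stream drops.
async fn resolve_worker(cancel: CancellationToken, mut results: mpsc::Receiver<String>) {
    loop {
        tokio::select! {
            _ = cancel.cancelled() => break, // caller gave up: stop resolving
            item = results.recv() => match item {
                Some(item) => println!("discovered: {item}"), // forward to the stream
                None => break, // resolver finished on its own
            },
        }
    }
}
```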

Member Author:
Or we could add an optional fn shutdown() to the discovery trait, which would be called in MagicEndpoint::close and could be used by the discovery service to abort all tasks it started.

Contributor:
I like the cancellation token approach

Contributor:
When I saw the doc comment here, I was wondering why the signature does not give you a handle to the runtime to spawn on, instead of having it just in the documentation that "tokio spawn is fine".

Given the cancellation discussion, maybe it could even be a JoinSet that you're politely asked to spawn into, and then you get aborting for free when the magicsock drops the joinset.
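
A rough sketch of that idea (hypothetical signature, not what was merged):

```rust
use tokio::task::JoinSet;

// Hypothetical variant of the trait method:
// fn publish(&self, info: &AddrInfo, tasks: &mut JoinSet<()>);

fn publish_into(tasks: &mut JoinSet<()>) {
    // The service spawns its work into the set it was handed...
    tasks.spawn(async move {
        // ...long-running publish work here...
    });
}
// ...and when the owner (here: the magicsock) drops the JoinSet,
// every task spawned into it is aborted automatically.
```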

Member Author:
Leaving this for a follow-up; created an issue here: #2066

github-merge-queue bot pushed a commit that referenced this pull request Mar 6, 2024
…sock actor (#2058)

## Description

While working on #2056 I spotted that we use the actor inbox with return
channels for information that is already readily available on the shared
inner magicsock. This removes the unneeded complexity and thus simplifies
`get_mapping_addr`, `endpoint_info` and `endpoint_infos` to return their
information synchronously and infallibly. Yay!

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [x] Self-review.
@Frando changed the title from "(WIP) feat: combine discovery services" to "feat: combine discovery services" on Mar 7, 2024
@Frando (Member Author) commented Mar 7, 2024

I added more docs and tests, and implemented a DiscoveryTask with the heuristics as discussed on Discord (see the sketch after this list):

  • connect is called with a NodeAddr
  • If the NodeAddr contains direct_addrs or a derp_url: add them to the magicsock
  • If no direct_addrs or derp_url were provided, and the magicsock has no info either: start discovery and wait for the first result
  • If we haven't received recently on any path from the node, start a task in the background:
    • Check: did we receive recently? If so, done.
    • If the user provided new addresses, wait for a delay of 500ms
    • Check again: did we receive recently? If so, done. If not, start discovery.
  • Start connect_quinn. This runs in parallel with the discovery, which may still be running.
  • Quinn will retry automatically for up to 10s in our config.
  • Abort the discovery task once a connection is made
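
The gist of the heuristic as a self-contained decision function (hypothetical and simplified; the real logic lives in DiscoveryTask and the connect path):

```rust
use std::time::Duration;

/// What to do before dialing, per the heuristics above.
enum DiscoveryDecision {
    /// No addressing info at all: start discovery and wait for the first result.
    StartAndWaitForFirstResult,
    /// We have info, but nothing received recently: re-check after a delay, then discover.
    StartAfterDelay(Duration),
    /// A fresh path is available: no discovery needed.
    Skip,
}

fn decide(have_addr_info: bool, received_recently: bool) -> DiscoveryDecision {
    if !have_addr_info {
        DiscoveryDecision::StartAndWaitForFirstResult
    } else if !received_recently {
        DiscoveryDecision::StartAfterDelay(Duration::from_millis(500))
    } else {
        DiscoveryDecision::Skip
    }
}
```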

This is ready to review now. I think I will also rebase the DNS discovery on top of this.

@Frando changed the title from "feat: combine discovery services" to "feat: combine discovery services and add heuristics when to start discovery" on Mar 7, 2024
@Frando changed the title from "feat: combine discovery services and add heuristics when to start discovery" to "feat(iroh-net): combine discovery services and add heuristics when to start discovery" on Mar 7, 2024
pub addr_info: AddrInfo,
}

/// A discovery service that combines multiple discovery sources.
Contributor:
can you add a comment that it resolves in parallel vs. serially?

Contributor:
maybe even call it ConcurrentDiscovery?
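
A minimal sketch of the parallel behaviour being asked about, using futures' select_all to merge the per-service streams (simplified item type; the real CombinedDiscovery in this PR may differ):

```rust
use futures::stream::{self, BoxStream, SelectAll};

/// Merge the per-service result streams so all services are polled
/// concurrently and the first result wins, rather than querying one
/// service after another.
fn resolve_all(streams: Vec<BoxStream<'static, String>>) -> SelectAll<BoxStream<'static, String>> {
    stream::select_all(streams)
}

// Usage sketch (needs futures::StreamExt for `.next()`):
// let mut merged = resolve_all(streams);
// while let Some(item) = merged.next().await { /* first result wins */ }
```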

/// Start a discovery task.
pub fn start(ep: MagicEndpoint, node_id: NodeId) -> Result<Self> {
    if ep.discovery().is_none() {
        bail!("No discovery services configured");
Contributor:
could use ensure
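
i.e. with anyhow's ensure! macro the check collapses to one line (sketch with a stand-in boolean):

```rust
use anyhow::{ensure, Result};

fn start_checked(discovery_configured: bool) -> Result<()> {
    // Equivalent to the if/bail! above, in one line:
    ensure!(discovery_configured, "No discovery services configured");
    Ok(())
}
```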

@dignifiedquire (Contributor) left a review:
some nits but overall nice improvements

@Frando mentioned this pull request Mar 7, 2024
@Frando added this pull request to the merge queue Mar 11, 2024
Merged via the queue into main with commit f4d3fab Mar 11, 2024
20 checks passed