Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kad): correctly handle NoKnownPeers error when bootstrap #5349

Merged
merged 4 commits into from
Jun 18, 2024

Conversation

stormshield-frb
Copy link
Contributor

@stormshield-frb stormshield-frb commented Apr 30, 2024

Description

After testing master, we encountered a bug due to #4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I though there was no need to call the bootstrap_status.on_started method. But not doing so never resets the periodic timer inside bootstrap_status resulting in getting stuck to try to bootstrap every time poll is called on kad::Behaviour.

Notes & open questions

N/A

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@guillaumemichel
Copy link
Contributor

This looks a bit hacky, wouldn't it be better to modify the bootstrap Status instead (e.g poll_next_bootstrap)?

@stormshield-frb
Copy link
Contributor Author

This looks a bit hacky, wouldn't it be better to modify the bootstrap Status instead (e.g poll_next_bootstrap)?

I'm not sure to understand what you mean. on_started and on_finished are intended for that purpose.

Even if I would update the Status directly, we would not be able to remove on_started completely since the end user could still manually trigger a bootstrap, and we would not be able to remove on_finish at all since there is currently no way to detect a bootstrap has finished outside exploring query_finished or query_timeout. And since a bootstrap can also fail immediately, we have to handle that there.

I agree that it would feel better to have a passive way to learn that a bootstrap did start or finish but I don't see how to implement that in a reasonably simple manner.

The reason we need to know if a bootstrap as started or finished is because we don't want to cascade bootstrap requests. When a bootstrap is triggered (no matter if it was automatic, periodic or manual), we reset the automatic and periodic timer to their initial value.

@stormshield-frb stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch 2 times, most recently from 67e9aca to ec9898f Compare May 2, 2024 07:42
@guillaumemichel
Copy link
Contributor

Not checking whether the routing table is empty would take us back to before this commit. IMO it is good that this check was introduced, allowing the bootstrap to fail fast.

If the check isn't performed, a new query for self is created, and the first time next() is called, the query state is immediately set to Finished. IMO it would be better to have an error message saying that the bootstrap failed because the routing table is empty, rather than looking for an empty OutboundQueryProgressed.

@stormshield-frb
Copy link
Contributor Author

Not checking whether the routing table is empty would take us back to before this commit. IMO it is good that this check was introduced, allowing the bootstrap to fail fast.

If the check isn't performed, a new query for self is created, and the first time next() is called, the query state is immediately set to Finished. IMO it would be better to have an error message saying that the bootstrap failed because the routing table is empty, rather than looking for an empty OutboundQueryProgressed.

I'm not sure about that. If a check is needed when the routing table is empty, why is it not done for the other queries like get_closest_peers, get_record or get_providers ? In those cases, a Query is always started however it is possible no actual query is emitted on the wire. Sure get_record and get_providers can have information from their local cache, but get_closest_peers does not.

@guillaumemichel
Copy link
Contributor

I agree that the empty routing table check should be consistent with other queries, and it would even be better if the check was unified (e.g when creating the query?). But this can be left for a follow up PR.

Copy link
Contributor

mergify bot commented Jun 12, 2024

This pull request has merge conflicts. Could you please resolve them @stormshield-frb? 🙏

@stormshield-frb stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch 3 times, most recently from 345424b to 2922bde Compare June 13, 2024 10:11
@stormshield-frb stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch from 2922bde to f1dfb2b Compare June 13, 2024 10:12
@jxs jxs added this to the v0.54.0 milestone Jun 13, 2024
@guillaumemichel
Copy link
Contributor

@stormshield-frb what do you think of b548fc7? IMO it is a slightly cleaner, because if the routing table is empty, we simply reset the timers, and we don't need to increase and then decrease the count of bootstrap requests, and we don't need to wake the waker.

If you disagree, we can revert to the last commit.

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks François, and Gui for the review!

@jxs jxs added the send-it label Jun 18, 2024
@mergify mergify bot merged commit 32e917f into libp2p:master Jun 18, 2024
72 checks passed
@stormshield-frb stormshield-frb deleted the fix/automatic-bootstrap-bug branch August 28, 2024 09:42
TimTinkers pushed a commit to unattended-backpack/rust-libp2p that referenced this pull request Sep 14, 2024
After testing `master`, we encountered a bug due to libp2p#4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I though there was no need to call the `bootstrap_status.on_started` method. But not doing so never resets the periodic timer inside `bootstrap_status` resulting in getting stuck to try to bootstrap every time `poll` is called on `kad::Behaviour`.

Pull-Request: libp2p#5349.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants