Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(kad): add refresh_interval config used to poll bootstrap #4838

Merged
merged 45 commits into from
Mar 8, 2024

Conversation

PanGan21
Copy link
Contributor

@PanGan21 PanGan21 commented Nov 12, 2023

Description

Previously, users were responsible for calling bootstrap on an interval. This was documented but hard to discover for people new to the library. To maintain healthy routing tables, it is advised to regularly call bootstrap. By default, we will now do this automatically every 5 minutes and once we add a peer to our routing table, assuming we didn't bootstrap yet. This is especially useful as part of nodes starting up and connecting to bootstrap nodes.

Closes: #4730.

Attributions

Co-authored-by: stormshield-frb [email protected]

Notes & open questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@PanGan21 PanGan21 changed the title feat(kad): Add config option for refresh interval used to poll Kademlia::bootstrap feat(kad): add config option for refresh interval used to poll Kademlia::bootstrap Nov 12, 2023
@PanGan21 PanGan21 changed the title feat(kad): add config option for refresh interval used to poll Kademlia::bootstrap feat(kad): add refresh_interval config used to poll Kademlia::bootstrap Nov 12, 2023
@PanGan21 PanGan21 changed the title feat(kad): add refresh_interval config used to poll Kademlia::bootstrap feat(kad): add refresh_interval config used to poll bootstrap Nov 12, 2023
Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on! I've left some feedback :)

protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/kad/CHANGELOG.md Outdated Show resolved Hide resolved
protocols/kad/CHANGELOG.md Outdated Show resolved Hide resolved
protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/kad/src/behaviour.rs Show resolved Hide resolved
PanGan21 and others added 2 commits November 12, 2023 22:51
Co-authored-by: Thomas Eizinger <[email protected]>
Co-authored-by: Thomas Eizinger <[email protected]>
@PanGan21 PanGan21 marked this pull request as draft November 12, 2023 21:52
@dariusc93
Copy link
Member

Looks like a good start to me. Correct me if im wrong but should we only poll the timer if Behaviour::bootstrap is successfully called?

@thomaseizinger
Copy link
Contributor

Correct me if im wrong but should we only poll the timer if Behaviour::bootstrap is successfully called?

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

@thomaseizinger
Copy link
Contributor

In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

cc @mxinden

@dariusc93
Copy link
Member

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

I do wonder if it should be done automatically instead of waiting for it to be called, given that the spec does mention doing it at the start, so maybe it could be made a configurable option, although the developer loses out on flexibility on making that call themselves. Regardless, in its current state, a successful call to Behaviour::bootstrap should, in my opinion, trigger the timer to start rather than for the node to instantly start the timer upon enabling the behaviour, especially if the node does not wish to bootstrap right at that moment (and not sure if its an actual requirement for the node to bootstrap, but maybe @mxinden can provide insight on that). It might even be better to start/reset the timer after the query itself has been completed.

@thomaseizinger
Copy link
Contributor

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?
Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.
If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

I do wonder if it should be done automatically instead of waiting for it to be called, given that the spec does mention doing it at the start, so maybe it could be made a configurable option, although the developer loses out on flexibility on making that call themselves. Regardless, in its current state, a successful call to Behaviour::bootstrap should, in my opinion, trigger the timer to start rather than for the node to instantly start the timer upon enabling the behaviour, especially if the node does not wish to bootstrap right at that moment (and not sure if its an actual requirement for the node to bootstrap, but maybe @mxinden can provide insight on that). It might even be better to start/reset the timer after the query itself has been completed.

Those are good points. How about the following:

  • We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.
  • We remember the query ID of this bootstrap within the Behaviour.
  • Once if finishes, we reset the timer. We can tell that this was a "regular" bootstrap by comparing it against the stored ID. We may want to not report any events generated by this regular bootstrap.

For now, I'd leave the current bootstrap untouched but I think it is worth considering to deprecate it in favor of start_regular_bootstrap.

misc/server/src/main.rs Outdated Show resolved Hide resolved
protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved
@PanGan21 PanGan21 marked this pull request as ready for review November 14, 2023 09:39
@mxinden
Copy link
Member

mxinden commented Nov 14, 2023

Today users forget to (a) call bootstrapp at startup and (b) call bootstrap continuously thereafter. In my eyes in most cases one should do both (a) and (b). Thus I believe it should be the default behavior and the goal of #4730.

  • We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.

That would again require users to call a method in the default case which they will likely forget.

How about triggering the first bootstrap once the first node is added?

For now, I'd leave the current bootstrap untouched

I am fine with leaving the existing fn bootstrap` untouched and allowing users to disable the automatic reoccurring interval.

@dariusc93
Copy link
Member

Today users forget to (a) call bootstrapp at startup and (b) call bootstrap continuously thereafter. In my eyes in most cases one should do both (a) and (b). Thus I believe it should be the default behavior and the goal of #4730.

  • We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.

That would again require users to call a method in the default case which they will likely forget.

How about triggering the first bootstrap once the first node is added?

For now, I'd leave the current bootstrap untouched

I am fine with leaving the existing fn bootstrap` untouched and allowing users to disable the automatic reoccurring interval.

I will emit that I would be guilty of (b), though for (a), the only times I find myself not bootstrapping is when I wish to only query peers added to the routing table manually. I take it that that (a) should be done when one peer is added regardless and not wait around until a later point? Are there any implications from not bootstrapping in the first place even when its only one peer added (with that peer not containing any other peers in its table)?

@thomaseizinger
Copy link
Contributor

To quote @mxinden from an earlier conversation: A bootstrap is just a query. If you want to fire a query asap on startup, fire it. A bootstrap just fills up the buckets to make future queries faster. If you already have a query you want the result for, running it will give you a quicker results than first waiting for the bootstrap!

@stormshield-frb
Copy link
Contributor

Hi everyone. Sorry to barge in 3 days later, but we just saw this PR and might have some interesting thoughts to share.

We totally agree with the initial issue that it would be a great thing for the end user if the Kademlia behaviour handled bootstrapping on its own. (Why not also call it refresh, it might make more sense, but we don't have a particular fixed idea on this matter).

However, we think it might be a good idea to have a system a little bit more evolved than just bootstrapping based on an interval. Moreover, there is already in the mDNS behaviour, a system allowing to create many packets at first and gradually reaching a cruise interval (protocols/mdns/src/behaviour/iface.rs:ProbeState). Why not do something like that?

We did and are currently using a specific development to handle bootstrap in a more evolved manner, so we would be more than happy to help on this subject, and why not upstreaming it into the libp2p code base if you want.

Our bootstrap is composed of several "components" and we are not saying that everything should be integrated in the libp2p code base. The only purpose of the following description of our implementation is to be a starting point of an open discussion around this issue.

  1. our bootstrap system is based on an IncrementedStream which is almost the same thing as an exponential backoff. The only difference is that for example, between 1 and 16, increments for an exponential backoff would be 1, 2, 4, 8, 16 and that for our IncrementedStream, increments are 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 8, 8, 16 (i.e. 1(x5), 2(x4), 4(x3), 8(x2), 16(x1)). So it just means that we are multiplying the lowest increments of the exponential backoff to be able to run more often at these durations. So our bootstrap process is very frequent at peer startup and will then gradually slow down to reach the final interval (almost like the mdns ProbeState).

  2. we also integrated a system allowing us to reset this timer in the event of a huge network perturbation. On each FromSwarm::ConnectionEstablished and FromSwarm::ConnectionClosed, we store the number of connected_peers and try to determine if the most recent value of connected_peers imply an important change. To be more clear, losing connection to 4 peers does not have the same meaning nor the same impact if we previously were connected to 6 peers or 50 peers. To measure the importance of the change, we verify for how long we had a "stable network" (i.e. if the number of connected_peers was roughly stable during a specified amount of time), and then we verify if the current number of connected_peers represent an important variation based on the "average" number of connected_peers during the last N minutes. If an important change is detected, the timer is reset, and we restart our rapid bootstrap.

If you find this interesting, I will gladly elaborate and share code samples.

@thomaseizinger
Copy link
Contributor

Thank you for the input @stormshield-frb ! Whilst interesting, I am not sure we need an elaborate algorithm like this as a default.

If we keep the current bootstrap method and invoke it on a configurable timer that can be disabled, it should be easy enough to implement different bootstrap schemes if a simple timer is not sufficient. At the same time, having a timer that runs out-of-the-box makes libp2p-kad more useful for people getting started with it.

This follows the mantra of: "make the easy things easy, make the hard things possible".

Would that work for everybody?

@stormshield-frb
Copy link
Contributor

Hi @thomaseizinger, you're welcome 😉 Glad you found it interesting. Indeed, it might be too much by default, and I do agree with your mantra.

However, quoting you from an earlier comment:

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

Doing something like it is currently done in mdns would help in this matter. We don't have to do what I described (custom exponential backoff), a simple exponential backoff like it is done in mdns with ProbeState would probably help a lot. Maybe something simple like this could be considered?

For the metric part (number of connected_peers) I will not strongly advocate for it right now because this feature is still quite new in our code base, so we don't have as much background to assert it is efficient. We will keep using this in the near future, but I totally understand your point of view. The more complex the algorithm, the more error-prone it is 😂 If we see amazing results in the next several months, we will consider upstreaming it in a new PR 😉

I am fine with leaving the existing fn bootstrap untouched and allowing users to disable the automatic reoccurring interval.

As @mxinden said, we would gladly appreciate it if the current fn bootstrap was left untouched since it would allow, like you said, to "make hard things possible".

@thomaseizinger
Copy link
Contributor

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.
If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

Doing something like it is currently done in mdns would help in this matter. We don't have to do what I described (custom exponential backoff), a simple exponential backoff like it is done in mdns with ProbeState would probably help a lot. Maybe something simple like this could be considered?

Yes we definitely need to handle that. I think to start with, it is probably easiest to have a simple state variable that tracks, whether we've already successfully run at least 1 bootstrap.

Upon every new connection, we can then initiate a bootstrap if we haven't successfully bootstrapped yet (we should also track whether there is currently one in progress). It could be that the first peer we connect to does not support kademlia so we should not give up after the first peer.

Something like:

enum InitialBootstrap {
	Active(QueryId),
	Completed
}

and:

initial_bootstrap: Option<InitialBootstrap>

If the bootstrap fails, we set it back to None.

What do people think?

@stormshield-frb
Copy link
Contributor

have a simple state variable that tracks, whether we've already successfully run at least 1 bootstrap.

Absolutely. This is a good idea. We actually do that in our code base.

However, it raises the question : what should we do if the first bootstrap fail ? Should be try an other bootstrap right after ? Some time after ? What if this one fails too ?

In my opinion, this situation conforms precisely to the recommended practices of using exponential backoff. What do you think ?

@PanGan21
Copy link
Contributor Author

Hi @thomaseizinger ! Fyi, one more update was pushed thanks to @stormshield-frb . I also fixed conflicts, although some pipelines are not passing due to master branch.

@thomaseizinger
Copy link
Contributor

Awesome, I'll take a look sometime later this week!

Copy link
Contributor

mergify bot commented Feb 29, 2024

This pull request has merge conflicts. Could you please resolve them @PanGan21? 🙏

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, thank you for adding all these tests and sorry that this has been dragging on for so long.

Just one request regarding the version and changelog entry, otherwise ACK.

Cargo.toml Outdated Show resolved Hide resolved
@thomaseizinger
Copy link
Contributor

@jxs Feel free to approve and merge this once the version and changelog has been corrected.

thomaseizinger
thomaseizinger previously approved these changes Mar 6, 2024
jxs
jxs previously approved these changes Mar 7, 2024
Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the ping Thomas, LGTM only one minor doc remark
Thanks Panagiotis!

protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved
@mergify mergify bot dismissed stale reviews from jxs and thomaseizinger March 7, 2024 11:35

Approvals have been dismissed because the PR was updated after the send-it label was applied.

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mergify mergify bot merged commit 6a91617 into libp2p:master Mar 8, 2024
70 of 72 checks passed
@PanGan21
Copy link
Contributor Author

PanGan21 commented Mar 8, 2024

@thomaseizinger @stormshield-frb
Hey guys exciting that we managed to merged this! Thanks a lot for the collaboration and your patience! 🚀

guillaumemichel pushed a commit that referenced this pull request Mar 28, 2024
Previously, users were responsible for calling `bootstrap` on an interval. This was documented but hard to discover for people new to the library. To maintain healthy routing tables, it is advised to regularly call `bootstrap`. By default, we will now do this automatically every 5 minutes and once we add a peer to our routing table, assuming we didn't bootstrap yet. This is especially useful as part of nodes starting up and connecting to bootstrap nodes.

Closes: #4730.

Pull-Request: #4838.

Co-authored-by: stormshield-frb <[email protected]>
mergify bot pushed a commit that referenced this pull request Jun 18, 2024
After testing `master`, we encountered a bug due to #4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I though there was no need to call the `bootstrap_status.on_started` method. But not doing so never resets the periodic timer inside `bootstrap_status` resulting in getting stuck to try to bootstrap every time `poll` is called on `kad::Behaviour`.

Pull-Request: #5349.
mergify bot pushed a commit that referenced this pull request Jul 4, 2024
As discussed in the last maintainer call, some improvements are probably necessary for the automatic bootstrap feature (introduced by #4838). Indeed, like @drHuangMHT mentioned in #5341 and like @guillaumemichel has agreed, triggering a bootstrap every time an update happens inside the routing table consumes a lot more resources.

The idea behind the automatic bootstrap feature it that, when a peer is starting, if a routing table update happens we probably don't want to wait for the periodic bootstrap to trigger and we want to trigger it right now. However, like @guillaumemichel said, this is something we want to do at startup or when a network connectivity problem happens, we don't want to do that all the time.

This PR is a proposal to trigger automatically a bootstrap on routing table update but only when we have less that `K_VALUE` peers in it (meaning that we are starting up or something went wrong and the fact that a new peer is inserted is probably a sign that the network connectivity issue is resolved).

I have also added a new triggering condition like mentioned in the maintainer call. When discovering a new listen address and if we have no connected peers, we trigger a bootstrap. This condition is based on our own experience at Stormshield : some peers were starting before the network interfaces were up, doing so, the automatic and periodic bootstrap failed, but when the network interfaces were finally up, we were waiting X minutes for the periodic bootstrap to actually trigger a bootstrap and join the p2p network.

Pull-Request: #5474.
TimTinkers pushed a commit to unattended-backpack/rust-libp2p that referenced this pull request Sep 14, 2024
After testing `master`, we encountered a bug due to libp2p#4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I though there was no need to call the `bootstrap_status.on_started` method. But not doing so never resets the periodic timer inside `bootstrap_status` resulting in getting stuck to try to bootstrap every time `poll` is called on `kad::Behaviour`.

Pull-Request: libp2p#5349.
TimTinkers pushed a commit to unattended-backpack/rust-libp2p that referenced this pull request Sep 14, 2024
As discussed in the last maintainer call, some improvements are probably necessary for the automatic bootstrap feature (introduced by libp2p#4838). Indeed, like @drHuangMHT mentioned in libp2p#5341 and like @guillaumemichel has agreed, triggering a bootstrap every time an update happens inside the routing table consumes a lot more resources.

The idea behind the automatic bootstrap feature it that, when a peer is starting, if a routing table update happens we probably don't want to wait for the periodic bootstrap to trigger and we want to trigger it right now. However, like @guillaumemichel said, this is something we want to do at startup or when a network connectivity problem happens, we don't want to do that all the time.

This PR is a proposal to trigger automatically a bootstrap on routing table update but only when we have less that `K_VALUE` peers in it (meaning that we are starting up or something went wrong and the fact that a new peer is inserted is probably a sign that the network connectivity issue is resolved).

I have also added a new triggering condition like mentioned in the maintainer call. When discovering a new listen address and if we have no connected peers, we trigger a bootstrap. This condition is based on our own experience at Stormshield : some peers were starting before the network interfaces were up, doing so, the automatic and periodic bootstrap failed, but when the network interfaces were finally up, we were waiting X minutes for the periodic bootstrap to actually trigger a bootstrap and join the p2p network.

Pull-Request: libp2p#5474.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(kad): add config option for bootstrap interval
6 participants