feat(kad): add `refresh_interval` config used to poll `bootstrap` #4838

PanGan21 · 2023-11-12T16:05:07Z

Description

Previously, users were responsible for calling bootstrap on an interval. This was documented but hard to discover for people new to the library. To maintain healthy routing tables, it is advised to regularly call bootstrap. By default, we will now do this automatically every 5 minutes and once we add a peer to our routing table, assuming we didn't bootstrap yet. This is especially useful as part of nodes starting up and connecting to bootstrap nodes.

Closes: #4730.

Attributions

Co-authored-by: stormshield-frb [email protected]

Notes & open questions

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
A changelog entry has been made in the appropriate crates

thomaseizinger

Thanks for taking this on! I've left some feedback :)

protocols/kad/src/behaviour.rs

protocols/kad/CHANGELOG.md

protocols/kad/src/behaviour.rs

Co-authored-by: Thomas Eizinger <[email protected]>

dariusc93 · 2023-11-13T01:56:45Z

Looks like a good start to me. Correct me if im wrong but should we only poll the timer if Behaviour::bootstrap is successfully called?

thomaseizinger · 2023-11-13T02:04:29Z

Correct me if im wrong but should we only poll the timer if Behaviour::bootstrap is successfully called?

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

thomaseizinger · 2023-11-13T02:04:52Z

In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

cc @mxinden

dariusc93 · 2023-11-13T22:25:40Z

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

I do wonder if it should be done automatically instead of waiting for it to be called, given that the spec does mention doing it at the start, so maybe it could be made a configurable option, although the developer loses out on flexibility on making that call themselves. Regardless, in its current state, a successful call to Behaviour::bootstrap should, in my opinion, trigger the timer to start rather than for the node to instantly start the timer upon enabling the behaviour, especially if the node does not wish to bootstrap right at that moment (and not sure if its an actual requirement for the node to bootstrap, but maybe @mxinden can provide insight on that). It might even be better to start/reset the timer after the query itself has been completed.

thomaseizinger · 2023-11-14T00:16:44Z

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?
Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.
If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

I do wonder if it should be done automatically instead of waiting for it to be called, given that the spec does mention doing it at the start, so maybe it could be made a configurable option, although the developer loses out on flexibility on making that call themselves. Regardless, in its current state, a successful call to Behaviour::bootstrap should, in my opinion, trigger the timer to start rather than for the node to instantly start the timer upon enabling the behaviour, especially if the node does not wish to bootstrap right at that moment (and not sure if its an actual requirement for the node to bootstrap, but maybe @mxinden can provide insight on that). It might even be better to start/reset the timer after the query itself has been completed.

Those are good points. How about the following:

We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.
We remember the query ID of this bootstrap within the Behaviour.
Once if finishes, we reset the timer. We can tell that this was a "regular" bootstrap by comparing it against the stored ID. We may want to not report any events generated by this regular bootstrap.

For now, I'd leave the current bootstrap untouched but I think it is worth considering to deprecate it in favor of start_regular_bootstrap.

misc/server/src/main.rs

protocols/kad/src/behaviour.rs

mxinden · 2023-11-14T20:14:10Z

Today users forget to (a) call bootstrapp at startup and (b) call bootstrap continuously thereafter. In my eyes in most cases one should do both (a) and (b). Thus I believe it should be the default behavior and the goal of #4730.

We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.

That would again require users to call a method in the default case which they will likely forget.

How about triggering the first bootstrap once the first node is added?

For now, I'd leave the current bootstrap untouched

I am fine with leaving the existing fn bootstrap` untouched and allowing users to disable the automatic reoccurring interval.

dariusc93 · 2023-11-14T20:44:18Z

Today users forget to (a) call bootstrapp at startup and (b) call bootstrap continuously thereafter. In my eyes in most cases one should do both (a) and (b). Thus I believe it should be the default behavior and the goal of #4730.

We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.

That would again require users to call a method in the default case which they will likely forget.

How about triggering the first bootstrap once the first node is added?

For now, I'd leave the current bootstrap untouched

I am fine with leaving the existing fn bootstrap` untouched and allowing users to disable the automatic reoccurring interval.

I will emit that I would be guilty of (b), though for (a), the only times I find myself not bootstrapping is when I wish to only query peers added to the routing table manually. I take it that that (a) should be done when one peer is added regardless and not wait around until a later point? Are there any implications from not bootstrapping in the first place even when its only one peer added (with that peer not containing any other peers in its table)?

thomaseizinger · 2023-11-14T21:14:12Z

To quote @mxinden from an earlier conversation: A bootstrap is just a query. If you want to fire a query asap on startup, fire it. A bootstrap just fills up the buckets to make future queries faster. If you already have a query you want the result for, running it will give you a quicker results than first waiting for the bootstrap!

stormshield-frb · 2023-11-15T17:05:54Z

Hi everyone. Sorry to barge in 3 days later, but we just saw this PR and might have some interesting thoughts to share.

We totally agree with the initial issue that it would be a great thing for the end user if the Kademlia behaviour handled bootstrapping on its own. (Why not also call it refresh, it might make more sense, but we don't have a particular fixed idea on this matter).

However, we think it might be a good idea to have a system a little bit more evolved than just bootstrapping based on an interval. Moreover, there is already in the mDNS behaviour, a system allowing to create many packets at first and gradually reaching a cruise interval (protocols/mdns/src/behaviour/iface.rs:ProbeState). Why not do something like that?

We did and are currently using a specific development to handle bootstrap in a more evolved manner, so we would be more than happy to help on this subject, and why not upstreaming it into the libp2p code base if you want.

Our bootstrap is composed of several "components" and we are not saying that everything should be integrated in the libp2p code base. The only purpose of the following description of our implementation is to be a starting point of an open discussion around this issue.

our bootstrap system is based on an IncrementedStream which is almost the same thing as an exponential backoff. The only difference is that for example, between 1 and 16, increments for an exponential backoff would be 1, 2, 4, 8, 16 and that for our IncrementedStream, increments are 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 8, 8, 16 (i.e. 1(x5), 2(x4), 4(x3), 8(x2), 16(x1)). So it just means that we are multiplying the lowest increments of the exponential backoff to be able to run more often at these durations. So our bootstrap process is very frequent at peer startup and will then gradually slow down to reach the final interval (almost like the mdns ProbeState).
we also integrated a system allowing us to reset this timer in the event of a huge network perturbation. On each FromSwarm::ConnectionEstablished and FromSwarm::ConnectionClosed, we store the number of connected_peers and try to determine if the most recent value of connected_peers imply an important change. To be more clear, losing connection to 4 peers does not have the same meaning nor the same impact if we previously were connected to 6 peers or 50 peers. To measure the importance of the change, we verify for how long we had a "stable network" (i.e. if the number of connected_peers was roughly stable during a specified amount of time), and then we verify if the current number of connected_peers represent an important variation based on the "average" number of connected_peers during the last N minutes. If an important change is detected, the timer is reset, and we restart our rapid bootstrap.

If you find this interesting, I will gladly elaborate and share code samples.

thomaseizinger · 2023-11-16T00:10:41Z

Thank you for the input @stormshield-frb ! Whilst interesting, I am not sure we need an elaborate algorithm like this as a default.

If we keep the current bootstrap method and invoke it on a configurable timer that can be disabled, it should be easy enough to implement different bootstrap schemes if a simple timer is not sufficient. At the same time, having a timer that runs out-of-the-box makes libp2p-kad more useful for people getting started with it.

This follows the mantra of: "make the easy things easy, make the hard things possible".

Would that work for everybody?

stormshield-frb · 2023-11-16T09:14:47Z

Hi @thomaseizinger, you're welcome 😉 Glad you found it interesting. Indeed, it might be too much by default, and I do agree with your mantra.

However, quoting you from an earlier comment:

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

Doing something like it is currently done in mdns would help in this matter. We don't have to do what I described (custom exponential backoff), a simple exponential backoff like it is done in mdns with ProbeState would probably help a lot. Maybe something simple like this could be considered?

For the metric part (number of connected_peers) I will not strongly advocate for it right now because this feature is still quite new in our code base, so we don't have as much background to assert it is efficient. We will keep using this in the near future, but I totally understand your point of view. The more complex the algorithm, the more error-prone it is 😂 If we see amazing results in the next several months, we will consider upstreaming it in a new PR 😉

I am fine with leaving the existing fn bootstrap untouched and allowing users to disable the automatic reoccurring interval.

As @mxinden said, we would gladly appreciate it if the current fn bootstrap was left untouched since it would allow, like you said, to "make hard things possible".

thomaseizinger · 2023-11-16T22:15:49Z

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.
If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

Doing something like it is currently done in mdns would help in this matter. We don't have to do what I described (custom exponential backoff), a simple exponential backoff like it is done in mdns with ProbeState would probably help a lot. Maybe something simple like this could be considered?

Yes we definitely need to handle that. I think to start with, it is probably easiest to have a simple state variable that tracks, whether we've already successfully run at least 1 bootstrap.

Upon every new connection, we can then initiate a bootstrap if we haven't successfully bootstrapped yet (we should also track whether there is currently one in progress). It could be that the first peer we connect to does not support kademlia so we should not give up after the first peer.

Something like:

enum InitialBootstrap {
	Active(QueryId),
	Completed
}

and:

initial_bootstrap: Option<InitialBootstrap>

If the bootstrap fails, we set it back to None.

What do people think?

stormshield-frb · 2023-11-20T08:41:21Z

have a simple state variable that tracks, whether we've already successfully run at least 1 bootstrap.

Absolutely. This is a good idea. We actually do that in our code base.

However, it raises the question : what should we do if the first bootstrap fail ? Should be try an other bootstrap right after ? Some time after ? What if this one fails too ?

In my opinion, this situation conforms precisely to the recommended practices of using exponential backoff. What do you think ?

PanGan21 · 2024-02-20T19:40:41Z

Hi @thomaseizinger ! Fyi, one more update was pushed thanks to @stormshield-frb . I also fixed conflicts, although some pipelines are not passing due to master branch.

thomaseizinger · 2024-02-25T03:37:02Z

Awesome, I'll take a look sometime later this week!

mergify · 2024-02-29T19:30:35Z

This pull request has merge conflicts. Could you please resolve them @PanGan21? 🙏

thomaseizinger

Excellent work, thank you for adding all these tests and sorry that this has been dragging on for so long.

Just one request regarding the version and changelog entry, otherwise ACK.

Cargo.toml

thomaseizinger · 2024-03-06T09:47:45Z

@jxs Feel free to approve and merge this once the version and changelog has been corrected.

jxs

thanks for the ping Thomas, LGTM only one minor doc remark
Thanks Panagiotis!

protocols/kad/src/behaviour.rs

Approvals have been dismissed because the PR was updated after the send-it label was applied.

jxs

Thanks!

PanGan21 · 2024-03-08T15:19:05Z

@thomaseizinger @stormshield-frb
Hey guys exciting that we managed to merged this! Thanks a lot for the collaboration and your patience! 🚀

Previously, users were responsible for calling `bootstrap` on an interval. This was documented but hard to discover for people new to the library. To maintain healthy routing tables, it is advised to regularly call `bootstrap`. By default, we will now do this automatically every 5 minutes and once we add a peer to our routing table, assuming we didn't bootstrap yet. This is especially useful as part of nodes starting up and connecting to bootstrap nodes. Closes: #4730. Pull-Request: #4838. Co-authored-by: stormshield-frb <[email protected]>

After testing `master`, we encountered a bug due to #4838 when doing automatic or periodic bootstrap if the node has no known peers. Since it failed immediately, I though there was no need to call the `bootstrap_status.on_started` method. But not doing so never resets the periodic timer inside `bootstrap_status` resulting in getting stuck to try to bootstrap every time `poll` is called on `kad::Behaviour`. Pull-Request: #5349.

@drHuangMHT

As discussed in the last maintainer call, some improvements are probably necessary for the automatic bootstrap feature (introduced by #4838). Indeed, like @drHuangMHT mentioned in #5341 and like @guillaumemichel has agreed, triggering a bootstrap every time an update happens inside the routing table consumes a lot more resources. The idea behind the automatic bootstrap feature it that, when a peer is starting, if a routing table update happens we probably don't want to wait for the periodic bootstrap to trigger and we want to trigger it right now. However, like @guillaumemichel said, this is something we want to do at startup or when a network connectivity problem happens, we don't want to do that all the time. This PR is a proposal to trigger automatically a bootstrap on routing table update but only when we have less that `K_VALUE` peers in it (meaning that we are starting up or something went wrong and the fact that a new peer is inserted is probably a sign that the network connectivity issue is resolved). I have also added a new triggering condition like mentioned in the maintainer call. When discovering a new listen address and if we have no connected peers, we trigger a bootstrap. This condition is based on our own experience at Stormshield : some peers were starting before the network interfaces were up, doing so, the automatic and periodic bootstrap failed, but when the network interfaces were finally up, we were waiting X minutes for the periodic bootstrap to actually trigger a bootstrap and join the p2p network. Pull-Request: #5474.

After testing `master`, we encountered a bug due to libp2p#4838 when doing automatic or periodic bootstrap if the node has no known peers. Since it failed immediately, I though there was no need to call the `bootstrap_status.on_started` method. But not doing so never resets the periodic timer inside `bootstrap_status` resulting in getting stuck to try to bootstrap every time `poll` is called on `kad::Behaviour`. Pull-Request: libp2p#5349.

@drHuangMHT

As discussed in the last maintainer call, some improvements are probably necessary for the automatic bootstrap feature (introduced by libp2p#4838). Indeed, like @drHuangMHT mentioned in libp2p#5341 and like @guillaumemichel has agreed, triggering a bootstrap every time an update happens inside the routing table consumes a lot more resources. The idea behind the automatic bootstrap feature it that, when a peer is starting, if a routing table update happens we probably don't want to wait for the periodic bootstrap to trigger and we want to trigger it right now. However, like @guillaumemichel said, this is something we want to do at startup or when a network connectivity problem happens, we don't want to do that all the time. This PR is a proposal to trigger automatically a bootstrap on routing table update but only when we have less that `K_VALUE` peers in it (meaning that we are starting up or something went wrong and the fact that a new peer is inserted is probably a sign that the network connectivity issue is resolved). I have also added a new triggering condition like mentioned in the maintainer call. When discovering a new listen address and if we have no connected peers, we trigger a bootstrap. This condition is based on our own experience at Stormshield : some peers were starting before the network interfaces were up, doing so, the automatic and periodic bootstrap failed, but when the network interfaces were finally up, we were waiting X minutes for the periodic bootstrap to actually trigger a bootstrap and join the p2p network. Pull-Request: libp2p#5474.

add poll function in kademlia behaviour that polls bootstrap function

dc770aa

PanGan21 changed the title ~~feat(kad): Add config option for refresh interval used to poll Kademlia::bootstrap~~ feat(kad): add config option for refresh interval used to poll Kademlia::bootstrap Nov 12, 2023

PanGan21 changed the title ~~feat(kad): add config option for refresh interval used to poll Kademlia::bootstrap~~ feat(kad): add refresh_interval config used to poll Kademlia::bootstrap Nov 12, 2023

PanGan21 changed the title ~~feat(kad): add refresh_interval config used to poll Kademlia::bootstrap~~ feat(kad): add refresh_interval config used to poll bootstrap Nov 12, 2023

PanGan21 added 2 commits November 12, 2023 18:15

bumb kad version and add changelog

30559da

bumb server version and add server changelog

a7fcd26

thomaseizinger reviewed Nov 12, 2023

View reviewed changes

PanGan21 and others added 2 commits November 12, 2023 22:51

Fix doc comments

695b94a

Co-authored-by: Thomas Eizinger <[email protected]>

Rename to bootstrap_interval

0b54f9a

Co-authored-by: Thomas Eizinger <[email protected]>

PanGan21 marked this pull request as draft November 12, 2023 21:52

PanGan21 and others added 2 commits November 12, 2023 23:12

Update protocols/kad/src/behaviour.rs

d749cbd

Co-authored-by: Thomas Eizinger <[email protected]>

fix terminology across behaviour

7f720ce

thomaseizinger reviewed Nov 14, 2023

View reviewed changes

misc/server/src/main.rs Outdated Show resolved Hide resolved

protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved

PanGan21 added 3 commits November 14, 2023 10:44

move polling into the internal poll function

3a7737c

fix conflicts

ac0fdd6

do not reference private fields in doc

7343bb2

PanGan21 marked this pull request as ready for review November 14, 2023 09:39

satisfy clippy

fcafa60

Merge branch 'master' into kademlia-poll-bootstrap

cc2f533

fix conflicts

89b011a

thomaseizinger reviewed Mar 6, 2024

View reviewed changes

Cargo.toml Outdated Show resolved Hide resolved

PanGan21 and others added 2 commits March 6, 2024 12:07

fix: use current kad version

245c2c8

Merge branch 'master' into kademlia-poll-bootstrap

3000dd1

thomaseizinger previously approved these changes Mar 6, 2024

View reviewed changes

thomaseizinger added the send-it label Mar 6, 2024

jxs previously approved these changes Mar 7, 2024

View reviewed changes

protocols/kad/src/behaviour.rs Outdated Show resolved Hide resolved

reference automatic_bootstrap_throttle in the docs

5f8ef81

PanGan21 requested review from thomaseizinger, jxs and mxinden March 7, 2024 11:46

fix bootstrap doc to omit private config

a58ed37

jxs approved these changes Mar 8, 2024

View reviewed changes

mergify bot merged commit 6a91617 into libp2p:master Mar 8, 2024
70 of 72 checks passed

stormshield-frb mentioned this pull request Apr 30, 2024

fix(kad): correctly handle NoKnownPeers error when bootstrap #5349

Merged

4 tasks

stormshield-frb mentioned this pull request Jun 18, 2024

feat(kad): improve automatic bootstrap #5474

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kad): add `refresh_interval` config used to poll `bootstrap` #4838

feat(kad): add `refresh_interval` config used to poll `bootstrap` #4838

PanGan21 commented Nov 12, 2023 •

edited by thomaseizinger

Loading

thomaseizinger left a comment

dariusc93 commented Nov 13, 2023

thomaseizinger commented Nov 13, 2023

thomaseizinger commented Nov 13, 2023

dariusc93 commented Nov 13, 2023

thomaseizinger commented Nov 14, 2023

mxinden commented Nov 14, 2023

dariusc93 commented Nov 14, 2023

thomaseizinger commented Nov 14, 2023

stormshield-frb commented Nov 15, 2023

thomaseizinger commented Nov 16, 2023

stormshield-frb commented Nov 16, 2023

thomaseizinger commented Nov 16, 2023

stormshield-frb commented Nov 20, 2023

PanGan21 commented Feb 20, 2024

thomaseizinger commented Feb 25, 2024

mergify bot commented Feb 29, 2024

thomaseizinger left a comment

thomaseizinger commented Mar 6, 2024

jxs left a comment

jxs left a comment

PanGan21 commented Mar 8, 2024

feat(kad): add refresh_interval config used to poll bootstrap #4838

feat(kad): add refresh_interval config used to poll bootstrap #4838

Conversation

PanGan21 commented Nov 12, 2023 • edited by thomaseizinger Loading

Description

Attributions

Notes & open questions

Change checklist

thomaseizinger left a comment

Choose a reason for hiding this comment

dariusc93 commented Nov 13, 2023

thomaseizinger commented Nov 13, 2023

thomaseizinger commented Nov 13, 2023

dariusc93 commented Nov 13, 2023

thomaseizinger commented Nov 14, 2023

mxinden commented Nov 14, 2023

dariusc93 commented Nov 14, 2023

thomaseizinger commented Nov 14, 2023

stormshield-frb commented Nov 15, 2023

thomaseizinger commented Nov 16, 2023

stormshield-frb commented Nov 16, 2023

thomaseizinger commented Nov 16, 2023

stormshield-frb commented Nov 20, 2023

PanGan21 commented Feb 20, 2024

thomaseizinger commented Feb 25, 2024

mergify bot commented Feb 29, 2024

thomaseizinger left a comment

Choose a reason for hiding this comment

thomaseizinger commented Mar 6, 2024

jxs left a comment

Choose a reason for hiding this comment

jxs left a comment

Choose a reason for hiding this comment

PanGan21 commented Mar 8, 2024

feat(kad): add `refresh_interval` config used to poll `bootstrap` #4838

feat(kad): add `refresh_interval` config used to poll `bootstrap` #4838

PanGan21 commented Nov 12, 2023 •

edited by thomaseizinger

Loading