Motivation
When all network nodes are rebooted after an update, they try to sync at the same time, which makes booting slow.
Proposed design 1
Node A should reject sync requests with a structured error X when it already has more than Y nodes syncing from it. The number Y should be determined by benchmarking the syncing code. When node B receives structured error X during sync, it should try other nodes and retry, after some delay, the peers that returned structured error X. This will naturally line nodes up into a queue.
This also makes such a network easier to monitor, since it will be easier to observe why a node has not synced yet.
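The behavior described above could be sketched roughly as follows. This is only an illustration, not nearcore code: `SyncBusy` stands in for "structured error X", `limit` for Y (read here as the maximum number of concurrent syncing peers), and `SyncServer`, `SyncClient`, `retry_after`, and `pick_peer` are hypothetical names. Time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


class SyncBusy(Exception):
    """Hypothetical stand-in for "structured error X"."""

    def __init__(self, active: int, limit: int, retry_after: float):
        super().__init__(f"{active} peers already syncing (limit {limit})")
        self.active = active            # how many peers are syncing right now
        self.limit = limit              # the configured Y
        self.retry_after = retry_after  # assumed hint: seconds before retrying


@dataclass
class SyncServer:
    """Node A: tracks which peers are currently syncing from it."""

    limit: int                  # "Y"; the issue says to find it by benchmarking
    retry_after: float = 5.0    # assumed value, not from the issue
    active: set = field(default_factory=set)

    def begin_sync(self, peer_id: str) -> None:
        # Reject a new peer once the limit is reached.
        if peer_id not in self.active and len(self.active) >= self.limit:
            raise SyncBusy(len(self.active), self.limit, self.retry_after)
        self.active.add(peer_id)

    def end_sync(self, peer_id: str) -> None:
        self.active.discard(peer_id)


class SyncClient:
    """Node B: tries peers in turn and backs off from the busy ones."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.not_before = {}    # peer name -> earliest time to retry it

    def pick_peer(self, peers: dict, now: float):
        """Return the name of the first peer that accepts us, or None."""
        for name, server in peers.items():
            if now < self.not_before.get(name, 0.0):
                continue        # still backing off from this peer
            try:
                server.begin_sync(self.node_id)
                return name
            except SyncBusy as err:
                # Remember the rejection and move on to the next peer.
                self.not_before[name] = now + err.retry_after
        return None
```

With two servers of `limit=1` and three clients, the first two clients each get a server and the third gets `None` until a slot frees up and its backoff expires, which is the queueing effect the proposal is after. Because the error carries `active` and `limit`, a rejected node can also log why it has not synced yet.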
Proposed design 2
@evgenykuzyakov proposed that nearup could add a random delay before it starts the node. @nearmax's argument against it is that it will not work universally, e.g. it won't work when nodes are upgraded by the community and the NEAR Foundation does not have perfect control over when and how people start them. Besides, randomization is a heuristic that adds to the maintenance burden of the system.
Let's not conflate nearup (which is a tool to manage the node) with the behavior of the node itself. Whatever we do with nearup should be separate from nearcore. As for syncing, since a node has a limited number of peers, the number of peers that are syncing is naturally limited. Also, other than state sync (for which we already have limits), syncing is not very resource intensive, so I don't think we need to impose extra restrictions, although I do agree that limiting the number of syncing peers is a way to prevent an eclipse attack.
Let's not conflate nearup (which is a tool to manage the node) with the behavior of the node itself. Whatever we do with nearup should be separate from nearcore.
I agree, let's not add hacks to nearup, like a randomized timer, to solve node issues. The inability of the node to efficiently communicate with its peers and decide when and how to sync is a node issue.