Free more full node slots on the network #519
Comments
After upgrading to 0.9.38, one of our validators would constantly (every hour or so) get stuck at some block and stop syncing, even though it reported having 40 peers. I'm not sure if it is related to this issue: when it gets stuck, it reports a stable 40 peers and rejects all new peers with the "too many full nodes" reason. We were not able to downgrade, so the only way out was to increase the number of peers to 50. Since then we haven't experienced the sync problem anymore. One observation I had was that as long as we saw this type of mass disconnect event in the log, the node kept going, and as you can see in the log there is no such disconnect event when the node is stuck:
One explanation would be that upgraded nodes were dropping us for an unknown reason/bug, while stale nodes would keep the connection forever, which eventually results in all peer slots being saturated by stale nodes. So I'm not sure whether this might actually be a consequence of #528.
@drskalman thanks for the logs! I must admit I also had an issue with the node reporting 40 peers during gap sync (the "Block history" phase of warp sync) but making no progress (the block number currently being downloaded was stuck) on the latest master. The issue went away after I restarted the node.
* Update some docs
* Add derived account origin
* Add tests for derived origin
* Do a little bit of cleanup
* Change Origin type to use AccountIds instead of Public keys
* Update (most) tests to use new Origin types
* Remove redundant test
* Update `runtime-common` tests to use new Origin types
* Remove unused import
* Fix documentation around origin verification
* Update config types to use AccountIds in runtime
* Update Origin type used in message relay
* Use correct type when verifying message origin
* Make CallOrigin docs more consistent
* Use AccountIds instead of Public keys in Runtime types
* Introduce trait for converting AccountIds
* Bring back standalone function for deriving account IDs
* Remove AccountIdConverter configuration trait
* Remove old bridge_account_id derivation function
* Handle target ID decoding errors more gracefully
* Update message-lane to use new AccountId derivation
* Update merged code to use new Origin types
* Use explicit conversion between H256 and AccountIds
* Make relayer fund account a config option in `message-lane` pallet
* Add note about deriving the same account on different chains
* Fix test weight
* Use AccountId instead of Public key when signing Calls
* Semi-hardcode relayer fund address into Message Lane pallet
Problem statement
Due to bug #526 (fixed by paritytech/substrate#13396 and later reverted), nodes currently do not detect being kicked off by a remote peer during warp sync. Because of this, the node remains connected on other protocols, including the block request protocol, and continues syncing even though the remote node has all of its full node slots occupied. Once the warp sync is over, the local node sends a block announcement, instantly learns that the set 0 (block announcements) notification stream was closed by the remote node, and finally discovers that it was rejected by the remote node. This leads to the peer count dropping after the warp sync, as described in #528. The disconnect on the local side happens with a delay (only after sending out a block announcement), so our node still thinks it is connected to nodes that actually rejected it right after it connected to them. As a result, the reported peer count is higher than it should be, and communication can continue on non-default protocols (non-zero peersets).
After merging the fix for #526 (paritytech/substrate#13396), it turned out that the local node can't actually sync, because it now disconnects from the remote as soon as it's kicked off, and there are not enough peers to sync from. So the fix paritytech/substrate#13396 was reverted in paritytech/substrate#13409. As investigated by @altonen, our local node is kicked off because the remote has all of its full node slots occupied.
In order for sync to work after merging paritytech/substrate#13396, there should be more full node slots available on the network (see the previous "everyone is full" crisis, paritytech/substrate#12434).
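The delayed detection described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `PeerLink`, `LocalNode` and their methods are hypothetical names invented for illustration and are not types from `sc-network`. The model only captures the observable effect, namely that the reported peer count stays inflated until the first block announcement is sent.

```rust
/// Toy model of one connection as seen from the local node.
/// All names here are hypothetical and not taken from Substrate.
#[derive(Debug, Clone)]
pub struct PeerLink {
    /// The remote has closed the set 0 (block announcements) stream,
    /// i.e. it rejected us because its full node slots are occupied.
    pub remote_closed_set0: bool,
    /// The local node still believes the peer is connected.
    pub locally_connected: bool,
}

#[derive(Debug, Default)]
pub struct LocalNode {
    pub links: Vec<PeerLink>,
}

impl LocalNode {
    /// Connect to peers; `remote_rejections[i]` is true if peer `i` rejected us.
    /// With the bug, the rejection is not noticed at connection time.
    pub fn connect(remote_rejections: &[bool]) -> Self {
        let links = remote_rejections
            .iter()
            .map(|&rejected| PeerLink {
                remote_closed_set0: rejected,
                locally_connected: true,
            })
            .collect();
        LocalNode { links }
    }

    /// Peer count as reported by the local node (possibly inflated).
    pub fn reported_peers(&self) -> usize {
        self.links.iter().filter(|l| l.locally_connected).count()
    }

    /// Peers that actually accepted the local node.
    pub fn accepted_peers(&self) -> usize {
        self.links
            .iter()
            .filter(|l| l.locally_connected && !l.remote_closed_set0)
            .count()
    }

    /// Sending a block announcement reveals closed set 0 streams.
    /// Returns how many peers were dropped as a result.
    pub fn announce_block(&mut self) -> usize {
        let mut dropped = 0;
        for link in self.links.iter_mut() {
            if link.locally_connected && link.remote_closed_set0 {
                link.locally_connected = false;
                dropped += 1;
            }
        }
        dropped
    }
}
```

In this model, a node that connected to five peers of which three rejected it reports five peers during warp sync and drops to two right after its first announcement, which mirrors the peer count drop from #528.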
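To make the "all full node slots occupied" condition concrete, here is a minimal sketch of a remote node's admission check. It is an assumption-laden illustration: `FullNodeSlots`, `Admission` and the method names are hypothetical and do not reproduce the actual peerset logic in Substrate.

```rust
/// Outcome of an inbound full node trying to take a slot.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Accepted,
    TooManyFullNodes,
}

/// Fixed-size pool of full node slots on a remote node (hypothetical).
#[derive(Debug)]
pub struct FullNodeSlots {
    capacity: usize,
    occupied: usize,
}

impl FullNodeSlots {
    pub fn new(capacity: usize) -> Self {
        FullNodeSlots { capacity, occupied: 0 }
    }

    /// Number of slots still available to new full nodes.
    pub fn free(&self) -> usize {
        self.capacity - self.occupied
    }

    /// Try to take a slot; rejected once every slot is in use.
    pub fn try_admit(&mut self) -> Admission {
        if self.occupied < self.capacity {
            self.occupied += 1;
            Admission::Accepted
        } else {
            Admission::TooManyFullNodes
        }
    }

    /// A connected full node leaves (for example, it reduced its
    /// outbound connections), which frees one slot.
    pub fn release(&mut self) {
        self.occupied = self.occupied.saturating_sub(1);
    }
}
```

The point of the sketch is that a rejected node can only be admitted after some other peer calls the equivalent of `release`, which is why freeing slots network-wide matters for the fix to be usable.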
Solution proposed
One way of increasing the number of available full node slots on the network is to reduce the outbound connections of nodes that don't really need them. A high number of connections is needed to speed up the initial sync, but when a node is just doing keep-up sync, fewer connections suffice. So the proposal is to reduce the number of outgoing connections once the initial sync is finished, freeing up slots for other full nodes.
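The proposal could be sketched roughly as a policy that maps the sync phase to a target number of outbound connections. Everything below is an assumption made for illustration: `SyncPhase`, `OutboundPolicy` and the field names are hypothetical, and the concrete limits are left to the caller rather than taken from any Substrate default.

```rust
/// Hypothetical sync phases; not Substrate's actual sync state types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncPhase {
    /// Catching up with the chain; many peers speed this up.
    InitialSync,
    /// Following the chain tip; few peers are enough.
    KeepUp,
}

/// Hypothetical policy holding one outbound limit per phase.
#[derive(Debug, Clone, Copy)]
pub struct OutboundPolicy {
    pub initial_sync_out_peers: usize,
    pub keep_up_out_peers: usize,
}

impl OutboundPolicy {
    /// Target number of outbound connections for the given phase.
    /// The keep-up target never exceeds the initial sync target.
    pub fn target(&self, phase: SyncPhase) -> usize {
        match phase {
            SyncPhase::InitialSync => self.initial_sync_out_peers,
            SyncPhase::KeepUp => self
                .keep_up_out_peers
                .min(self.initial_sync_out_peers),
        }
    }

    /// How many outbound connections should be closed to reach the
    /// target, freeing that many full node slots on remote peers.
    pub fn excess(&self, phase: SyncPhase, current_outbound: usize) -> usize {
        current_outbound.saturating_sub(self.target(phase))
    }
}
```

With illustrative limits of 8 outbound peers during initial sync and 2 during keep-up, a node holding 8 outbound connections would close 6 of them when it switches to keep-up sync. Which peers to drop, and how quickly, is a separate design question that this sketch does not address.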