-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
P2P: Don't add peer to maintained connections if local node is not ready #4469
Conversation
* Fixes an issue due to which multiple peer connections are setup over time to the local node if it is present in the static nodes list. The code was not handling the case where local node is not yet ready (encountered at startup). * Also adds a one time call to initiate connection to nodes in static nodes list after local node is ready. In the absence of this call, the connection attempt is made after the configured interval (default 60 seconds) to check maintained connections. This is to make sure the node doesn't have to wait for almost a minute before it can start interacting with peers (and sync blocks etc.) Signed-off-by: Vinod Damle <[email protected]>
Signed-off-by: Vinod Damle <[email protected]>
Signed-off-by: Vinod Damle <[email protected]>
c2ce213
to
bb98e8a
Compare
Signed-off-by: Vinod Damle <[email protected]>
Need some hand holding to fix the acceptance test https://github.com/hyperledger/besu/blob/main/acceptance-tests/tests/src/test/java/org/hyperledger/besu/tests/acceptance/bootstrap/StaticNodesAcceptanceTest.java#L38 . I see there is a separate |
I'm looking at this @vdamle |
@@ -259,6 +259,10 @@ public void start() { | |||
peerBondedObserverId = | |||
OptionalLong.of(peerDiscoveryAgent.observePeerBondedEvents(this::handlePeerBondedEvent)); | |||
|
|||
// Call checkMaintainedConnectionPeers() now that the local node is up, for immediate peer | |||
// additions | |||
checkMaintainedConnectionPeers(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might make it faster - have you observed this in practice ie on a running node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the absence of this call, the node does not establish peering until the first scheduled run of maintained connections:
2022-09-30 21:43:31.530+00:00 | main | INFO | DefaultSynchronizer | Starting synchronizer.
2022-09-30 21:43:31.531+00:00 | main | INFO | FullSyncDownloader | Starting full sync.
2022-09-30 21:43:31.533+00:00 | main | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:31.544+00:00 | main | INFO | JsonRpcHttpService | Starting JSON-RPC service on 0.0.0.0:8555
2022-09-30 21:43:31.736+00:00 | vert.x-eventloop-thread-1 | INFO | JsonRpcHttpService | JSON-RPC service started and listening on 0.0.0.0:8555
2022-09-30 21:43:31.737+00:00 | main | INFO | WebSocketService | Starting Websocket service on 0.0.0.0:8556
2022-09-30 21:43:31.740+00:00 | vert.x-eventloop-thread-1 | INFO | WebSocketService | Websocket service started and listening on 0.0.0.0:8556
2022-09-30 21:43:31.741+00:00 | main | INFO | Runner | Ethereum main loop is up.
2022-09-30 21:43:31.793+00:00 | main | INFO | AutoTransactionLogBloomCachingService | Starting auto transaction log bloom caching service.
2022-09-30 21:43:31.797+00:00 | main | INFO | LogBloomCacheMetadata | Lookup cache metadata file in data directory: /qdata/ethereum/caches
2022-09-30 21:43:36.538+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:41.539+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:46.540+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:51.541+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:56.542+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:01.544+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:06.544+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:11.546+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:16.547+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:21.548+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:26.550+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:31.551+00:00 | EthScheduler-Timer-0 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:35.025+00:00 | nioEventLoopGroup-3-5 | INFO | FullSyncTargetManager | No sync target, waiting for peers: 1
2022-09-30 21:44:40.415+00:00 | nioEventLoopGroup-3-5 | INFO | SyncTargetManager | Found common ancestor with peer Peer 0xc39364ec1ed6c1d02f... at block 27011
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good find!
&& localNode.getPeer() != null | ||
&& localNode.getPeer().getEnodeURL() != null; | ||
if (!localNodeIsReady | ||
|| peer.getEnodeURL().getNodeId().equals(localNode.getPeer().getEnodeURL().getNodeId())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change isn't needed. localNode.isReady just checks if the enode is ready (a pre-requisite to checking if our enode is the same as the peer), it's not checking if the network is ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ this change is what breaks the staticNodes AcceptanceTest - so if you change this back the tests should pass :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @macfarla that makes sense. I've updated the check to only whether the local node is ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you revert the changes to this method addMaintainedConnectionPeer() altogether, the tests will pass and we can approve and merge it.
localNode.isReady just means we can read the enode - needed in order to do the comparison with the peer enode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving aside the acceptance tests for a moment, the purpose of this PR is to fix a bug due to which the local node is repeatedly getting added to the maintained connection list. Here is the problem with the original code (i.e, if I revert these changes altogether):
The above check always fails during node startup when the caller is invoking this method for each entry present in static-nodes.json
, which in our case happens to include the local node. The check is A && B && C && D
, and it bails as soon as A
is false
. As a result:
the local node gets added added as a connection. Below are logs from a 2 node network, you can see that the peer count shows up as 2954
! I don't have the output of admin.peers()
saved from triaging this in a live customer chain, but I did check and confirm that an entry with the local enode ID was getting duplicated hundreds of times.
2022-09-19 07:18:49.072+00:00 | EthScheduler-Workers-1 | INFO | PersistBlockTask | Imported #10,219,794 / 0 tx / 0 om / 0 (0.0%) gas / (0xe005699262c248b6b4eeb6b2b1f898a335ee4af4fb41e9b853c3d3fa07ce071b) in 0.018s. Peers: 1439
2022-09-19 07:18:52.191+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219793, Peers: 1439
2022-09-19 07:18:57.192+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219794, Peers: 1439
2022-09-19 07:18:59.045+00:00 | EthScheduler-Workers-3 | INFO | PersistBlockTask | Imported #10,219,796 / 0 tx / 0 om / 0 (0.0%) gas / (0x05a2be65bd2b648ee081813348f09691c99d3a4e5f723eaa2f0c58acb6d4aa06) in 0.013s. Peers: 1439
2022-09-19 07:19:02.192+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219795, Peers: 1439
2022-09-19 07:19:04.045+00:00 | EthScheduler-Workers-0 | INFO | PersistBlockTask | Imported #10,219,797 / 0 tx / 0 om / 0 (0.0%) gas / (0xfae06c6df283597d7ebe7e62d902544597e0cddc1aea3fb4605d3af9f954b6c4) in 0.015s. Peers: 1439
2022-09-19 07:19:07.192+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219796, Peers: 1439
2022-09-19 07:19:12.193+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219797, Peers: 1439
2022-09-19 07:19:14.060+00:00 | EthScheduler-Workers-2 | INFO | PersistBlockTask | Imported #10,219,799 / 0 tx / 0 om / 0 (0.0%) gas / (0x606531274d1ceb438ea3abf8fcfaeb5d6a78a0f3f96d1ebbdd94fe11b940cba1) in 0.014s. Peers: 1439
...
...
...
2022-09-22 14:24:58.917+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276746, Peers: 2954
2022-09-22 14:24:59.734+00:00 | EthScheduler-Workers-0 | INFO | PersistBlockTask | Imported #10,276,748 / 0 tx / 0 om / 0 (0.0%) gas / (0xcab6f9cf553508b0778475c73f5b6207680a12d97f8ad9d56ac0082ede5a7acb) in 0.489s. Peers: 2954
2022-09-22 14:25:03.952+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276747, Peers: 2954
2022-09-22 14:25:04.704+00:00 | EthScheduler-Workers-3 | INFO | PersistBlockTask | Imported #10,276,749 / 0 tx / 0 om / 0 (0.0%) gas / (0x414b14b45a564bdcffe02cb9efdd4dcc98d7f9f4828a696601152814c0047078) in 0.507s. Peers: 2954
2022-09-22 14:25:08.984+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276748, Peers: 2954
2022-09-22 14:25:14.487+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276749, Peers: 2954
2022-09-22 15:22:17.029+00:00 | EthScheduler-Workers-0 | INFO | PersistBlockTask | Imported #10,276,750 / 1 tx / 0 om / 251,662 (0.0%) gas / (0x6fdb2f23177ff7054bf27b933d113ba6dee180177dc90e9d1ade3a6cf43da23b) in 3426.331s. Peers: 2952
To me, this method should return false
if any of the below holds:
- Local node is not ready - in this case, as you point out, we cannot read the enode yet.
- the entry being added is the local node.
The latest diff I've got does the above but one or more acceptance tests still fail because the tests expect to see 1 peer and there are zero peers. I'm not sure the correct solution is to revert the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what you're saying now but I think the problem is in the (regularly called) checkMaintainedConnectionPeers method rather than the addMaintained method which is only called at startup (once) or by the adminAddPeer method. I had a play around with the code and came up with this #4521 - let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I agree that the multiple additions would occur if the list provided to checkMaintainedConnectionPeers()
includes the local node. The change I proposed ensures that the list provided to checkMaintainedConnectionPeers
does not contain the local node whereas #4521 treats the list as potentially faulty and filters out the local node. It seems reasonable to eliminate the local node at that stage.
Given that we are going with #4521 as the fix for this issue, I will revert the changes to addMaintainedConnectionPeer
and update the title and description.
@vdamle the change that breaks the AT isn't required - so then you just need to verify that the additional call to checkMaintainedPeers is actually helpful (run up the code on a besu node and see?) and we can review |
Signed-off-by: Vinod Damle <[email protected]>
Signed-off-by: Vinod Damle <[email protected]>
CHANGELOG.md
Outdated
@@ -11,7 +11,8 @@ | |||
|
|||
### Bug Fixes | |||
- Corrects emission of blockadded events when rewinding during a re-org. Fix for [#4495](https://github.com/hyperledger/besu/issues/4495) | |||
- | |||
- Do not add static node to maintained connection list if local node is not yet ready. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe something like "initiate connection to maintained peers soon after startup"
We rely on filtering in checkMaintainedConnectionPeers() to skip the local node when setting up peer connections Signed-off-by: Vinod Damle <[email protected]>
Signed-off-by: Vinod Damle <[email protected]>
Signed-off-by: Vinod Damle <[email protected]>
closing this one - I couldn't push to your branch so I made a new PR fixing the changelog so we could get it in the RC #4543 |
PR description
Fixes an issue due to which multiple peer connections are setup over time to the local node if it is present in the static nodes list. The code was not handling the case where local node is not yet ready (encountered at startup). Don't connect to peer if it has some enode ID #1703 added logic to skip adding local node to the set of maintained peer connections. However, the check fails to take into account the case where the local node is not marked ready (occurs at startup since sanitizePeers is called before the call is made to setup the networkRunner at which time the local node is setup and marked ready.
Also adds a one time call to initiate connection to nodes in static nodes list after local node is ready. In the absence of this call, the connection attempt is made after the configured interval (default 60 seconds) to check maintained connections. This is to make sure the node doesn't have to wait for almost a minute before it can start interacting with peers (and sync blocks etc.).
Fixed Issue(s)
Fixes #1702
Documentation
I thought about documentation and added thedoc-change-required
label to this PR ifupdates are required.
No doc updates necessary.
Changelog