Peering - disconnects refactor #6968

Closed
wants to merge 31 commits into from

Contributor

@macfarla macfarla commented Apr 19, 2024

PR description

Before this PR, once we reach max peers we refuse ALL incoming connections (with reason UNKNOWN), regardless of any properties of the incoming peer. We were also disconnecting peers in a few places for various reasons, including repeated timeouts and "useless" responses. This PR aims to consolidate those disconnects and to disconnect an established peer only when we have a better peer to replace it with.

  • Move the disconnection decision from PeerReputation/Peer to EthPeers so the totality of current peers can be considered.
  • Use the reputation score to sort peers within EthPeers (bestPeerComparator).
    • Note: in a few places in tests I've explicitly set the comparator to the one used before (based on the chain height estimate) to avoid having to update a heap of tests that depend on the decision made by the comparator.
  • Only disconnect the "worst" peer if we are at max peers (I think this will help on holesky for "useless" disconnects, but not for socket closed etc., where the connection is already gone).
  • On an incoming connection, compare the incoming peer to the current collection of peers and (if at max peers) disconnect whichever compares least favourably. In effect this will be the incoming peer if all our current peers are giving us good responses (reputation score), or an existing peer if any are not giving good responses.
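
A minimal sketch of the incoming-connection rule described in the list above, assuming peers are ordered by reputation score alone. SketchPeer, SketchPeers and onIncomingConnection are hypothetical names used for illustration only, not the EthPeer/EthPeers API in this PR. The tie-break (an incoming peer that is not strictly better than the worst current peer is refused) is also an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a connected peer; not Besu's EthPeer.
final class SketchPeer {
  final String id;
  final int reputationScore; // higher is better (assumption)

  SketchPeer(final String id, final int reputationScore) {
    this.id = id;
    this.reputationScore = reputationScore;
  }
}

// Hypothetical stand-in for the peer collection; not Besu's EthPeers.
final class SketchPeers {
  // Assumed ordering: compare by reputation score only, lowest score is "worst".
  static final Comparator<SketchPeer> BEST_PEER =
      Comparator.comparingInt((SketchPeer p) -> p.reputationScore);

  private final int maxPeers;
  private final List<SketchPeer> peers = new ArrayList<>();

  SketchPeers(final int maxPeers) {
    if (maxPeers < 1) {
      throw new IllegalArgumentException("maxPeers must be at least 1");
    }
    this.maxPeers = maxPeers;
  }

  // Returns the peer that should be disconnected, if any.
  Optional<SketchPeer> onIncomingConnection(final SketchPeer incoming) {
    if (peers.size() < maxPeers) {
      // Below the limit: accept the peer, nobody is disconnected.
      peers.add(incoming);
      return Optional.empty();
    }
    // At the limit: find the current peer that compares least favourably.
    final SketchPeer worst = Collections.min(peers, BEST_PEER);
    if (BEST_PEER.compare(incoming, worst) <= 0) {
      // Incoming peer is not strictly better: refuse it, keep existing peers.
      return Optional.of(incoming);
    }
    // Incoming peer is better: replace the worst existing peer.
    peers.remove(worst);
    peers.add(incoming);
    return Optional.of(worst);
  }

  int size() {
    return peers.size();
  }
}
```

With maxPeers set to 2 and current scores of 100 and 87, an incoming peer scoring 100 would displace the peer scoring 87, while a later incoming peer scoring 90 would itself be refused because both remaining peers score 100.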

example debug log (yes there is a TODO to remove this log)
{"@timestamp":"2024-04-19T00:53:31,297","level":"DEBUG","thread":"nioEventLoopGroup-3-9","class":"EthPeers","message":"comparing worstCurrentPeer PeerId: 0x024e106a70572288... PeerReputation score: 87, timeouts: {3=5}, useless: 0, validated? true, disconnected? false, client: erigon/v2.58.2-125509e4/linux-amd64/go1.21.8, [Connection with hashCode 1276176835 inboundInitiated? true initAt 1713481827797], enode://024e106a70572288701e97724610d120a682f69f24881eeee0ab1f6379646780d93daf11d9226593faf71deb699f24020dbecf801eb3d0da779ef2be641590fa@3.38.172.157:30304 with connectingPeer PeerId: 0x1ec3a5e247e616a3... PeerReputation score: 100, timeouts: {}, useless: 0, validated? true, disconnected? false, client: Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2, [Connection with hashCode 1951644955 inboundInitiated? false initAt 1713488011205], enode://1ec3a5e247e616a347038b9c35a1328529bdf354408fe3c968433df542eb1b2a7c7d1b7b7d481a5b6465baac64877ee53e1396ddf43a3ad0f5f0adbaed659145@188.40.67.160:30303","throwable":""}

Have seen decent results on holesky and mainnet. See screenshots

Fixed Issue(s)

Refs #6805 and #6842

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests

…til we actually compare

Signed-off-by: Sally MacFarlane <[email protected]>
@macfarla
Contributor Author

3 x mainnet nodes
1 is a bit flat
Screenshot 2024-04-19 at 12 53 51 PM

@macfarla
Contributor Author

3 x holesky
Screenshot 2024-04-19 at 12 55 44 PM

@Beanow

Beanow commented May 5, 2024

As mentioned in #6945, I do see far fewer UNKNOWN disconnects once we hit our peer limit.

image
Holesky here.

@macfarla
Contributor Author

burn-in of 10 bonsai/checkpoint mainnet nodes (started in 2 batches)
Screenshot 2024-05-22 at 10 46 39 AM

@macfarla
Contributor Author

peering graph from the first batch of mainnet nodes
Screenshot 2024-05-22 at 10 51 41 AM

and the second
Screenshot 2024-05-22 at 10 52 35 AM

@macfarla macfarla requested a review from pinges May 29, 2024 02:27
@@ -215,7 +215,7 @@ public void recordRequestTimeout(final int requestCode) {
 .addArgument(this::getLoggableId)
 .log();
 LOG.trace("Timed out while waiting for response from peer {}", this);
-reputation.recordRequestTimeout(requestCode, this).ifPresent(this::disconnect);
+reputation.recordRequestTimeout(requestCode, this);
Contributor


I think that we should disconnect these, because otherwise we might waste time using these peers for requests that are likely to time out as well.
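
For illustration, one way to read this suggestion is to keep signalling a disconnect once a peer has timed out repeatedly on the same request type. The sketch below is hypothetical: TimeoutTrackerSketch, its threshold parameter and the "TIMEOUT" reason string are assumptions made for the example, not PeerReputation's actual fields or behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not Besu's PeerReputation.
final class TimeoutTrackerSketch {
  private final int timeoutThreshold; // assumed to be chosen by the caller
  private final Map<Integer, Integer> timeoutsByRequestCode = new HashMap<>();

  TimeoutTrackerSketch(final int timeoutThreshold) {
    this.timeoutThreshold = timeoutThreshold;
  }

  // Records a timeout for the request code. Returns a disconnect reason once
  // the count for that request code reaches the threshold, otherwise empty.
  Optional<String> recordRequestTimeout(final int requestCode) {
    final int count = timeoutsByRequestCode.merge(requestCode, 1, Integer::sum);
    return count >= timeoutThreshold ? Optional.of("TIMEOUT") : Optional.empty();
  }

  // Clears the count for a request code, e.g. after a successful response.
  void resetTimeoutCount(final int requestCode) {
    timeoutsByRequestCode.remove(requestCode);
  }
}
```

A caller could then keep the existing pattern, for example `tracker.recordRequestTimeout(code).ifPresent(reason -> disconnect(reason))`, where `disconnect` is whatever the peer class already exposes.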

@@ -224,7 +224,7 @@ public void recordUselessResponse(final String requestType) {
 .addArgument(requestType)
 .addArgument(this::getLoggableId)
 .log();
-reputation.recordUselessResponse(System.currentTimeMillis(), this).ifPresent(this::disconnect);
+reputation.recordUselessResponse(System.currentTimeMillis(), this);
Contributor


See above.

@@ -132,6 +132,7 @@ private Stream<EthPeer> remainingPeersToTry() {
 }

 private void refreshPeers() {
+// TODO this duplicates EthPeers.disconnectWorst
Contributor


Looking at line 141, I think we could just not filter on !is.disconnected?

Contributor Author


done

@macfarla
Contributor Author

Screenshot 2024-06-27 at 11 43 22 AM
around 1h for all mainnet nodes to get to 100% peers

@macfarla
Contributor Author

Screenshot 2024-06-27 at 11 41 25 AM
syncing progress

@macfarla
Contributor Author

@pinges I have made the changes you requested, can you review?

@macfarla
Contributor Author

going to close this and reprise the worthy changes into separate PRs

3 participants