Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - fetch/peers: make latency based selection biased towards last requests #5688

Closed
wants to merge 14 commits into from

Conversation

dshulyak
Copy link
Contributor

@dshulyak dshulyak commented Mar 12, 2024

in the flaky system test it took 4 minutes to stop sending requests to a node that was actively dropping requests.
in this pr i switched peer latency estimator to be biased towards latency observed in last requests, i looked up if/how lotus node handles similar issues.

beside that there will be an INFO log with top peers stats, and global average latency in nanoseconds and total number of peers. by default node will emit log every 30m and it can be tuned by adding log-peer-stats-interval in fetch section of the config.

"fetch": {
        "log-peer-stats-interval": "1m"
}

@dshulyak
Copy link
Contributor Author

dshulyak commented Mar 12, 2024

problem in the test peers stick for too long to a couple of peers, that peer drops them but they can't make progress and eventually go out of sync. unsynced node can't submit transaction, hence the failure

image

Copy link

codecov bot commented Mar 12, 2024

Codecov Report

Attention: Patch coverage is 65.07937% with 22 lines in your changes are missing coverage. Please review.

Project coverage is 79.7%. Comparing base (fb60767) to head (dfc3f40).

❗ Current head dfc3f40 differs from pull request most recent head 79a9859. Consider uploading reports for the commit 79a9859 to get more accurate results

Files Patch % Lines
fetch/peers/peers.go 62.7% 15 Missing and 1 partial ⚠️
syncer/syncer.go 25.0% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop   #5688     +/-   ##
=========================================
- Coverage     79.8%   79.7%   -0.2%     
=========================================
  Files          279     279             
  Lines        28426   28468     +42     
=========================================
- Hits         22712   22704      -8     
- Misses        4148    4199     +51     
+ Partials      1566    1565      -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dshulyak
Copy link
Contributor Author

bors try

spacemesh-bors bot added a commit that referenced this pull request Mar 12, 2024
@dshulyak
Copy link
Contributor Author

bors cancel

@dshulyak
Copy link
Contributor Author

bors try

@spacemesh-bors
Copy link

try

Already running a review

@spacemesh-bors
Copy link

try

Build failed:

@dshulyak dshulyak changed the title debug flaky partition test fetch/peers: make latency based selection biased towards last requests Mar 12, 2024
@dshulyak
Copy link
Contributor Author

bors try

spacemesh-bors bot added a commit that referenced this pull request Mar 12, 2024
@dshulyak dshulyak marked this pull request as ready for review March 12, 2024 08:15
@spacemesh-bors
Copy link

try

Build failed:

syncer/syncer.go Outdated Show resolved Hide resolved
Comment on lines 599 to 600
go func() {
data, err := f.sendBatch(peer, batch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR, but this part of the code seems like it can be improved:

  • the spawned go routine isn't tracked via an errgroup.Group or sync.Waitgroup
  • there is no cancellation via a context.Context in place, the caller has no control over when sending a request shall be aborted (shutdown or timeout)
  • handleHashError locks a mutex for its whole lifetime, while receiveResponse does not. So if something is wrong and a lot of errors happen everything might slow down 🤔

fetch/peers/peers.go Outdated Show resolved Hide resolved
fetch/peers/peers.go Show resolved Hide resolved
@dshulyak
Copy link
Contributor Author

bors try

spacemesh-bors bot added a commit that referenced this pull request Mar 12, 2024
@spacemesh-bors
Copy link

try

Build failed:

@dshulyak
Copy link
Contributor Author

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Mar 12, 2024
#5688)

in the flaky system test it took 4 minutes to stop sending requests to a node that was actively dropping requests.
in this pr i switched peer latency estimator to be biased towards latency observed in last requests, i looked up if/how lotus node handles similar issues.

beside that there will be an INFO log with top peers stats, and global average latency in nanoseconds and total number of peers. by default node will emit log every 30m and it can be tuned by adding log-peer-stats-interval in fetch section of the config.

```json
"fetch": {
        "log-peer-stats-interval": "1m"
}
```
@spacemesh-bors
Copy link

Build failed:

@dshulyak
Copy link
Contributor Author

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Mar 12, 2024
#5688)

in the flaky system test it took 4 minutes to stop sending requests to a node that was actively dropping requests.
in this pr i switched peer latency estimator to be biased towards latency observed in last requests, i looked up if/how lotus node handles similar issues.

beside that there will be an INFO log with top peers stats, and global average latency in nanoseconds and total number of peers. by default node will emit log every 30m and it can be tuned by adding log-peer-stats-interval in fetch section of the config.

```json
"fetch": {
        "log-peer-stats-interval": "1m"
}
```
@spacemesh-bors
Copy link

Build failed (retrying...):

spacemesh-bors bot pushed a commit that referenced this pull request Mar 12, 2024
#5688)

in the flaky system test it took 4 minutes to stop sending requests to a node that was actively dropping requests.
in this pr i switched peer latency estimator to be biased towards latency observed in last requests, i looked up if/how lotus node handles similar issues.

beside that there will be an INFO log with top peers stats, and global average latency in nanoseconds and total number of peers. by default node will emit log every 30m and it can be tuned by adding log-peer-stats-interval in fetch section of the config.

```json
"fetch": {
        "log-peer-stats-interval": "1m"
}
```
@spacemesh-bors
Copy link

Pull request successfully merged into develop.

Build succeeded:

@spacemesh-bors spacemesh-bors bot changed the title fetch/peers: make latency based selection biased towards last requests [Merged by Bors] - fetch/peers: make latency based selection biased towards last requests Mar 12, 2024
@spacemesh-bors spacemesh-bors bot closed this Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants