[Merged by Bors] - fetch/peers: make latency based selection biased towards last requests #5688

dshulyak · 2024-03-12T06:32:51Z

In the flaky system test it took 4 minutes to stop sending requests to a node that was actively dropping requests.
In this PR I switched the peer latency estimator to be biased towards the latency observed in the most recent requests. I also looked up if/how the lotus node handles similar issues.

Besides that, there will be an INFO log with stats for the top peers, the global average latency in nanoseconds, and the total number of peers. By default the node emits this log every 30m; the interval can be tuned by adding log-peer-stats-interval to the fetch section of the config.

"fetch": {
        "log-peer-stats-interval": "1m"
}

dshulyak · 2024-03-12T06:42:51Z

The problem in the test: nodes stick for too long to a couple of peers. Those peers drop their requests, so the nodes can't make progress and eventually go out of sync. An unsynced node can't submit a transaction, hence the failure.

codecov · 2024-03-12T06:46:59Z

Codecov Report

Attention: Patch coverage is 65.07937% with 22 lines in your changes are missing coverage. Please review.

Project coverage is 79.7%. Comparing base (fb60767) to head (dfc3f40).

❗ Current head dfc3f40 differs from pull request most recent head 79a9859. Consider uploading reports for the commit 79a9859 to get more accurate results

Files	Patch %	Lines
fetch/peers/peers.go	62.7%	15 Missing and 1 partial ⚠️
syncer/syncer.go	25.0%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##           develop   #5688     +/-   ##
=========================================
- Coverage     79.8%   79.7%   -0.2%     
=========================================
  Files          279     279             
  Lines        28426   28468     +42     
=========================================
- Hits         22712   22704      -8     
- Misses        4148    4199     +51     
+ Partials      1566    1565      -1


dshulyak · 2024-03-12T06:51:40Z

bors try

dshulyak · 2024-03-12T07:15:48Z

bors cancel

dshulyak · 2024-03-12T07:15:53Z

bors try

spacemesh-bors · 2024-03-12T07:15:55Z

try

Already running a review

spacemesh-bors · 2024-03-12T07:21:30Z

try

Build failed:

ci-status

dshulyak · 2024-03-12T08:06:56Z

bors try

spacemesh-bors · 2024-03-12T09:13:02Z

try

Build failed:

systest-status

syncer/syncer.go

fasmat · 2024-03-12T09:08:10Z

fetch/fetch.go

 			go func() {
 				data, err := f.sendBatch(peer, batch)


Not part of this PR, but this part of the code looks like it could be improved:

the spawned goroutine isn't tracked via an errgroup.Group or sync.WaitGroup

there is no cancellation via a context.Context in place, so the caller has no control over when sending a request should be aborted (shutdown or timeout)

handleHashError holds a mutex for its whole lifetime, while receiveResponse does not. So if something goes wrong and a lot of errors happen, everything might slow down 🤔

fetch/peers/peers.go

dshulyak · 2024-03-12T09:30:37Z

bors try

spacemesh-bors · 2024-03-12T10:36:08Z

try

Build failed:

systest-status

dshulyak · 2024-03-12T10:53:59Z

bors merge

spacemesh-bors · 2024-03-12T11:23:16Z

Build failed:

ci-status

dshulyak · 2024-03-12T11:32:29Z

bors merge

spacemesh-bors · 2024-03-12T11:58:31Z

Build failed (retrying...):

ci-status

spacemesh-bors · 2024-03-12T12:48:17Z

Pull request successfully merged into develop.

Build succeeded:

systest-status
ci-status

dshulyak added 4 commits March 9, 2024 08:19

log peer stats and immediate startup

79c8dd0

debug flaky partition test

ecade57

enable syncer debug logs

401b3b1

metrics were enabled

e7cbeef

dshulyak added 3 commits March 12, 2024 07:47

remove interval here

f23495b

Merge branch 'peer-stats' into debug-flaky-partitionn

c900848

print peer stats

7bac5fd

spacemesh-bors bot added a commit that referenced this pull request Mar 12, 2024

Try #5688:

21491d3

use log scaling, looked up in lotus

8bf3635

dshulyak added 2 commits March 12, 2024 08:44

log best peers periodically

518cb64

cleanup

2ed921b

dshulyak changed the title ~~debug flaky partition test~~ fetch/peers: make latency based selection biased towards last requests Mar 12, 2024

Merge branch 'develop' into debug-flaky-partitionn

866507f

spacemesh-bors bot added a commit that referenced this pull request Mar 12, 2024

Try #5688:

0681101

dshulyak marked this pull request as ready for review March 12, 2024 08:15

dshulyak requested review from fasmat, poszu and ivan4th as code owners March 12, 2024 08:15

fasmat approved these changes Mar 12, 2024

spacemesh-bors bot added a commit that referenced this pull request Mar 12, 2024

Try #5688:

fee9cd8

refactor couple of if conditions

dfc3f40

dshulyak added 2 commits March 12, 2024 11:53

remove logging

c78c20f

Merge branch 'develop' into debug-flaky-partitionn

79a9859

spacemesh-bors bot changed the title ~~fetch/peers: make latency based selection biased towards last requests~~ [Merged by Bors] - fetch/peers: make latency based selection biased towards last requests Mar 12, 2024

spacemesh-bors bot closed this Mar 12, 2024

fasmat mentioned this pull request Mar 15, 2024

Add check to not overwrite an existing key when migrating after downgrade #5709

Closed

4 tasks
