This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

More instrumentation of context cancellations and reward low latency nodes #85

Merged 5 commits into main from feat/more-context-cancellation on Apr 7, 2023

Conversation

aarshkshah1992 (Contributor) commented on Apr 7, 2023:

  1. We do have some peers with a reasonable P75 latency even after they have served >100 successful cache-hit retrievals:
    https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-6h&to=now&editPanel=32

  2. Even after the context-cancellation fix by @willscott in "don't keep retrying on context errors" #82, a significant number of requests (for both blocks and CARs) still fail with a context-cancelled error. Let's instrument them further to help Bifrost debug this.
    https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-6h&to=now&editPanel=30

@aarshkshah1992 aarshkshah1992 requested a review from willscott April 7, 2023 07:32
caboose.go Outdated
Comment on lines 276 to 278
if recordIfContextErr(resourceTypeBlock, ctx, "FetchBlockApi") {
return nil, ctx.Err()
}
willscott (Contributor):
This is only a partial view, right? Because Has and GetSize calls also trigger parallel block Gets?

aarshkshah1992 (Contributor, Author):
@willscott Yeah, I just want to focus on fetch for now.

willscott (Contributor):

Then let's not record the block API at all; misreporting will lead to confusion.

aarshkshah1992 (Contributor, Author):

@willscott

Fixed the instrumentation. We now record on the lower-level block and CAR helpers, fetchBlockWith and fetchResourceWith, instead of the public API, so we capture accurate numbers.

pool.go Outdated
if perf, ok := p.nodePerf[m.url]; ok {
	// Our analysis so far shows that we do have ~10-15 peers with P75 < 200ms latency.
	// It's not the best but it's a good start and we can tune as we go along.
	if perf.latencyDigest.Count() > 100 && perf.latencyDigest.Quantile(0.75) <= 200 {
willscott (Contributor):

These parameters probably shouldn't be hard-coded? Maybe environment variables, at least?

aarshkshah1992 (Contributor, Author):

@willscott Made these variables for now. Will make them env vars in the next Caboose PR that I'm going to write, which will bring a much better pool-membership and weighing algorithm.
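The gating condition above can be sketched with the two thresholds lifted out as variables, as the review asked. This is a self-contained approximation: the real code uses a t-digest for quantiles, so the sorted-slice p75 helper and the function names here are stand-ins, not Caboose's actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// The two thresholds the review asked to lift out of the hard-coded
// condition; env-var wiring was deferred to a later PR.
var (
	minSuccessfulRetrievals = 100   // latency samples required before trusting a node
	maxP75LatencyMillis     = 200.0 // P75 latency ceiling for the low-latency pool
)

// p75 approximates perf.latencyDigest.Quantile(0.75); a sorted-slice
// quantile is enough for a sketch.
func p75(latencies []float64) float64 {
	s := append([]float64(nil), latencies...)
	sort.Float64s(s)
	return s[int(float64(len(s)-1)*0.75)]
}

// qualifiesForLowLatencyPool mirrors the gating condition from pool.go:
// enough successful retrievals, and a P75 latency under the cap.
func qualifiesForLowLatencyPool(latencies []float64) bool {
	return len(latencies) > minSuccessfulRetrievals &&
		p75(latencies) <= maxP75LatencyMillis
}

func main() {
	fast := make([]float64, 150)
	for i := range fast {
		fast[i] = 120 // ms; well under the 200ms P75 cap
	}
	fmt.Println(qualifiesForLowLatencyPool(fast)) // true
}
```

Requiring >100 samples before trusting the quantile avoids promoting a node on a handful of lucky retrievals; the P75 cap then keeps only consistently fast peers in the preferred pool.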

metrics.go Outdated
// Before: [50ms, 100ms, 200ms, ..., ~25 seconds]
// latencyDistMsHistogram = prometheus.ExponentialBuckets(50, 2, 10)
// After: [50ms, 75ms, 100ms, ..., 525ms]
latencyDistMsHistogram = prometheus.LinearBuckets(50, 25, 20)
willscott (Contributor):

Can we start at 0 (i.e., add the 0-25 and 25-50 buckets)?

aarshkshah1992 (Contributor, Author):

This is done.
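The exact call that landed isn't shown in the thread. A stdlib-only reimplementation of prometheus.LinearBuckets makes the effect of the review comment easy to see; `LinearBuckets(0, 25, 22)` is one way to keep the 525ms upper edge while adding the 0-25 and 25-50 buckets:

```go
package main

import "fmt"

// linearBuckets mirrors prometheus.LinearBuckets with the stdlib only:
// `count` bucket upper bounds starting at `start`, each `width` apart.
func linearBuckets(start, width float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start + float64(i)*width
	}
	return buckets
}

func main() {
	// Before the review: buckets start at 50ms.
	fmt.Println(linearBuckets(50, 25, 20)) // [50 75 ... 525]
	// After "can we start at 0": two extra buckets cover 0-25 and 25-50.
	fmt.Println(linearBuckets(0, 25, 22)) // [0 25 50 ... 525]
}
```

Starting at 0 matters here because the whole point of the change is to reward low-latency nodes: without the sub-50ms buckets, the fastest responses would all collapse into the first bucket and be indistinguishable.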

@aarshkshah1992 aarshkshah1992 merged commit 65275d0 into main Apr 7, 2023
@aarshkshah1992 aarshkshah1992 deleted the feat/more-context-cancellation branch April 7, 2023 09:42