More instrumentation of context cancellations and reward low latency nodes #85
Conversation
caboose.go
Outdated
if recordIfContextErr(resourceTypeBlock, ctx, "FetchBlockApi") {
	return nil, ctx.Err()
}
this is only a partial view of this, right? - because asks for Has and GetSize also trigger parallel block fetches via Get?
@willscott Yeah, just wanna focus on fetch for now.
then let's not record the block API at all - misreporting will lead to confusion?
Fixed the instrumentation. We now record on the lower-level block and CAR calls, fetchBlockWith and fetchResourceWith, instead of the public API, so we capture accurate numbers.
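For reference, a minimal sketch of what a helper like recordIfContextErr could look like, assuming a Prometheus counter vec labeled by resource type and call site (metric name, labels, and registration here are illustrative assumptions, not the actual Caboose code):

package caboose

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: metric name and labels are assumptions, not Caboose's actual
// metrics. Registration via prometheus.MustRegister is omitted for brevity.
var fetchContextErrorTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "caboose_fetch_context_error_total",
		Help: "Fetches aborted because the caller's context was cancelled or timed out.",
	},
	[]string{"resource_type", "call_site"},
)

// recordIfContextErr reports whether ctx has been cancelled or timed out,
// incrementing the counter for the given resource type and call site if so.
func recordIfContextErr(resourceType string, ctx context.Context, callSite string) bool {
	if ctx.Err() != nil {
		fetchContextErrorTotal.WithLabelValues(resourceType, callSite).Inc()
		return true
	}
	return false
}

Recording at the fetchBlockWith/fetchResourceWith level means every retry attempt is counted, not just the public-API entry points.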
pool.go
Outdated
if perf, ok := p.nodePerf[m.url]; ok {
	// Our analysis so far shows that we do have ~10-15 peers with P75 < 200ms latency.
	// It's not the best but it's a good start and we can tune as we go along.
	if perf.latencyDigest.Count() > 100 && perf.latencyDigest.Quantile(0.75) <= 200 {
these parameters probably shouldn't be hard-coded? - maybe environment variables at least?
@willscott Made these variables for now. Will make them env vars in the next Caboose PR that I'm gonna write for a much better pool membership and weighting algo.
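A hedged sketch of how those thresholds could later be lifted into env-overridable variables (variable names, env keys, and defaults here are hypothetical, not what this PR ships):

package caboose

import (
	"os"
	"strconv"
)

// Defaults taken from the PR discussion; names and env keys are illustrative.
var (
	minSuccessfulRetrievals = envInt("CABOOSE_MIN_SUCCESSFUL_RETRIEVALS", 100)
	maxP75LatencyMs         = envFloat("CABOOSE_MAX_P75_LATENCY_MS", 200)
)

// envInt reads an integer from the environment, falling back to def.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// envFloat reads a float from the environment, falling back to def.
func envFloat(key string, def float64) float64 {
	if v := os.Getenv(key); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			return f
		}
	}
	return def
}

The membership check would then compare against these variables instead of the literals 100 and 200.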
metrics.go
Outdated
// [50ms, 100ms, 200ms, ..., ~25 seconds]
latencyDistMsHistogram = prometheus.ExponentialBuckets(50, 2, 10)
// [50ms, 75ms, 100ms, ..., 525ms]
latencyDistMsHistogram = prometheus.LinearBuckets(50, 25, 20)
can we start at 0? (add the 0-25, 25-50 buckets?)
This is done.
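For reference, starting the linear buckets at 0 could look like the following (the exact bucket count is an assumption, not necessarily the committed change):

// [0ms, 25ms, 50ms, ..., 525ms]: adds the 0-25 and 25-50 buckets requested above.
latencyDistMsHistogram = prometheus.LinearBuckets(0, 25, 22)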
We do have some peers with a reasonable P75 latency even after they have served >100 successful cache-hit retrievals:
https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-6h&to=now&editPanel=32
Even after the context cancellation fix by @willscott in don't keep retrying on context errors #82, a significant number of requests (for both blocks and CARs) still fail with a context cancelled error. Let's instrument them more to help Bifrost debug this:
https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-6h&to=now&editPanel=30