More instrumentation of context cancellations and reward low latency nodes #85
Conversation
caboose.go
Outdated
if recordIfContextErr(resourceTypeBlock, ctx, "FetchBlockApi") {
	return nil, ctx.Err()
}
this is only a partial view of this, right? - because asks for Has and GetSize also trigger parallel block fetches via Get?
@willscott Yeah, just wanna focus on fetch for now.
then let's not record the block API at all - misreporting will lead to confusion?
Fixed the instrumentation. We now record on the lower-level block and CAR calls, fetchBlockWith and fetchResourceWith, instead of the public API, so we capture accurate numbers.
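For reference, a minimal sketch of what a helper like recordIfContextErr could look like, assuming a Prometheus counter vec labeled by resource type and call site (metric name, labels, and registration here are illustrative assumptions, not the actual Caboose code):

package caboose

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: metric name and labels are assumptions, not Caboose's actual
// metrics. Registration via prometheus.MustRegister is omitted for brevity.
var fetchContextErrorTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "caboose_fetch_context_error_total",
		Help: "Fetches aborted because the caller's context was cancelled or timed out.",
	},
	[]string{"resource_type", "call_site"},
)

// recordIfContextErr reports whether ctx has been cancelled or timed out,
// incrementing the counter for the given resource type and call site if so.
func recordIfContextErr(resourceType string, ctx context.Context, callSite string) bool {
	if ctx.Err() != nil {
		fetchContextErrorTotal.WithLabelValues(resourceType, callSite).Inc()
		return true
	}
	return false
}

Recording at the fetchBlockWith/fetchResourceWith level means every retry attempt is counted, not just the public-API entry points.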
pool.go
Outdated
if perf, ok := p.nodePerf[m.url]; ok {
	// Our analysis so far shows that we do have ~10-15 peers with P75 < 200ms latency.
	// It's not the best but it's a good start and we can tune as we go along.
	if perf.latencyDigest.Count() > 100 && perf.latencyDigest.Quantile(0.75) <= 200 {
these parameters probably shouldn't be hard-coded? - maybe environment variables at least?
@willscott Made these variables for now. Will make them env vars in the next Caboose PR that I'm gonna write for a much better pool membership and weighting algo.
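A hedged sketch of how those thresholds could later be lifted into env-overridable variables (variable names, env keys, and defaults here are hypothetical, not what this PR ships):

package caboose

import (
	"os"
	"strconv"
)

// Defaults taken from the PR discussion; names and env keys are illustrative.
var (
	minSuccessfulRetrievals = envInt("CABOOSE_MIN_SUCCESSFUL_RETRIEVALS", 100)
	maxP75LatencyMs         = envFloat("CABOOSE_MAX_P75_LATENCY_MS", 200)
)

// envInt reads an integer from the environment, falling back to def.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// envFloat reads a float from the environment, falling back to def.
func envFloat(key string, def float64) float64 {
	if v := os.Getenv(key); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			return f
		}
	}
	return def
}

The membership check would then compare against these variables instead of the literals 100 and 200.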
metrics.go
Outdated
// [50ms, 100ms, 200ms, ..., ~25 seconds]
latencyDistMsHistogram = prometheus.ExponentialBuckets(50, 2, 10)
// [50ms, 75ms, 100ms, ..., 525ms]
latencyDistMsHistogram = prometheus.LinearBuckets(50, 25, 20)
can we start at 0? (add the 0-25, 25-50 buckets?)
This is done.
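For reference, starting the linear buckets at 0 could look like the following (the exact bucket count is an assumption, not necessarily the committed change):

// [0ms, 25ms, 50ms, ..., 525ms]: adds the 0-25 and 25-50 buckets requested above.
latencyDistMsHistogram = prometheus.LinearBuckets(0, 25, 22)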
We do have some peers with a reasonable P75 latency even after they have served >100 successful cache-hit retrievals:
https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-6h&to=now&editPanel=32
Even after the context cancellation fix by @willscott in don't keep retrying on context errors #82, a significant number of requests (for both blocks and CARs) still fail with a context cancelled error. Let's instrument them more to help Bifrost debug this:
https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-6h&to=now&editPanel=30