Metrics Refactor and record metrics around L1 Server-Timings #70
Conversation
pool.go
Outdated
@@ -17,6 +17,10 @@ import (
	"github.com/serialx/hashring"
)

var (
	maxPoolSize = 300
should this be a const rather than a var?
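A minimal sketch of the suggested change, assuming `maxPoolSize` is never reassigned at runtime (illustrative only, not the actual patch):

```go
package main

import "fmt"

// Since maxPoolSize never changes at runtime, declaring it const lets
// the compiler reject accidental reassignment. (Sketch; value taken
// from the diff above.)
const maxPoolSize = 300

func main() {
	// maxPoolSize = 400 // would now fail to compile
	fmt.Println(maxPoolSize)
}
```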
pool.go
Outdated
goLogger.Infow("trimmed pool list to 200", "first", n[0].url, "first_weight",
	n[0].weight, "last", n[199].url, "last_weight", n[199].weight)
n = n[:maxPoolSize]
goLogger.Infow(fmt.Sprintf("trimmed pool size to %d", maxPoolSize), "first", n[0].url, "first_weight",
why the combination of Infow and Sprintf? can't you have "size", maxPoolSize as another pair of keys directly to Infow?
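The suggestion is to pass the size as a structured key/value pair rather than pre-formatting it into the message string. A minimal sketch of that call shape, using a hypothetical `infow` stand-in (the real `goLogger` appears to be a zap SugaredLogger, which is not imported here):

```go
package main

import "fmt"

// infow is a tiny stand-in for zap's SugaredLogger.Infow, here only to
// show the suggested call shape (hypothetical helper, not the real API).
func infow(msg string, keysAndValues ...interface{}) string {
	out := msg
	for i := 0; i+1 < len(keysAndValues); i += 2 {
		out += fmt.Sprintf(" %v=%v", keysAndValues[i], keysAndValues[i+1])
	}
	return out
}

func main() {
	maxPoolSize := 300
	// Instead of fmt.Sprintf("trimmed pool size to %d", maxPoolSize),
	// pass the size as another key so it stays machine-queryable:
	fmt.Println(infow("trimmed pool size", "size", maxPoolSize))
	// → trimmed pool size size=300
}
```

Keeping the size as a key (rather than baked into the message) means log aggregators can filter and aggregate on it directly.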
Help:    "Latency observed during failed caboose car fetches from a single peer",
Buckets: durationPerCarHistogram,
fetchDurationBlockFailureMetric = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: prometheus.BuildFQName("ipfs", "caboose", "fetch_duration_block_failure"),
renaming current metrics will break current dashboards - why is this changing?
@willscott Not all of the metrics have been renamed; only the ones that were incorrectly named or that have taken the place of repetitive ones.
I'll ensure the Caboose/Bifrost dashboards and metrics use the correct names. We're revamping both dashboards anyway.
fetcher.go
Outdated
@@ -284,6 +284,11 @@ func updateSuccessServerTimingMetrics(timingHeaders []string, resourceType strin
	fetchDurationPerPeerSuccessCacheMissTotalLassieMetric.WithLabelValues(resourceType).Observe(float64(m.Duration.Milliseconds()))
case "nginx":
	fetchDurationPerPeerSuccessTotalL1NodeMetric.WithLabelValues(resourceType, cache_status).Observe(float64(m.Duration.Milliseconds()))
	networkTimeMs := totalTimeMs - m.Duration.Milliseconds()
	fetchNetworkSpeedPerPeerSuccessMetric.WithLabelValues(resourceType).Observe(float64(recieved) / float64(networkTimeMs))
what about when networkTimeMs is 0? That leads to a division by zero.
Fixed and added more sanity checks.
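A minimal sketch of the kind of guard the fix implies; the helper name and the exact check are assumptions, not the merged code:

```go
package main

import "fmt"

// networkSpeed returns bytes per millisecond and whether the sample is
// meaningful. It guards the division-by-zero flagged in review: when the
// peer-reported duration equals the total time, networkTimeMs is 0 (or
// negative under clock skew), so the observation should be skipped.
// (Hypothetical helper for illustration.)
func networkSpeed(receivedBytes, totalTimeMs, peerTimeMs int64) (float64, bool) {
	networkTimeMs := totalTimeMs - peerTimeMs
	if networkTimeMs <= 0 {
		return 0, false
	}
	return float64(receivedBytes) / float64(networkTimeMs), true
}

func main() {
	if speed, ok := networkSpeed(2048, 100, 60); ok {
		fmt.Printf("%.1f bytes/ms\n", speed) // 2048 / 40 = 51.2
	}
	if _, ok := networkSpeed(2048, 60, 60); !ok {
		fmt.Println("skipped zero-duration sample")
	}
}
```

Returning an ok flag (rather than clamping to some sentinel value) keeps degenerate samples out of the histogram entirely, which avoids skewing the speed distribution.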
- Refactor/fix the metrics code for correctness, grokability, and DRY-ness
- Add more dimensions/labels to metrics to better capture the block/car and cache-hit/cache-miss cases
- Record Prometheus metrics around important L1 server-timings