Reduce the operator latency metric cardinality #768

Merged: 2 commits, Sep 20, 2024

@jianoaix jianoaix commented Sep 20, 2024

Why are these changes needed?

Currently the attestation_latency_ms metric has high cardinality: each operator has 12 time series (and there are many hundreds of operators), due to:

  • breakdown by status (failure/success)
  • breakdown by percentile, due to the use of a latency summary
  • extra sum and count series, also due to the use of a latency summary

This PR changes the metric to:

  • use a gauge instead of a latency summary/distribution: this avoids the percentiles, as well as sum/count
  • ignore failed requests: the latency of successful requests alone should give us enough understanding

As a result, there is 1 time series per operator.
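
The two changes above can be sketched with a small stand-in for a per-operator gauge. This is a minimal illustration using only the Go standard library, not the PR's code: the `operatorLatency` type and its methods are hypothetical names, and the real metric would presumably be a Prometheus gauge labeled by `operator_id`.

```go
package main

import (
	"fmt"
	"sync"
)

// operatorLatency is a hypothetical stand-in for a gauge labeled by operator_id:
// it holds one value per operator, overwritten on each observation.
type operatorLatency struct {
	mu     sync.Mutex
	values map[string]float64
}

func newOperatorLatency() *operatorLatency {
	return &operatorLatency{values: make(map[string]float64)}
}

// record stores the latest latency of a successful request; failed requests are ignored.
func (g *operatorLatency) record(operatorID string, success bool, latencyMS float64) {
	if !success {
		return
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[operatorID] = latencyMS
}

// seriesCount returns the number of distinct series, one per operator seen.
func (g *operatorLatency) seriesCount() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.values)
}

func main() {
	g := newOperatorLatency()
	g.record("op-a", true, 832)
	g.record("op-a", false, 30669) // ignored: failed request
	g.record("op-a", true, 857)    // overwrites the previous value
	g.record("op-b", true, 900)
	fmt.Println(g.seriesCount()) // prints 2
}
```

Because a gauge keeps only the latest value, it carries no quantile, sum, or count series, which is where the reduction to one series per operator comes from.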

12 time series per operator:

eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="failure",quantile="0.5"} NaN
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="failure",quantile="0.9"} NaN
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="failure",quantile="0.95"} NaN
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="failure",quantile="0.99"} NaN
eigenda_batcher_attestation_latency_ms_sum{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="failure"} 30669
eigenda_batcher_attestation_latency_ms_count{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="failure"} 1
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="success",quantile="0.5"} 832
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="success",quantile="0.9"} 857
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="success",quantile="0.95"} 857
eigenda_batcher_attestation_latency_ms{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="success",quantile="0.99"} 857
eigenda_batcher_attestation_latency_ms_sum{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="success"} 3.465767e+06
eigenda_batcher_attestation_latency_ms_count{operator_id="ff7ca6eb4373179d6d3fb35a62911d7dda217b44d3e2db42dc6b912c393d0663",status="success"} 3288
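
The count of 12 follows from the sample above: 2 statuses, each with 4 quantile series plus one sum and one count series. A short sketch of that arithmetic, where the operator count of 500 is an assumed figure for illustration only (the PR says "many hundreds"):

```go
package main

import "fmt"

// summarySeries counts the series a summary exposes for a given number of
// label combinations: one per quantile, plus _sum and _count.
func summarySeries(labelCombos, quantiles int) int {
	return labelCombos * (quantiles + 2)
}

func main() {
	const statuses = 2  // failure, success
	const quantiles = 4 // 0.5, 0.9, 0.95, 0.99

	perOperator := summarySeries(statuses, quantiles)
	fmt.Println(perOperator) // prints 12

	operators := 500 // hypothetical operator count
	fmt.Println(perOperator*operators, "->", 1*operators) // prints 6000 -> 500
}
```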

Before:
Screenshot 2024-09-19 at 6 26 00 PM

After:
Screenshot 2024-09-19 at 6 26 10 PM

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@jianoaix jianoaix requested review from dmanc and pschork September 20, 2024 01:35
- t.Latency.WithLabelValues(operatorId, label).Observe(latencyMS)
+ // The Latency metric has "operator_id" but we null it out because it's separately
+ // tracked in OperatorLatency.
+ t.Latency.WithLabelValues("", label).Observe(latencyMS)
Contributor

Do we need the label if we are just going to null it out?

Contributor Author

I haven't evaluated it much. Is it safe to delete the label for this?

Contributor

I don't see anything obvious that would go wrong. Maybe test in preprod?

Contributor Author

It'd be better to keep it for a while: we may want to look back at operator latency over the past month, for example, and if this label is gone, we cannot do that.
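
To illustrate what nulling out the label does to the Latency metric discussed in this thread: a rough sketch that counts distinct label combinations before and after passing an empty `operator_id`. The `seriesKey` type, the `distinctKeys` helper, and the operator IDs are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// seriesKey mirrors the two label values passed to WithLabelValues in the snippet above.
type seriesKey struct {
	operatorID string
	status     string
}

// distinctKeys counts the distinct label combinations in a set of observations.
func distinctKeys(keys []seriesKey) int {
	seen := make(map[seriesKey]struct{})
	for _, k := range keys {
		seen[k] = struct{}{}
	}
	return len(seen)
}

func main() {
	operators := []string{"op-a", "op-b", "op-c"} // hypothetical IDs
	var before, after []seriesKey
	for _, op := range operators {
		for _, status := range []string{"success", "failure"} {
			before = append(before, seriesKey{op, status})
			after = append(after, seriesKey{"", status}) // operator_id nulled out
		}
	}
	fmt.Println(distinctKeys(before), distinctKeys(after)) // prints 6 2
}
```

With the empty value, the number of label combinations depends only on the number of statuses and no longer grows with the number of operators, while the label name itself stays in the metric definition.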

@jianoaix jianoaix merged commit 8c6617f into Layr-Labs:master Sep 20, 2024
6 checks passed
2 participants