Task level rollup of metrics #1413

Open
agershman opened this issue Dec 15, 2024 · 0 comments


agershman commented Dec 15, 2024

Is your feature request related to a problem? Please describe.

The metrics currently offered are labeled at both the task and the worker level. In a highly scaled environment, and with the default histogram buckets, this can cause the cardinality of the metrics ingested by Prometheus to explode. In many cases worker-level granularity isn't needed to understand how each task type is performing overall. For that reason I'd like to see how open this project is to adding metrics that are rolled up at the task level.
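To put rough numbers on the cardinality concern, here is a back-of-the-envelope sketch. The task, worker, and bucket counts are hypothetical, and it assumes the worst case where every task type runs on every worker. It relies only on the fact that a Prometheus histogram exposes one series per bucket plus a `_sum` and a `_count` series for each label combination.

```python
def histogram_series(label_combinations: int, buckets_incl_inf: int) -> int:
    """Series exposed by one histogram: one per bucket, plus _sum and _count."""
    return label_combinations * (buckets_incl_inf + 2)


# Hypothetical deployment sizes, chosen only for illustration.
TASKS = 50
WORKERS = 200
BUCKETS = 15  # assumed bucket count, including the +Inf bucket

# Worst case: every task type has been observed on every worker.
per_worker = histogram_series(TASKS * WORKERS, BUCKETS)
# Rolled up: one label combination per task type.
per_task = histogram_series(TASKS, BUCKETS)

print(per_worker, per_task)
```

Under these assumptions a single histogram goes from 850 series to 170,000 once the worker label is attached.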

Unfortunately, relabeling in Prometheus is not a valid solution: dropping the worker label would violate the constraint that all samples in a given scrape must be distinct, producing label collisions, which is a no-go. Additionally, aggregating the metrics after ingestion doesn't solve the resource cost that high-cardinality metrics incur at ingestion time.
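The collision can be illustrated with a small plain-Python simulation. This is not Prometheus code, and the label names `name` and `worker` as well as the sample values are made up for the example. It simply shows that removing the worker label leaves several samples with an identical label set.

```python
from collections import Counter

# Hypothetical samples from one scrape: (labels, value).
samples = [
    ({"name": "task-a", "worker": "w1"}, 3.0),
    ({"name": "task-a", "worker": "w2"}, 5.0),
    ({"name": "task-b", "worker": "w1"}, 1.0),
]


def drop_label(labels, label):
    """Return a copy of the label set without the given label."""
    return {k: v for k, v in labels.items() if k != label}


def colliding_label_sets(samples, label):
    """Label sets that appear more than once after dropping `label`."""
    counts = Counter(
        frozenset(drop_label(labels, label).items()) for labels, _ in samples
    )
    return [dict(key) for key, count in counts.items() if count > 1]


collisions = colliding_label_sets(samples, "worker")
print(collisions)
```

Here the two `task-a` samples end up with the same label set, so they are no longer distinct within the scrape.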

Describe the solution you'd like

The proposed solution is to add metric instruments alongside the existing ones, but without the worker label. At every existing call site where those instruments are incremented, set, or observed, we'd do the same for these task-level metrics. Basically, keep what we have, preserving backwards compatibility, and add a second set of metrics that aren't worker specific. I would leave genuinely worker-specific metrics, such as the number of workers online, alone. This would only target the task-oriented metrics.
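A minimal sketch of the dual-recording idea follows. It uses a plain-Python stand-in for a labeled counter rather than the project's real instruments, and the metric names, label names, and the `record_success` helper are all hypothetical.

```python
from collections import defaultdict


class LabeledCounter:
    """Tiny in-memory stand-in for a labeled counter (not a real client)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Key each series by its label values, in declaration order.
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += amount


# Existing-style instrument, kept as-is for backwards compatibility.
per_worker = LabeledCounter("task_succeeded", ["name", "worker"])
# Additional instrument without the worker label (hypothetical name).
per_task = LabeledCounter("task_succeeded_rollup", ["name"])


def record_success(task_name, worker):
    """Update both instruments at the same call site."""
    per_worker.inc(name=task_name, worker=worker)
    per_task.inc(name=task_name)


record_success("task-a", "w1")
record_success("task-a", "w2")
record_success("task-b", "w1")
```

After these three calls the per-worker instrument holds three series, while the rollup holds two, one per task name. Whether the per-worker set could later be disabled is a separate question and not part of this sketch.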

I'm happy to send a PR for this change but first wanted to gauge whether it would be accepted.
