Task level rollup of metrics #1413

agershman · 2024-12-15T17:14:29Z

Is your feature request related to a problem? Please describe.

The metrics currently offered are labeled at both task and worker granularity. In a highly scaled environment, combined with the default histogram buckets, this can cause an explosion in the cardinality of metrics ingested into Prometheus. In many cases worker-level granularity is not needed to understand which task types are performing at which level. For that reason, I'd like to see how open this project is to adding metrics that are rolled up at the task level.
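To put rough numbers on the cardinality concern, here is a back-of-the-envelope sketch. The deployment sizes and the bucket count are assumptions picked for illustration (the actual default bucket count depends on the client library in use), and `series_count` is a hypothetical helper, not part of this project.

```python
def series_count(task_types: int, workers: int, buckets: int) -> int:
    """Series emitted by one histogram labeled by task and worker.

    Each label combination yields one series per bucket (including +Inf)
    plus one _sum and one _count series.
    """
    per_label_set = buckets + 2
    return task_types * workers * per_label_set


# Hypothetical deployment sizes, chosen only for illustration.
TASK_TYPES = 50
WORKERS = 200
BUCKETS = 15  # assumed bucket count, including +Inf

per_worker = series_count(TASK_TYPES, WORKERS, BUCKETS)
rolled_up = series_count(TASK_TYPES, 1, BUCKETS)  # worker label removed

print(f"per-worker histogram: {per_worker} series")
print(f"task-level histogram: {rolled_up} series")
print(f"reduction factor:     {per_worker // rolled_up}x")
```

Under these assumptions the series count for a single histogram scales linearly with the number of workers, so a task-level rollup shrinks it by exactly the worker count.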

Unfortunately, relabeling in Prometheus is not a valid solution: dropping the worker label would violate the constraint that all samples in a given scrape must be distinct, because samples that differ only by worker would collide on the same label set. Additionally, aggregating the metrics after ingestion doesn't solve the resource cost of high-cardinality metrics at ingestion time.
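A small simulation of that collision, using plain Python rather than Prometheus itself. The metric name, label values, and helper functions are all made up for illustration; the point is only that removing `worker` leaves several samples with an identical identity.

```python
from collections import defaultdict

# Hypothetical samples from a single scrape: (metric name, labels, value).
scrape = [
    ("task_runtime_count", {"task": "send_email", "worker": "w1"}, 10.0),
    ("task_runtime_count", {"task": "send_email", "worker": "w2"}, 7.0),
    ("task_runtime_count", {"task": "resize_image", "worker": "w1"}, 3.0),
]


def drop_label(labels, name):
    """Return a copy of the label set without the given label."""
    return {k: v for k, v in labels.items() if k != name}


def find_collisions(samples, label_to_drop):
    """Group samples by identity after dropping a label; keep duplicates only."""
    seen = defaultdict(list)
    for metric, labels, value in samples:
        remaining = drop_label(labels, label_to_drop)
        key = (metric, tuple(sorted(remaining.items())))
        seen[key].append(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


collisions = find_collisions(scrape, "worker")
for (metric, labels), values in collisions.items():
    print(metric, dict(labels), "-> conflicting values:", values)
```

Here the two `send_email` samples end up with the same name and label set but different values, which is the collision described above.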

Describe the solution you'd like

The proposed solution is to add metric instruments alongside the existing ones, but without the worker label. At every existing call site where those instruments are incremented, set, or observed into, we'd do the same for the new task-level metrics. In short: keep what we have, preserving backwards compatibility, and add a second set of metrics that aren't worker specific. I would leave genuinely worker-specific metrics, such as the number of workers online, alone. This would only target the task-oriented metrics.
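A minimal sketch of the dual-recording idea, assuming nothing about this project's actual code. `LabeledCounter` is a tiny stand-in written for this example (not a real client library API), and the metric names and the `on_task_succeeded` handler are hypothetical.

```python
from collections import Counter


class LabeledCounter:
    """Tiny stand-in for a labeled counter instrument (illustration only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = Counter()

    def inc(self, amount=1.0, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"{self.name} expects labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += amount


# Existing instrument: unchanged, so current dashboards keep working.
task_succeeded = LabeledCounter("task_succeeded_total", ["task", "worker"])
# New instrument: same meaning, but without the worker label.
task_succeeded_by_task = LabeledCounter("task_succeeded_by_task_total", ["task"])


def on_task_succeeded(task, worker):
    """Hypothetical call site: record into both instruments."""
    task_succeeded.inc(task=task, worker=worker)
    task_succeeded_by_task.inc(task=task)


for worker in ("w1", "w2", "w3"):
    on_task_succeeded("send_email", worker)
on_task_succeeded("resize_image", "w1")

print(len(task_succeeded.values), "per-worker series")
print(len(task_succeeded_by_task.values), "task-level series")
```

Users who only need the rollup could then drop the per-worker metric by name at scrape time, which avoids the label collision because whole series are removed rather than a single label.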

I'm happy to send a PR for this change but first wanted to gauge whether it would be accepted.


agershman added the enhancement label Dec 15, 2024
