Is your feature request related to a problem? Please describe.
The metrics currently offered are labeled at task and worker granularity. In a large deployment, and especially with the default histogram buckets, this can cause the cardinality of the metrics ingested into Prometheus to explode. In many cases per-worker granularity is not needed to understand how each task type is performing overall. For that reason I'd like to gauge how open this project is to adding metrics that are rolled up at the task level.
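To give a rough sense of scale, here is a back-of-the-envelope sketch. The task and worker counts are hypothetical, and the bucket count of 15 (14 finite buckets plus `+Inf`) is an assumption about the client's defaults. The result is an upper bound, since not every task necessarily runs on every worker.

```python
def histogram_series(label_combinations: int, buckets: int = 15) -> int:
    """Series emitted by one histogram: one per bucket, plus _sum and _count.

    `buckets` = 15 is an assumed default (14 finite buckets plus +Inf).
    """
    return label_combinations * (buckets + 2)


# Hypothetical deployment size, for illustration only.
tasks, workers = 50, 200

# Worst case: every task type has been observed on every worker.
per_worker = histogram_series(tasks * workers)

# Rolled up at the task level: the worker label is gone.
per_task = histogram_series(tasks)

print(f"task+worker labels: {per_worker} series")
print(f"task label only:    {per_task} series")
```

Under these assumptions a single histogram shrinks by a factor equal to the number of workers.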
Unfortunately, relabeling in Prometheus is not a valid solution: dropping the worker label would violate the constraint that all samples in a given scrape must be distinct, because series that differ only by worker would collide on the same label set. Additionally, aggregating the metrics after ingestion doesn't address the resource cost of high-cardinality metrics at ingestion time.
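The collision can be illustrated with a small, self-contained sketch. It does not use Prometheus itself; the sample data and the `drop_label` helper are hypothetical and only model what happens to the label sets once the worker label is removed.

```python
from collections import Counter

# Hypothetical samples from one scrape: (labels, value).
samples = [
    ({"task": "send_email", "worker": "w1"}, 3.0),
    ({"task": "send_email", "worker": "w2"}, 5.0),
]


def drop_label(labels: dict, name: str) -> tuple:
    """Return the label set without `name`, as a hashable, sorted tuple."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != name))


# Count how many samples end up with each label set after the drop.
seen = Counter(drop_label(labels, "worker") for labels, _ in samples)

# Any label set seen more than once is no longer unique within the scrape.
collisions = {key: n for key, n in seen.items() if n > 1}
print(collisions)
```

Both samples end up with the identical label set, so they can no longer be told apart, and nothing in the relabeling step sums their values.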
Describe the solution you'd like
The proposed solution is to add metric instruments alongside the existing ones that lack the worker label. At every existing call site where those instruments are incremented, set, or observed, we would do the same for the new task-level metrics. In short: keep what we have, preserving backwards compatibility, and add a second set of metrics that aren't worker-specific. I would leave the worker-specific metrics, such as the number of workers online, alone; this would target only the task-oriented metrics.
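A minimal sketch of the dual-write pattern described above follows. `LabeledCounter` is a stand-in written for this example, not the project's or any client library's API, and the metric names and the `on_task_succeeded` handler are hypothetical.

```python
from collections import defaultdict


class LabeledCounter:
    """Tiny stand-in for a labeled counter instrument (illustration only)."""

    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = tuple(labelnames)
        self.values = defaultdict(float)  # one entry per label combination

    def inc(self, amount=1.0, **labels):
        key = tuple(labels[n] for n in self.labelnames)
        self.values[key] += amount


# Existing instrument: kept as-is for backwards compatibility.
task_succeeded = LabeledCounter("task_succeeded", ["task", "worker"])

# Proposed addition: same measurement, without the worker label.
task_succeeded_rollup = LabeledCounter("task_succeeded_rollup", ["task"])


def on_task_succeeded(task, worker):
    """Hypothetical call site: record into both instruments."""
    task_succeeded.inc(task=task, worker=worker)
    task_succeeded_rollup.inc(task=task)


for worker in ("w1", "w2", "w3"):
    on_task_succeeded("send_email", worker)

print(len(task_succeeded.values), "per-worker series")
print(len(task_succeeded_rollup.values), "rolled-up series")
```

The existing per-worker series are untouched, while the rolled-up instrument holds one series per task regardless of how many workers ran it. Users who need only the task-level view could then drop the per-worker metrics by name at scrape time without causing label collisions.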
I'm happy to send a PR for this change but first wanted to gauge whether it would be accepted.