Metrics Transform Processor Proposal #332
Comments
I wonder if this should be broken up into two different processors? One would be analogous to attributesprocessor but for metrics; I think that one should live in core.
And a separate one for aggregation could live in contrib for now, as it may take some trial and error to get the right config and functionality.
Note that there are discussions about changes to attributesprocessor that we would probably want to mirror: open-telemetry/opentelemetry-collector#979 (comment)
Keep in mind that this is not easy. There are corner cases where metrics received in two different batches will, after label removal, produce the same timeseries (part of the series arriving in a first batch and part in a second batch). Without keeping state between batches this will produce wrong aggregations, so it will only work if the received batch contains all the values.
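To make that corner case concrete, here is a hypothetical illustration (the metric name, labels, and values are invented for this example):

```
# Batch 1 contains: cpu/usage{host: a, core: 0} = 5
# Batch 2 contains: cpu/usage{host: a, core: 1} = 7
#
# After the `core` label is removed, both points belong to the same
# timeseries cpu/usage{host: a}. Aggregating each batch in isolation
# emits 5 and then 7 for that series instead of the correct total of 12,
# unless state is kept across batches.
```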
@jrcamp I really appreciate your feedback on this. Initially, I thought it might make sense to put aggregation and rename together, because after aggregation the semantics of the metric will very likely change, and thus a more accurate name should be reassigned. But separating these makes for a cleaner design because the two are essentially doing different things. Therefore, I have made adjustments to this proposal. :)
@jrcamp @JingboWangGoogle
@bogdandrutu Thank you for pointing this out! For now, I am aiming to have the aggregation processor only handle single-batch processing (e.g. the collector collects the host metrics, which are the only source of metrics into the collector). The corner case here can be solved by "aggregation over time" under the possible extensions section.
btw, have you taken a look at the spec relating to aggregations in the SDK? https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#aggregations I'm not deeply familiar with them and don't see anything conflicting between them, but wanted to make sure you saw them. @bogdandrutu it feels like we should ideally, in the long run, have parity between the collector's and the SDK's aggregations; what do you think?
@quentinmit With aggregation you are either 1) producing a new (derived) metric or 2) replacing the existing metric with the aggregated values. When renaming, you're keeping all the same data but changing the metric name/labels. It seems they're different enough to warrant treating them separately. Especially for somebody who only wants to rename, I think it'd be confusing to try to do renaming via aggregation (presumably by an aggregation that is actually a noop?). Am I missing something?
@jrcamp I don't think the user would need to think about aggregation if they only want to rename, or renaming if they only want to aggregate. The default config for each operation should already be a noop. Maybe it's better to illustrate with some strawman configs:

```yaml
metricsprocessor:
  # Remove the core label from the CPU metric
  - metric_name: cpu/utilization
    aggregate:
      fields: [core]
      invert_fields: true
      reducer: MEAN # MEAN gives you 0-100, SUM/default gives you 0-n00
  # Rename request_count
  - metric_name: app/request_count
    source_metric_name: apache/request_count
  - metric_name: apache/request_count
    drop: true # overlaps with filter processor
  # Denormalize request count to save space
  - metric_name: istio/request_count_by_response_code
    source_metric_name: istio/request_count
    aggregate:
      fields: [response_code]
  - metric_name: istio/request_count_by_source_workload
    source_metric_name: istio/request_count
    aggregate:
      fields: [source_workload]
  - metric_name: istio/request_count
    drop: true
```

Remember that when filtering labels, you always need to simultaneously perform aggregation (since there can be two or more time series that turn into one). I also think there's an argument to be made that modularity is worth trading off for the simplified configuration.
It also needs to be able to rename labels (perhaps scoped to a particular metric). How would that fit into the config?
@jrcamp I think renaming labels, and especially renaming label values, is also intertwined with aggregation (because renamed label values can collide with existing time series that have that value, or you might want to explicitly map two values to one). There are a couple of different ways you could do label renaming; there's what was suggested above by @JingboWangGoogle:

```yaml
- action: update_label
  label: cpu
  new_label: core
```

and I could also imagine, for some use cases, wanting to remap the labels all at once:

```yaml
- map_labels:
    core: cpu
    socket: index
```

(and therefore any label not mentioned would be dropped). I'm not sure which of these, or both, should be supported; I'm inclined to err on the side of making the config as easy to understand as possible, which weakly suggests that we should support both. I'm open to your thoughts on that.
In my mind renaming may happen as a side effect of aggregation. But you may also want to rename without aggregating; for example, the OT metric name may simply not match the name a particular backend expects. I think the renaming part is pretty straightforward and should go in core, whereas the aggregation may need to be iterated on. Given that structural situation, I think they must be separate (at least for now).
As a record of our conclusion from our meeting, we have decided to keep it as one processor, but this processor will stay in the contrib repo for now.
In #290 we are discussing how to incorporate a (dog)statsd receiver into the collector. One of the features of statsd data is that it arrives unaggregated, as point data. Feeding statsd data directly into the collector is probably not what users want. I think what users do want is to configure which aggregations are applied to the data. This could be implemented specifically inside a statsd receiver, but it's appealing to think we could transform statsd data points into raw OTLP data and then rely on a downstream processor to aggregate the data in a way the user configures. Does this sound like a reasonable use case for the collector metrics processor?
It sounds quite reasonable to me, though can you clarify what you mean by "point data"? We're initially targeting aggregation-across-dimensions with this processor, and not aggregation-across-time. So if you get e.g. 5 separate points in a minute and want an output of "5", that would be aggregation-across-time. We definitely want to support that, but it's more complicated, so it's not likely to be in the first version of this processor.
For example, when statsd sends histogram data, it sends one value at a time (with an "h" designation) and the receiver is expected to build the histogram. We could have a statsd receiver emit raw data points into the OTLP representation, but I wouldn't expect the exporters to transform data back into a suitable format. For example, I would not expect a Prometheus exporter to have to compute a histogram over the raw data; that sounds like part of a processing pipeline.
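For illustration only (this exact line does not appear in the thread), a single statsd-style histogram observation on the wire looks roughly like the line below, and the receiver is expected to fold many such lines into one histogram:

```
request.latency:350|h
```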
The "Across time" part is indeed trickier. The OTel metric clients are already doing similar things, so I understand why this is hard. You'd have to buffer a window of data (SDK term for this: an Accumulator) before applying temporal aggregations. The statsd receiver could be configured with a small buffer of time, to work around this matter. It would accumulate 10 seconds of data as a single OTLP batch and send it for processing. The processing would not necessarily remove any labels, but it would be required to compute a histogram (or other aggregation) from raw points. Sounds like this is a good fit. |
After reading through this proposal again, it appears to be another form of the Views API in disguise. It seems like we could try to find a common configuration language for describing views here.
Hi @jmacd, that's a good point about this being related to views. @JingboWangGoogle is a Google intern working on this over the summer, and our hope is that we can get a basic version that doesn't do across-time aggregations but only does single-batch renames/aggregations. Since it's in contrib it wouldn't be viewed as final, and it could be refactored later to better conform to the views spec. Maybe we could meet to chat more if you're interested!
* Allow metric processors to be specified in pipelines
* Fix typo
* Fix formatting
* Add fix for unit test
@JingboWangGoogle is there any more work that needs to be done to close out this issue given the above PRs have been merged? |
@jrcamp Based on the content of this proposal and the features implemented in the merged PRs, this issue can be closed: everything proposed here has been realized in those PRs. Also pinging @draffensperger @quentinmit just for transparency.
Metrics Transform Processor
Objectives
The objective of this metrics transform processor is to give OpenTelemetry Collector users the flexibility to rename and aggregate metrics in desired ways, so that the metrics are more relevant and cheaper for users.
Background
There are cases where metrics are overly detailed for a user's needs, which also imposes unnecessary cost. In these cases, the user wants to compress the metrics coming into the collector in some meaningful way before writing them out to the metrics backend. In addition, users may want to significantly transform the format in which metrics appear. This may even be a requirement, because some backends require metrics to match specific existing formats and names.
Related Work: Filter Processor (Issue, Pull Request).
Requirements
The specific requirements include:
- Rename metrics (e.g. rename `cpu/usage` to `cpu/usage_time`)
- Rename labels (e.g. rename `cpu` to `core`)
- Rename label values (e.g. rename `done` to `complete`)
- Aggregate across label sets (e.g. only care about the metric `usage`, but don't care about the labels `core` and `cpu`)
- Aggregate across label values (e.g. only care about `memory{slab}`, but don't care about `memory{slab_reclaimable}` & `memory{slab_unreclaimable}`)

Design Ideas
Proposed Configuration
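The configuration block that originally accompanied this section is not reproduced here. As a rough sketch only, a configuration combining the operations discussed in the comments above might look like the following; every field name (`metric_name`, `source_metric_name`, `drop`, `aggregate`, `fields`, `invert_fields`, `reducer`) is taken from the strawman configs in this thread and should be treated as provisional, not as the final schema:

```yaml
metricsprocessor:
  # rename cpu/usage to cpu/usage_time, then drop the original
  - metric_name: cpu/usage_time
    source_metric_name: cpu/usage
  - metric_name: cpu/usage
    drop: true
  # remove the core label from the CPU metric, averaging the collapsed series
  - metric_name: cpu/utilization
    aggregate:
      fields: [core]
      invert_fields: true
      reducer: MEAN
```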
Examples
Insert New Metric
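No example survived for this section. A minimal sketch, reusing the `metric_name`/`source_metric_name` strawman fields from the discussion above (the names are assumptions), where a renamed copy is inserted while the original metric is kept:

```yaml
metricsprocessor:
  # create cpu/usage_time from cpu/usage; cpu/usage itself is left untouched
  - metric_name: cpu/usage_time
    source_metric_name: cpu/usage
```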
Rename Labels
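A sketch based on the `update_label` shape proposed in the comments above; the structure wrapped around the operation (`metric_name`, `operations`) is an assumption:

```yaml
metricsprocessor:
  - metric_name: cpu/usage_time
    operations:
      # rename the label key cpu -> core on this metric
      - action: update_label
        label: cpu
        new_label: core
```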
Aggregate Labels
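A sketch of aggregating labels away, using the `aggregate`/`fields`/`reducer` strawman fields from the comments above and the `usage`/`core`/`cpu` example from the requirements; the exact names are assumptions:

```yaml
metricsprocessor:
  # collapse the core and cpu labels on usage, summing the merged series
  - metric_name: usage
    aggregate:
      fields: [core, cpu]
      invert_fields: true   # listed fields are removed, per the strawman above
      reducer: SUM
```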
Aggregate Label Values
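A sketch of aggregating label values, merging `memory{slab_reclaimable}` and `memory{slab_unreclaimable}` into `memory{slab}` per the requirement above; no concrete syntax for this case appears in the thread, so the field names and the `state` label key here are purely hypothetical:

```yaml
metricsprocessor:
  - metric_name: memory
    aggregate_values:
      label: state                                     # hypothetical label key
      values: [slab_reclaimable, slab_unreclaimable]   # values to merge
      new_value: slab
      reducer: SUM
```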
Possible Extensions
Thanks to
@draffensperger @james-bebbington @quentinmit