
Metrics Transform Processor Proposal #332

Closed
JingboWangGoogle opened this issue Jun 12, 2020 · 20 comments

JingboWangGoogle (Contributor) commented Jun 12, 2020

Metrics Transform Processor

  • The first version of aggregation will only support single-batch processing on metrics, to avoid conflicts between different batches that contain the same time series (this corner case is described further in the comments below).

Objectives

The objective of this metrics transform processor is to give OpenTelemetry Collector users the flexibility to rename and aggregate metrics in the ways they need, so that the metrics are more relevant and less costly for them.

Background

There are cases where metrics are overly detailed for a user’s needs, which also imposes unnecessary cost. In these cases, the user wants to compress the metrics flowing into the collector in some meaningful way before they are written out to the metrics backend. In addition, users may want to significantly transform the format in which metrics appear. This may even be a requirement when a backend expects metrics to match specific existing formats and names.
Related Work: Filter Processor (Issue, Pull Request).

Requirements

The specific requirements include:

  • rename metrics (e.g. rename cpu/usage to cpu/usage_time)
  • rename labels (e.g. rename cpu to core)
  • rename label values (e.g. rename done to complete)
  • aggregate across label sets (e.g. only want the label usage, but don’t care about the labels core, and cpu)
    • aggregation_type: sum, average, max
  • aggregate across label values (e.g. want memory{slab}, but don’t care about memory{slab_reclaimable} & memory{slab_unreclaimable})
    • aggregation_type: sum, average, max

Design Ideas

Proposed Configuration

transforms:
  # name is used to match the metric to operate on. This implementation doesn’t
  # utilize the filtermetric’s MatchProperties struct because it doesn’t match well
  # with what I need at this phase. All that is needed for this processor at this
  # stage is a single name string that can be used to match selected metrics. The
  # list of metric names and the match type in the filtermetric’s MatchProperties
  # struct are unnecessary. Also, based on the issue about improving filtering
  # configuration, it seems like this struct is likely to be slightly modified.
  - name: <current_metric_name>

  # action specifies if the operations are performed on the current copy of the 
  # metric or on a newly created metric that will be inserted
    action: {update, insert}

  # new_name is used to rename metrics (e.g. rename cpu/usage to cpu/usage_time)
  # if action is insert, new_name is required
    new_name: <new_metric_name_inserted>

  # operations contains a list of operations that will be performed on the selected
  # metrics. Each operation block is a key-value pair, where the key can be any
  # arbitrary string set by the user for readability, and the value is a struct with
  # fields required for the operation. The action field is important for the processor
  # to identify exactly which operation to perform.
    operations:

    # update_label action can be used to update the name of a label or the values           
    # of this label (e.g. rename label `cpu` to `core`)
      - action: update_label
        label: <current_label1>
        new_label: <new_label>
        value_actions:
         - value: <current_label_value>
           new_value: <new_label_value>

    # aggregate_labels action aggregates metrics across labels (e.g. only want
    # the label `usage`, but don’t care about the labels `core` and `cpu`)
      - action: aggregate_labels
      # label_set contains a list of labels that will remain after the aggregation.
      # The excluded labels will be aggregated in the way specified by
      # aggregation_type.
        label_set: [labels...]
        aggregation_type: {sum, average, max}

    # aggregate_label_values action aggregates across label values (e.g. want
    # memory{slab}, but don’t care about memory{slab_reclaimable} &
    # memory{slab_unreclaimable})
      - action: aggregate_label_values
        label: <label>
      # aggregated_values contains a list of label values that will be aggregated
      # in the way specified by aggregation_type into new_value. The excluded label
      # values will remain.
        aggregated_values: [values...]
        new_value: <new_value>
        aggregation_type: {sum, average, max}

Examples

Insert New Metric

# create host/cpu/utilization from host/cpu/usage
name: host/cpu/usage
action: insert
new_name: host/cpu/utilization
operations:
  ...

Rename Labels

# rename the label cpu to core
operations:
  - action: update_label
    label: cpu
    new_label: core

Aggregate Labels

# aggregate away everything but `state` using summation
...
operations:
  - action: aggregate_labels
    label_set: [ state ]
    aggregation_type: sum

Aggregate Label Values

# combine slab_reclaimable & slab_unreclaimable by summation
...
operations:
  - action: aggregate_label_values
    label: state
    aggregated_values: [ slab_reclaimable, slab_unreclaimable ]
    new_value: slab
    aggregation_type: sum
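
For completeness, here is a sketch of a full transform entry that combines renaming and aggregation in a single block. This is only an illustration assembled from the fields proposed above; the metric and label names are made up and not part of the proposal.

# rename host/cpu/usage and aggregate away everything except `state` and `core`
transforms:
  - name: host/cpu/usage
    action: update
    new_name: host/cpu/usage_time
    operations:
      - action: update_label
        label: cpu
        new_label: core
      - action: aggregate_labels
        label_set: [ state, core ]
        aggregation_type: sum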

Possible Extensions

  • Supporting custom types of aggregation (e.g. defined by a formula)
  • Support aggregations of non-simple metric types (distributions, etc)
  • Support aggregation over time
  • Utilizing regex to select metrics

Thanks to

@draffensperger @james-bebbington @quentinmit

jrcamp (Contributor) commented Jun 12, 2020

I wonder if this should be broken up into two different processors. One would be analogous to attributesprocessor but for metrics; I think this one should live in core:

rename metrics (e.g. rename cpu/usage to cpu/usage_time)
rename labels (e.g. rename cpu to core)
rename label values (e.g. rename done to complete)

And a separate one for aggregation that lives in contrib for now as it may take some trial and error to get the right config and functionality:

aggregate across label sets (e.g. only want the label usage, but don’t care about the labels core, and cpu)
aggregation_type: sum, average, max
aggregate across label values (e.g. want memory{slab}, but don’t care about memory{slab_reclaimable} & memory{slab_unreclaimable})
aggregation_type: sum, average, max

Note that there are discussions about changes to attributesprocessor that we would probably want to mirror: open-telemetry/opentelemetry-collector#979 (comment)

bogdandrutu (Member) commented Jun 14, 2020

aggregate across label sets (e.g. only want the label usage, but don’t care about the labels core, and cpu)
aggregation_type: sum, average, max

Keep in mind that this is not easy. There are corner cases where metrics received in two different batches will, after label removal, generate the same time series.

First batch:
usage=foo;core=1

Second batch:
usage=foo;core=2

Without keeping state between different batches, this will produce wrong aggregations, so it will work only if the received batch contains all the values.
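
To make the corner case concrete, here is an illustration with made-up values (these numbers are not from the discussion above):

# Aggregating away `core` independently in each batch:
#   Batch 1: usage=foo;core=1 -> 5    becomes    usage=foo -> 5
#   Batch 2: usage=foo;core=2 -> 7    becomes    usage=foo -> 7
# The backend now receives two points for the same time series (usage=foo).
# A stateful, cross-batch aggregation would instead emit usage=foo -> 12.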

JingboWangGoogle (Contributor, Author) commented:

@jrcamp I really appreciate your feedback on this. Initially, I thought it might make sense to put aggregation and rename together because, after aggregation, the semantics of that metric will very likely change, so a more accurate name should be assigned. But separating them makes for a cleaner design because the two are essentially doing different things. Therefore, I have made adjustments to the proposal in this issue. :)

quentinmit (Contributor) commented:

@jrcamp @JingboWangGoogle
I think I was one of the people who suggested that rename and aggregate should be part of the same operation. More specifically I suggested that because it allows you to use the same source metric to produce two different destination metrics with different aggregation. I'm not sure how you do that if they're separate steps.

JingboWangGoogle (Contributor, Author) commented:

@bogdandrutu Thank you for pointing this out! For now, I am aiming to have the aggregation processor only handle single-batch processing (e.g. the collector collects the host metrics, which is the only source of metrics into the collector). The corner case here can be solved by the "aggregation over time" item under the Possible Extensions section.

jrcamp (Contributor) commented Jun 15, 2020

@JingboWangGoogle

btw, have you taken a look at the spec relating to aggregations in the SDK:

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#aggregations
open-telemetry/oteps#89

I'm not deeply familiar with them and don't see anything conflicting between them but wanted to make sure you saw them.

@bogdandrutu it feels like, in the long run, we should ideally have parity between the collector's and the SDK's aggregations. What do you think?

I think I was one of the people who suggested that rename and aggregate should be part of the same operation. More specifically I suggested that because it allows you to use the same source metric to produce two different destination metrics with different aggregation. I'm not sure how you do that if they're separate steps.

@quentinmit With aggregation you are either 1) producing a new (derived) metric or 2) replacing the existing metric with the aggregated values. When renaming you're keeping all the same data but changing the metric name/labels. It seems they're different enough to warrant treating them separately. Especially for somebody who only wants to rename, I think it'd be confusing to try to do renaming via aggregation (presumably via an aggregation that is actually a noop?). Am I missing something?

quentinmit (Contributor) commented Jun 15, 2020

@quentinmit With aggregation you are either 1) producing a new (derived) metric or 2) replacing the existing metric with the aggregated values. When renaming you're keeping all the same data but changing the metric name/labels. It seems they're different enough to warrant treating them separately. Especially for somebody who only wants to rename, I think it'd be confusing to try to do renaming via aggregation (presumably via an aggregation that is actually a noop?). Am I missing something?

@jrcamp I don't think the user would need to think about aggregation if they only want to rename, or renaming if they only want to aggregate. The default config for each operation should already be a noop. Maybe it's better to illustrate with some strawman configs:

metricsprocessor:
# Remove the core label from the CPU metric
- metric_name: cpu/utilization
  aggregate:
   fields: [core]
   invert_fields: true
   reducer: MEAN  # MEAN gives you 0-100, SUM/default gives you 0-n00
# Rename request_count
- metric_name: app/request_count
  source_metric_name: apache/request_count
- metric_name: apache/request_count
  drop: true  # overlaps with filter processor
# Denormalize request count to save space
- metric_name: istio/request_count_by_response_code
  source_metric_name: istio/request_count
  aggregate:
   fields: [response_code]
- metric_name: istio/request_count_by_source_workload
  source_metric_name: istio/request_count
  aggregate:
   fields: [source_workload]
- metric_name: istio/request_count
  drop: true

Remember that when filtering labels, you always need to simultaneously perform aggregation (since there can be two or more time series that turn into one).
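
As a small illustration of that point (made-up numbers): dropping a label merges series, so a reducer has to be applied:

# Input series:
#   cpu/utilization{core=0} = 40
#   cpu/utilization{core=1} = 60
# After removing `core`, both collapse into a single series and must be combined:
#   MEAN -> 50        SUM -> 100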

I also think there's an argument to be made that modularity is worth trading off for the simplified configuration.

jrcamp (Contributor) commented Jun 15, 2020

It also needs to be able to rename labels (perhaps scoped to a particular metric). How would that fit into the config?

quentinmit (Contributor) commented:

@jrcamp I think renaming labels, and especially renaming label values, is also intertwined with aggregation (because renamed label values can collide with existing time series that have that value, or you might want to explicitly map two values to one).

There are a couple of different ways you could do label renaming; there's what was suggested above by @JingboWangGoogle:

  - action: update_label
    label: cpu
    new_label: core

and I could also imagine, for some use cases, wanting to remap all the labels at once:

- map_labels:
  core: cpu
  socket: index

(and therefore any label not mentioned would be dropped). I'm not sure which of these, or both, should be supported; I'm inclined to err on the side of making the config as easy to understand as possible, which weakly suggests that we should support both. I'm open to your thoughts on that.

jrcamp (Contributor) commented Jun 16, 2020

In my mind renaming may happen as a side effect of aggregation. But you may also want to rename without aggregating. For example, the OT metric is named gc.runtime and you expect that metric to be named java/gc/runtime (just making up an example). Yes, the user would have to make sure there are no collisions when renaming metrics, just as they must not emit colliding metrics in the first place.

I think the renaming part is pretty straightforward and should go in core, whereas the aggregation may need to be iterated on. Given that structural situation, I think they must be separate (at least for now).

@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Jun 16, 2020
JingboWangGoogle (Contributor, Author) commented:

As a record of the conclusion from our meeting, we have decided to keep this as one processor, but it will stay in the contrib repo for now.

jmacd (Contributor) commented Jun 23, 2020

In #290 we are discussing how to incorporate a (dog)statsd receiver into the collector. One of the features of statsd data is that it arrives unaggregated, as point data. Feeding statsd data directly into the collector is probably not what users want. I think what users do want is to configure which aggregations are applied to the data. This could be implemented specifically inside a statsd receiver, but it's appealing to think we could transform statsd data points into raw OTLP data and then rely on a downstream processor to aggregate the data in a way the user configures.

Does this sound like a reasonable use-case for the collector metrics processor?

quentinmit (Contributor) commented:

It sounds quite reasonable to me, though can you clarify what you mean by "point data"?

We're initially targeting aggregation-across-dimensions with this processor, and not aggregation-across-time. So if you get e.g. 5 separate points in a minute and want an output of "5" that would be aggregation-across-time. We definitely want to support that but it's more complicated so it's not likely to be in the first version of this processor.

jmacd (Contributor) commented Jun 23, 2020

For example, when statsd sends histogram data, it sends one value at a time (with an "h" designation) and the receiver is expected to build the histogram. We could have a statsd receiver emit raw data points into the OTLP representation, but I wouldn't expect the exporters to transform data back into a suitable format. For example, I would not expect a Prometheus exporter to have to compute a histogram over the raw data, that sounds like part of a processing pipeline.

jmacd (Contributor) commented Jun 23, 2020

The "Across time" part is indeed trickier. The OTel metric clients are already doing similar things, so I understand why this is hard. You'd have to buffer a window of data (SDK term for this: an Accumulator) before applying temporal aggregations.

The statsd receiver could be configured with a small buffer of time, to work around this matter. It would accumulate 10 seconds of data as a single OTLP batch and send it for processing. The processing would not necessarily remove any labels, but it would be required to compute a histogram (or other aggregation) from raw points.

Sounds like this is a good fit.
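
For illustration, such a pipeline might be wired up roughly as sketched below. This is only a sketch: the statsd receiver option shown is an assumption rather than a settled design, and the transform simply reuses the fields proposed in this issue with made-up metric and label names.

receivers:
  statsd:
    # assumption: buffer raw statsd points into ~10 second batches before emitting OTLP
    aggregation_interval: 10s
processors:
  metricstransform:
    transforms:
      - name: request_latency
        action: update
        operations:
          - action: aggregate_labels
            label_set: [ endpoint ]
            aggregation_type: sum
exporters:
  logging: {}
service:
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [metricstransform]
      exporters: [logging]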

jmacd (Contributor) commented Jun 24, 2020

After reading through this proposal again, it appears to be another form of the Views API in disguise.

open-telemetry/oteps#89

It seems like we could try to find a common configuration language for describing views here.

draffensperger (Contributor) commented:

Hi @jmacd that's a good point about this being related to views. @JingboWangGoogle is a Google intern working on this over the summer and our hope is that we can get a basic version of this that doesn't do across-time aggregations but that only does single-batch renames/aggregations.

I think since it's in contrib it wouldn't be viewed as final and should be refactored to better conform to the views spec. Maybe we could meet to chat more if you're interested!

JingboWangGoogle (Contributor, Author) commented:

Regarding the feature brought up by @jmacd, we have agreed to move this to a later version, specifically after #417, which concludes the MVP for this proposed processor.

@jrcamp jrcamp added the enhancement New feature or request label Jul 21, 2020
mxiamxia referenced this issue in mxiamxia/opentelemetry-collector-contrib Jul 22, 2020
* Allow metric processors to be specified in pipelines

* Fix typo

* Fix formatting

* Add fix for unit test
jrcamp (Contributor) commented Jul 30, 2020

@JingboWangGoogle is there any more work that needs to be done to close out this issue given the above PRs have been merged?

@jrcamp jrcamp added this to the GA 1.0 milestone Jul 30, 2020
JingboWangGoogle (Contributor, Author) commented:

@jrcamp Based on the content of this proposal and the features implemented in the merged PRs, this issue can be closed, as everything proposed here has been realized in those PRs. Also tagging @draffensperger @quentinmit just for transparency.
