
Allow exporters to influence Aggregation #3762

Merged
merged 23 commits into open-telemetry:main on Oct 26, 2021

Conversation

jsuereth
Contributor

@jsuereth jsuereth commented Oct 17, 2021

This is an implementation of open-telemetry/opentelemetry-specification#2013, and may guide changes in the overall design.

Sorry this is a bit larger than I'd like. I could tease out the Aggregator refactoring as a separate PR, but it's a bit odd without the motivation. Let me know if you want that; I'm also sending it out now for early review.

Goals

MetricExporter and MetricReader should be able to denote what AggregationTemporality they support and prefer. This aligns w/ the .NET implementation, and matches the above specification.

Oddities

  • Views allow specifying an aggregation temporality. The spec lacks specific details on whether views should "oust" exporter/reader config. For this PR, we store null on aggregators for "no temporality specified" and use that to denote when the reader-driven aggregation temporality wins. See TemporalityUtils for details.
  • We now always store metrics as if we need all kinds of temporality. This could be cleaned up in the future, but was expedient.
  • We still aren't appropriately dealing with non-monotonic instruments + Histograms.
  • Tracking MetricReader + Producer and "CollectionInfo" is all a bit ugly. Thoughts on refactoring appreciated.
  • AggregatorFactory was attempting to throw when an aggregator was configured improperly in the View API. This was broken by the new Aggregation public interface, so we moved the behavior forward.
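The precedence described above can be sketched as follows. This is a hedged, illustrative-only sketch: the enum and `TemporalityResolver` class are stand-ins I made up, not the actual `TemporalityUtils` API in this PR.

```java
import java.util.EnumSet;

// Stand-in for the SDK's AggregationTemporality enum.
enum AggregationTemporality { DELTA, CUMULATIVE }

// Hypothetical resolver illustrating the precedence described above:
// a view-configured temporality (if non-null) wins; otherwise the
// exporter/reader preference is used, falling back to any supported value.
final class TemporalityResolver {
  static AggregationTemporality resolve(
      AggregationTemporality viewConfigured,      // null = "no temporality specified"
      EnumSet<AggregationTemporality> supported,  // what the exporter can handle
      AggregationTemporality preferred) {         // exporter preference, may be null
    if (viewConfigured != null) {
      return viewConfigured;
    }
    if (preferred != null && supported.contains(preferred)) {
      return preferred;
    }
    return supported.iterator().next();
  }
}
```

The key point is only the null check: storing null for "unspecified" is what lets the reader-driven temporality win without any extra configuration state.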

Cleanups/Shifts

  • Reworked Aggregator API to simplify creating new ones.
    • Now has merge and diff vs. merge and isStateful. Behavior is now more accurate overall.
    • Resource, InstrumentationLibraryInfo and MetricDescriptor now stored outside the aggregator and passed in when needed.
    • Removed AggregatorFactory as Aggregation now serves that job.
  • MetricStorage updated to appropriately use diff vs. merge and account for aggregation temporality. Aggregator implementations mostly don't need to care, outside of buildMetric.
  • Gauge now sets a start time, as recommended by the metric specification.
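The merge-vs-diff split above can be illustrated with a toy sum accumulation. This is a sketch only, not the SDK's real Aggregator API: merge combines two accumulations (the cumulative path), while diff subtracts a prior cumulative reading from the current one to recover a delta.

```java
// Toy sum accumulation illustrating the merge/diff split described above;
// this is NOT the SDK's actual Aggregator interface.
final class SumAccumulation {
  final long value;
  SumAccumulation(long value) { this.value = value; }

  // merge: combine a previous accumulation with a new one (CUMULATIVE path).
  static SumAccumulation merge(SumAccumulation previous, SumAccumulation current) {
    return new SumAccumulation(previous.value + current.value);
  }

  // diff: subtract a previous cumulative reading from the current one (DELTA path).
  static SumAccumulation diff(SumAccumulation previous, SumAccumulation current) {
    return new SumAccumulation(current.value - previous.value);
  }
}
```

Splitting the two operations makes the temporality decision explicit in storage, instead of hiding it behind an isStateful flag on the aggregator.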

Remaining Tasks

  • Test metric storage against both a DELTA + CUMULATIVE exporter at the same time
  • Move XYZInPlace private methods to helper class
  • Wait for specification PR to merge

- Remove all state from Aggregator except for the exemplar reservoir factory.
  - Resource/IL passed from MP/MeterSharedState
  - MetricDescriptor already stored on MetricStorage.
- Give Aggregators explicit "diff" and "merge" operations, split Sum aggregator
- Update Storage to use explicit diff/merge in appropriate locations.
- Aggregators are now passed AggregationTemporality as a parameter
- Update MetricStorage to take the requested aggregation temporality as a parameter, always storing data for both.

Note: Things compile, and aggregator + storage tests pass, but e2e is broken until aggregation temporality is appropriately wired through to readers.
- Add supported/preferred temporality to export interfaces
- Update tracking of collection information through internal packages to pass
  one data class instead of many individual components.
- Update gauge tests to account for start time (as recommended in spec).
- Allow configured temporality to be null for views
- Propagate configured temporality from the View API through to storage
@codecov

codecov bot commented Oct 17, 2021

Codecov Report

Merging #3762 (b4152b8) into main (57888c5) will increase coverage by 0.15%.
The diff coverage is 90.52%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main    #3762      +/-   ##
============================================
+ Coverage     89.15%   89.30%   +0.15%     
- Complexity     3950     3957       +7     
============================================
  Files           476      473       -3     
  Lines         12324    12309      -15     
  Branches       1206     1207       +1     
============================================
+ Hits          10987    10993       +6     
+ Misses          924      909      -15     
+ Partials        413      407       -6     
Impacted Files Coverage Δ
...metry/exporter/prometheus/PrometheusCollector.java 50.00% <0.00%> (-6.25%) ⬇️
...testing/assertj/metrics/AbstractSumDataAssert.java 100.00% <ø> (ø)
...in/java/io/opentelemetry/sdk/metrics/SdkMeter.java 100.00% <ø> (ø)
...k/metrics/internal/aggregator/EmptyAggregator.java 30.00% <0.00%> (+5.00%) ⬆️
...sdk/metrics/internal/state/EmptyMetricStorage.java 45.45% <ø> (ø)
...ry/sdk/metrics/testing/InMemoryMetricExporter.java 88.23% <0.00%> (-11.77%) ⬇️
.../opentelemetry/sdk/metrics/view/NoAggregation.java 66.66% <ø> (+66.66%) ⬆️
...entelemetry/sdk/metrics/export/MetricExporter.java 50.00% <50.00%> (ø)
...lemetry/sdk/metrics/view/LastValueAggregation.java 66.66% <50.00%> (-8.34%) ⬇️
...telemetry/sdk/metrics/DefaultSdkMeterProvider.java 85.71% <75.00%> (+0.25%) ⬆️
... and 47 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

- Add cumulative+delta tests
- Fix error messages from sum data assert
This interface was fully replaced by Aggregation, and offered little benefit post-refactor.
Comment on lines 30 to 32
default AggregationTemporality getPreferedTemporality() {
return null;
}
Contributor

I would have thought that this method would take an instrument type as a parameter, since a given exporter might want this to vary by instrument.

Contributor Author

Interesting. Do you have an example of this (A backend/exporter that needs this level of sophistication)?

Contributor

I thought Prometheus wanted cumulative for everything but histograms, but I might very well be making that up.

Contributor Author

No, they need cumulative for everything. (Writing that spec now)

Contributor Author

You may be thinking of Summary? However in that case it's not DELTA, it's actually "last 10 minutes", which could be longer than the DELTA export interval.

Contributor

ah, yeah, Summary is probably what I'm thinking of. But, I could imagine an exporter might want to impose a full view or set of views, in order to limit cardinality, etc. What would you think of handing the exporter some sort of Builder thingee that it could use to fully customize the output that is needed for a particular backend?

Contributor Author

I'm on the fence.

I think for the first version, I wish we had decided to go with ALL delta or ALL cumulative and let OTLP pick one or the other via config. Now we have an expectation of flexibility, and I'm not sure where it truly belongs.

My thinking is that I expect the VIEW SDK to eventually have a "consistent across SDK" configuration file, something like:

views:
  select:
    instrumentType: counter
    instrumentationLibrary: grpc
    name: request.bytes
  aggregation:
    type: Histogram
  attributes:
    drop: [route]

I'm not so certain we'll be exposing that kind of configuration at a per-exporter level. While I think the choice of aggregation temporality really is an exporter decision, I'm torn because I think a consistent View configuration file is more likely. (And what I suggest above is a really ugly version of what I think we'll eventually have, something halfway between prometheus-rewrite rules and otel-collector processor config.)

Member

@jack-berg jack-berg left a comment

I understand the motivation for this, since right now you could configure aggregations via the view API that are incompatible with the prometheus exporter.

But I don't quite understand why both the view API and the exporters get a say in the aggregation. Should aggregation temporality configuration be removed from the view API?

@jsuereth
Contributor Author

@jack-berg

I understand the motivation for this, since right now you could configure aggregations via the view API that are incompatible with the prometheus exporter.

Yes, not a huge fan of this right now, but I think one place needs to win in the event of conflict, and right now that's the View API.

But I don't quite understand why both the view API and the exporters get a say in the aggregation. Should aggregation temporality configuration be removed from the view API?

I think (as you see in this PR) we should do our best to avoid having anyone force an aggregation temporality. This is a step towards trying to convince people to never specify it in the View API. If @jkwatson's suggestion of having metric-level configuration for aggregation temporality is taken, then absolutely, I'd remove it as a parameter to view aggregation.

I don't think we have quite enough data behind mixed cumulative/delta use cases to remove all fine grained controls, but it's something we should consider.

Again, if you look at the goals listed in this PR: primarily, I want to stop 99% of usage of aggregation temporality in the Views API, because exporter-based configuration should handle all of our known use cases.

@jsuereth
Contributor Author

Also, @jkwatson and @jack-berg: it would be good to get your thoughts on this PR, open-telemetry/opentelemetry-specification#2032

Contributor

@anuraaga anuraaga left a comment

Generally makes sense to me, though it would indeed be great if the knob could then be removed from the Views, unless there is user demand for it

/** Returns the preferred temporality for metrics. */
@Nullable
default AggregationTemporality getPreferredTemporality() {
return null;
Contributor

AggregatedTemporality.CUMULATIVE? Just kidding (mostly) :-)

Contributor Author

ha!

@@ -17,6 +20,17 @@
*/
public interface MetricExporter {

/** Returns the set of all supported temporalities for this exporter. */
default EnumSet<AggregationTemporality> getSupportedTemporality() {
Contributor

Is the purpose of this to fail if someone configures a view for an exporter that doesn't support the specified temporality?

Contributor Author

Theoretically.

However, the Metrics SIG is removing AggregationTemporality from the View API, so the purpose here is that an exporter can "handle everything" by default.
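A sketch of that default behavior. The interface and enum below are simplified stand-ins for the PR's MetricExporter API, not the exact SDK signatures: the defaults make an exporter support everything and prefer nothing, and a Prometheus-style exporter narrows both to CUMULATIVE.

```java
import java.util.EnumSet;

// Stand-in for the SDK's AggregationTemporality enum.
enum AggregationTemporality { DELTA, CUMULATIVE }

// Simplified stand-in for the exporter interface in this PR:
// by default an exporter "handles everything" and prefers nothing.
interface MetricExporter {
  default EnumSet<AggregationTemporality> getSupportedTemporality() {
    return EnumSet.allOf(AggregationTemporality.class);
  }
  default AggregationTemporality getPreferredTemporality() {
    return null; // null = no preference; the SDK picks.
  }
}

// A Prometheus-style exporter would narrow both to CUMULATIVE.
class CumulativeOnlyExporter implements MetricExporter {
  @Override
  public EnumSet<AggregationTemporality> getSupportedTemporality() {
    return EnumSet.of(AggregationTemporality.CUMULATIVE);
  }
  @Override
  public AggregationTemporality getPreferredTemporality() {
    return AggregationTemporality.CUMULATIVE;
  }
}
```

With defaults like these, existing exporters keep working unchanged, and only exporters with real backend constraints need to override anything.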

Comment on lines +13 to 15
AbstractSumAggregator(InstrumentDescriptor instrumentDescriptor) {
this.isMonotonic = MetricDataUtils.isMonotonicInstrument(instrumentDescriptor);
}
Contributor

This seems, on its face, to be a refactoring unrelated to the title of the PR. Am I reading this wrong? It would be easier to understand these PRs if they were a bit smaller and more focused. :)

Contributor Author

Ah, I forgot to remove the AbstractSumAggregator. I hear you about focused refactorings; I promise to do a better job going forward. If you want, I can try to pull the old code back in here and do the bug-fixing to histograms around monotonic instruments along with the fix to this.

super(
resource, instrumentationLibraryInfo, instrumentDescriptor, metricDescriptor, temporality);
/**
* Constructs a histogram aggregator.
Contributor

copy pasta?

Contributor Author

ugh, yeah good catch!

Comment on lines +66 to +69
if (temporality == AggregationTemporality.DELTA && !isSynchronous) {
MetricStorageUtils.diffInPlace(last.getAccumlation(), currentAccumulation, aggregator);
result = last.getAccumlation();
} else if (temporality == AggregationTemporality.CUMULATIVE && isSynchronous) {
Contributor

there are definitely a few combinations missing here. If that's intentional, please add a comment as to why the others aren't needed.

Contributor Author

It is, and will do.

Contributor Author

Added a lot of docs (here and below) so PTAL.


@Override
public AggregationTemporality getPreferredTemporality() {
return AggregationTemporality.CUMULATIVE;
Contributor

any particular reason for this choice?

Contributor Author

Not a strong reason; it would pick this option even if we didn't prefer it. This is here to help reinforce that for folks attempting to use this for debugging.

Contributor

@jkwatson jkwatson left a comment

Thanks!

@jkwatson
Contributor

@jack-berg Do you have anything left that needs to be resolved here?

@jack-berg
Member

👍 🚢

@jkwatson jkwatson merged commit f6754b6 into open-telemetry:main Oct 26, 2021
@jsuereth jsuereth deleted the wip-exporters-have-aggregation branch October 28, 2021 01:12
This was referenced Dec 19, 2021