
Allow to unregister/stop/destroy instruments #2232

Open
mateuszrzeszutek opened this issue Dec 20, 2021 · 29 comments
Labels
spec:metrics Related to the specification/metrics directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor

Comments

@mateuszrzeszutek
Member

What are you trying to achieve?

I'm currently working on the micrometer->OTel bridge instrumentation in the javaagent. Micrometer offers the possibility to remove a meter from the MeterRegistry and stop emitting whatever metrics it used to collect. For example, suppose you use a database connection pool that's instrumented with metrics: when you close/destroy the whole pool, you probably want to stop collecting any metrics associated with it (because it doesn't exist anymore). This is useful for both asynchronous instruments (since once they're registered there's no way to stop them) and synchronous instruments (you can just stop using them, but the metrics SDK will still send in the last recorded value).
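To make the gap concrete, here is a minimal plain-Java sketch, with no OTel dependency; `MiniMeterRegistry`, `ConnectionPool`, and `LeakDemo` are hypothetical stand-ins, not real types. It models an API that only supports registering asynchronous callbacks, with no way to unregister them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleSupplier;

class MiniMeterRegistry {
    private final List<DoubleSupplier> callbacks = new ArrayList<>();

    void registerGauge(DoubleSupplier callback) {
        callbacks.add(callback); // no way to remove: the gap this issue is about
    }

    List<Double> collect() {
        List<Double> out = new ArrayList<>();
        for (DoubleSupplier cb : callbacks) out.add(cb.getAsDouble());
        return out;
    }
}

class ConnectionPool {
    int activeConnections = 3;
}

public class LeakDemo {
    public static void main(String[] args) {
        MiniMeterRegistry registry = new MiniMeterRegistry();
        ConnectionPool pool = new ConnectionPool();
        registry.registerGauge(() -> pool.activeConnections);
        // The application closes/destroys the pool here...
        // ...but every later collection still observes it:
        System.out.println(registry.collect());
    }
}
```

Nothing in this toy registry lets the application say "this gauge is gone", which is exactly the missing verb being asked for.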

Additional context.

Micrometer bridge PR: open-telemetry/opentelemetry-java-instrumentation#4919

CC @jsuereth

@tigrannajaryan
Member

(Following the issue-triaging process described in https://github.com/open-telemetry/opentelemetry-specification/blob/main/CONTRIBUTING.md#issue-triaging.)

@jsuereth I am reassigning this to you, I believe you know the context of this better.

@bogdandrutu
Member

@mateuszrzeszutek micrometer is an API + some implementation(s). I am not sure why that should translate into calls to our API instead of offering an implementation of micrometer that implements the MetricProducer; see https://github.com/open-telemetry/opentelemetry-java/blob/44dccd6c33c7c4cb58e1cc63f0db02b3bfec1804/opencensus-shim/src/main/java/io/opentelemetry/opencensusshim/metrics/OpenCensusMetricProducer.java#L24

/cc @jsuereth

@anuraaga
Contributor

anuraaga commented Jan 4, 2022

@bogdandrutu Micrometer was just one example but do you think apps should not have a way to stop reporting metrics? Dropwizard metrics in Java also has registry.remove and it seems idiomatic for a library that allows registering a stateful object to always allow unregistering it. While the whole MeterProvider could be closed, metric definitions are generally completely separate from the MeterProvider lifecycle and that probably doesn't solve that case.

@mateuszrzeszutek
Member Author

There are advantages to calling OTel API in micrometer bridge instrumentation:

  • Micrometer instruments like Timer and DistributionSummary can be translated to native OTel histograms; micrometer does not have a "histogram" meter by itself, it basically exports each bucket as a separate gauge.
  • Due to the API/SDK split in the javaagent using a MetricProducer inside an instrumentation would be very painful; the SDK classes are not directly accessible in instrumentation code, they're hidden away inside the agent classloader -- we would have to bridge MetricProducer and all related classes (and do lots of copying) to make that work at all.

Also, micrometer aside, the unregister/remove functionality is still very much needed. The database connection pool example from my first post here is still valid.

@anuraaga
Contributor

anuraaga commented Jan 4, 2022

Just another note about Micrometer, I expect it to be used for a very long time by Spring and likely Spring users too as a result. For example, for tracing they will use Micrometer Tracing, not the OpenTelemetry tracing API. I think the reality of the Java ecosystem means we should consider Micrometer as a first class API for Java metrics, and this seems OK to me in practice. As such, I'd like Micrometer usage to benefit from features like exemplars / baggage which means going through our API (or well our SDK implementations directly maybe, but not MetricProducer which is too late to add cross-cutting concern features to the micrometer metrics).

@bogdandrutu
Member

@mateuszrzeszutek

Micrometer instruments like Timer and DistributionSummary can be translated to native OTel histograms; micrometer does not have a "histogram" meter by itself, it basically exports each bucket as a separate gauge.

It is just a representation difference, it is still a Histogram, I don't see the point.

Due to the API/SDK split in the javaagent using a MetricProducer inside an instrumentation would be very painful; the SDK classes are not directly accessible in instrumentation code, they're hidden away inside the agent classloader -- we would have to bridge MetricProducer and all related classes (and do lots of copying) to make that work at all.

The micrometer instrumentation is not really an instrumentation; it is actually an SDK/Producer (whatever you call it), since micrometer is an API + impl that produces metrics. I'm not sure any metrics library should be seen as "instrumentation", but rather as a producer of telemetry.

Just another note about Micrometer, I expect it to be used for a very long time by Spring and likely Spring users too as a result. For example, for tracing they will use Micrometer Tracing, not the OpenTelemetry tracing API. I think the reality of the Java ecosystem means we should consider Micrometer as a first class API for Java metrics, and this seems OK to me in practice.

I am not saying to not have an integration (not instrumentation) with micrometer.

As such, I'd like Micrometer usage to benefit from features like exemplars / baggage which means going through our API (or well our SDK implementations directly maybe, but not MetricProducer which is too late to add cross-cutting concern features to the micrometer metrics).

It is very hard to do that since their API is not well designed for that (or was not designed with that in mind).

FYI: We don't have enough tracing APIs let's create another one :)))

@mateuszrzeszutek
Member Author

Micrometer instruments like Timer and DistributionSummary can be translated to native OTel histograms; micrometer does not have a "histogram" meter by itself, it basically exports each bucket as a separate gauge.

It is just a representation difference, it is still a Histogram, I don't see the point.

Well that's true, conceptually it's the same. It results in different MetricData being exported though.

As such, I'd like Micrometer usage to benefit from features like exemplars / baggage which means going through our API (or well our SDK implementations directly maybe, but not MetricProducer which is too late to add cross-cutting concern features to the micrometer metrics).

It is very hard to do that since their API is not well designed for that (or was not designed with that in mind).

Doesn't that sort of work by default? If you're using a Counter and call counter.add(1) it'll capture Context.current() even if you don't pass the context; so if you use a bridged micrometer synchronous instrument it should get the exemplar from the current context even though micrometer API does not explicitly support this.
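A rough illustration of why the implicit capture works, using plain-Java stand-ins (`CurrentContext` and `BridgedCounter` are hypothetical, not OTel classes): the synchronous add() path can read a thread-local current context even though the caller never passes one.

```java
// The micrometer-facing API has no context parameter, but a bridge
// implemented on top of a context-aware API still sees whatever
// context is current on the calling thread.
final class CurrentContext {
    static final ThreadLocal<String> CURRENT = ThreadLocal.withInitial(() -> "root");
}

class BridgedCounter {
    String lastObservedContext; // stand-in for where an exemplar would attach

    void add(long value) {
        // Caller never passes a context; the bridge reads the thread-local one.
        lastObservedContext = CurrentContext.CURRENT.get();
    }
}
```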

@jmacd jmacd added this to the Metrics API/SDK Stable Release milestone Jan 25, 2022
@jmacd
Contributor

jmacd commented Jan 25, 2022

I would like to re-phrase this feature request: Can we allow deleting or unregistering Callbacks? For discussion in the Tuesday Jan 25 8AM PT Spec SIG.

@jmacd
Contributor

jmacd commented Jan 25, 2022

There is a connected topic discussed in https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example-attribute-removal-in-a-view

The question is how, if at all, should an SDK know to stop reporting a metric?

@anuraaga
Contributor

Just wanted to add an example: it's fairly easy to introduce memory leaks into apps without the ability to unregister callbacks. For example, if there is a connection pool that is being decommissioned, possibly due to a database resharding, then it should be able to be garbage collected. But with any natural registration of an async instrument (i.e., without the complication of weak references), and without the ability to remove the callback from OpenTelemetry, this pool can never be collected.
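One common shape for the fix is to have registration return a handle whose close() removes the callback. A minimal plain-Java sketch (`RemovableRegistry` is hypothetical, not an OTel type):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleSupplier;

class RemovableRegistry {
    private final List<DoubleSupplier> callbacks = new ArrayList<>();

    // Registration hands back an AutoCloseable; close() drops the only
    // strong reference the registry holds, so the pool can be GC'd.
    AutoCloseable registerGauge(DoubleSupplier callback) {
        callbacks.add(callback);
        return () -> callbacks.remove(callback);
    }

    int callbackCount() {
        return callbacks.size();
    }
}

public class UnregisterDemo {
    public static void main(String[] args) throws Exception {
        RemovableRegistry registry = new RemovableRegistry();
        int[] active = {5};
        AutoCloseable handle = registry.registerGauge(() -> active[0]);
        // Pool decommissioned: closing the handle releases the callback.
        handle.close();
        System.out.println(registry.callbackCount());
    }
}
```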

@jsuereth
Contributor

To back up @anuraaga's comment: while it's not as popular, old-school Java EE servers (like Tomcat) or anything that uses hot/dynamic loading of plugins need the ability to clean up their memory and usage so Java can evict their bytecode and RAM.

For @bogdandrutu I want to directly address the notion that Micrometer should use MetricProducer as is done for OpenCensus.

The only reason we use the MetricProducer bridge for OpenCensus is that the API is so divergent from OpenTelemetry that we are unable to do a direct API bridge. If we could have done a direct API bridge, we'd have preferred that. However, because the OpenCensus metric model is close enough to OTel's, we're not losing too much with the MetricProducer bridge.

I think @mateuszrzeszutek raises some great points around Metric bridges, and I want to call out a few of them:

  1. We should prefer adapting existing metric instrumentation rather than trying to replace it. Micrometer is a decent API. We're not in competition, and we can be complementary. We have no reason to force users to re-instrument if they're happy.
  2. When we interface with existing Metric APIs we should enhance. There needs to be a reason users want OTEL, and in my opinion we (will) bring a few things to the table:
  • Resource-based telemetry correlation
  • Out of the box exemplars
  • Exponential Histograms (or just histograms for those still doing gauges of percentiles / summary)

An API <-> API bridge provides the easiest path forward for instrumentation authors to accomplish that goal.

@bogdandrutu
Member

@jsuereth Nice to hear all these great things, but I want to also ensure that we are not adding unnecessary complexity to our APIs to support this, hence the "very divergent" part that you mention. The proposal essentially says that if this API has a feature, even a badly designed one, we should support it because we prefer this type of integration.

We should prefer adapting existing metric instrumentation rather than trying to replace it. Micrometer is a decent API. We're not in competition, and we can be complementary. We have no reason to force users to re-instrument if they're happy.

A "MetricProducer bridge" will give that to the micrometer users.

When we interface with existing Metric APIs we should enhance. There needs to be a reason users want OTEL, and in my opinion we (will) bring a few things to the table:

Resource-based telemetry correlation

A "MetricProducer bridge" will give that to the micrometer users. Resource can still be attached to the bridge instance.

Out of the box exemplars
Exponential Histograms (or just histograms for those still doing gauges of percentiles / summary)

Not 100% convinced that we cannot offer this, but I think we can with a bit of work.

@jsuereth it is very important to distinguish between an "instrumentation user-friendly API", which is designed to be used in application instrumentation, vs. a "sink" API, which is what you are building when wrapping this API. I don't want us to become a "sink" API; if that is the goal, then let's design a sink API rather than try to bend the current API, designed for users to instrument applications, to fit that goal.

Another argument is that micrometer is not an API :) sorry but it has way too much in that artifact to be considered an API.

@bogdandrutu
Member

@jsuereth also, I do agree that there are good reasons to support unregister/close/etc. for an instrument, but I don't think the reason should be that another API has it.

@bogdandrutu
Member

@jsuereth @mateuszrzeszutek is #2317 resolving this? I don't have that feeling...

@jmacd
Contributor

jmacd commented Feb 16, 2022

To clarify my intention, I believe #2317 narrowly solves this problem stating the SDK SHOULD give applications a way to create callbacks that support being unregistered, which allows a user to stop certain asynchronous measurements from reporting without shutting down an entire MeterProvider.

The Prometheus ecosystem uses a Delete() verb to signal an intention to erase a metric from an exporter. None of the verbs Stop(), Close() or Destroy() signal the same intention to me, and in the presence of duplicate instrument registration support (i.e., #2317) I would expect stop/close/destroy to affect only the single instance they were called on. I would not expect one instance's stop/close/destroy to affect other instances, and I would not expect any of these to delete the streams reported from a specific instrument instance. I would not want instrument "instances" to matter to the SDK, semantically speaking.

As the description states, users can simply stop using synchronous instruments if they want to stop reporting measurements; this doesn't seem like a problem we need to help the user with, but if we did, I would suggest that stop/close/destroy simply invalidates the instrument instance so that it can no longer be used to capture measurements.

I suggest we file a separate issue or issues to discuss the following points:

Need for Delete() verb?

The Delete() verb for instruments, and whether it has a meaningful definition in an API that supports duplicate instrument registration. In Prometheus, Delete() is used with specific instruments and labels, i.e. it doesn't delete an entire Metric, it deletes a single stream. I think of this "deletion" as an exporter preference, in any case. Prometheus expects that exporters remember every current/cumulative value they've ever exported, except those for which Delete() was used.

Need for exporter memory preferences?

In a push-based exporter, whether using cumulative or delta temporality, there's a question of whether to push a value that has not changed. Since we expect cumulative data to be pushed into a Prometheus ecosystem, it seems wise to default to pushing all values, even unchanged ones.

In an exporter with a preference for deltas, it's typical to avoid reporting any delta where the delta (sum) or count (histogram) are zero. For synchronous instruments, this is something the SDK can facilitate by detecting stale streams and, after a while, forgetting them. SHOULD the SDK be required to support forgetting streams? For asynchronous instruments, if the user simply stops reporting values they will stop reporting, unless we require the SDK to detect staleness. Should the Prometheus exporter be responsible for this on its own, or should the SDK facilitate an exporter preference for "synchronous instrument memory"? This topic is connected with #2132 because if the SDK is required to detect staleness, then no additional memory is required to perform cumulative-to-delta translation.
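The "forget stale streams" idea for delta temporality can be sketched in a few lines of plain Java (`DeltaCounterStorage` is a hypothetical stand-in, not an SDK type): each collection resets the storage and suppresses zero deltas, so an attribute set that is never recorded again simply stops appearing in the export.

```java
import java.util.HashMap;
import java.util.Map;

class DeltaCounterStorage {
    // attribute set (as a string key, for simplicity) -> accumulated delta
    private final Map<String, Long> deltas = new HashMap<>();

    void add(String attributes, long value) {
        deltas.merge(attributes, value, Long::sum);
    }

    // Returns the non-zero deltas since the last collection, then forgets
    // everything; streams that are still active reappear on the next add().
    Map<String, Long> collect() {
        Map<String, Long> out = new HashMap<>(deltas);
        out.values().removeIf(v -> v == 0L);
        deltas.clear();
        return out;
    }
}
```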

@mateuszrzeszutek please signal whether the option to unregister callbacks, coupled with memory options for exporters, is enough, or whether you are looking for something like Prometheus' Delete().

@alxbl
Member

alxbl commented Oct 11, 2023

I'd like to add to this issue in favor of having a cleanup mechanism in the spec:

We perform monitoring of IoT devices which can be added/removed from our systems at any time by operators and would also benefit from being able to stop reporting metrics when those IoT devices are being decommissioned.

Currently, the .NET implementation of asynchronous counter/gauges supports removing metric points from a specific metric, but this causes the last value emitted for that metric point to be re-exported ad-infinitum (or just become a stale entry in the metric point list when using delta export mode) by the OTEL SDK implementation.

Another thing that would be interesting to specify is stale markers when a time series is known to be "finished".

e.g. Prometheus Remote Write defines stale markers that tell Prometheus a time series is known to be finished, so it is immediately marked stale and no longer shows up. I think it would be nice to have a specified way to do something similar via OTLP.

I would be willing to spend some time working on this if there is interest.

Thanks for the hard work on OTEL, by the way!

EDIT: I had the wrong link to stale markers in Prometheus

lubosmj added a commit to lubosmj/pulpcore that referenced this issue Feb 19, 2024
At the moment, it is not possible to destroy instruments that send
metrics. Therefore, when a user removes a domain, there might be still
the metrics about it emitted. A temporary workaround is to restart the
pulpcore-api process to reload meters.

Ref: open-telemetry/opentelemetry-specification#2232

closes pulp#4603
lubosmj added a commit to pulp/pulpcore that referenced this issue Feb 20, 2024
At the moment, it is not possible to destroy instruments that send
metrics. Therefore, when a user removes a domain, there might be still
the metrics about it emitted. A temporary workaround is to restart the
pulpcore-api process to reload meters.

Ref: open-telemetry/opentelemetry-specification#2232

closes #4603
@mtwo mtwo added triage:deciding:tc-inbox Needs attention from the TC in order to move forward triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Jul 9, 2024
@mtwo
Member

mtwo commented Jul 9, 2024

We're elevating this and we have closed #3985 in favour of this issue

@grcevski

grcevski commented Jan 9, 2025

Is there anything I can do to help with moving the extension of the spec forward? Is there perhaps a PR open with proposed changes to the spec for allowing removal?

@carlosalberto
Contributor

Hey,

I'm interested in this and would like to make progress, but first would like to understand some details.

Overall I see this as two parts:

  1. Be able to invalidate instruments in the SDK (explicitly deleting associated callbacks & disable instruments, or implicitly detecting stale streams).
    • Deleting associated callbacks is already supported in a few languages.
  2. Be able to stop exporting metric points for deleted instruments/streams (including the cumulative scenario). This could either become full exporter responsibility, or have the SDK provide some help/collaboration.
    • May need to set staleness markers to let backends know a stream is done.

Based on the feedback, I think the important part is focusing on 2), as what we have for 1) at this time seems enough.

Opinions?

cc @jmacd

@pellared
Member

Re 1.: it is probably not only about the SDK, but also about the Metrics API, right?

Reference #2232 (comment)

@dashpole
Contributor

dashpole commented Jan 21, 2025

  1. Be able to invalidate instruments in the SDK (explicitly deleting associated callbacks & disable instruments, or implicitly detecting stale streams).

Asynchronous instruments can already be unregistered:

Where the API supports registration of `callback` functions after
asynchronous instrumentation creation, the user MUST be able to undo
registration of the specific callback after its registration by some means.

Asynchronous instruments can also already manage cardinality within the callback, since they can return only Measurements with the attributes they want to keep.

Synchronous instruments don't have a way to unregister the instrument as a whole, and also don't have a way to delete a single attribute set.

  2. Be able to stop exporting metric points for deleted instruments/streams (including the cumulative scenario). This could either become full exporter responsibility, or have the SDK provide some help/collaboration.

Being able to stop exporting metric points for deleted asynchronous instruments is already a recommendation:

The implementation SHOULD NOT produce aggregated metric data for a
previously-observed attribute set which is not observed during a successful
callback. See [MetricReader](#metricreader) for more details on the persistence
of metrics across successive collections.

Presumably, if (1) is implemented for synchronous instruments, the SDK would have a similar requirement around not producing data for deleted attribute sets.

There are two orthogonal issues that are related to this, but IMO can be solved separately:

  • Cumulative data points should set StartTimeUnixNano per timeseries #4184
    • Start timestamps today are mostly set to the start time of the SDK (or creation time of the instrument), rather than when the first point with a given attribute set is observed. This becomes an issue if an attribute set is deleted, but then reappears later. The new "reappeared" series should probably not have the same start time as the original, since it is a "new" attribute set, but that isn't the case today.
  • Specify how to stop and restart metrics reporting #2711
    • Staleness markers for pushed metrics (Prometheus pull staleness already works with OTel SDKs today). This could help with showing exactly when a series disappeared, and can help show when a series "intentionally" disappears (via deletion/unregistration), rather than "unintentionally" disappears (e.g. export failure).

Overall, I would recommend starting by trying to define a way to delete an attribute set from a synchronous instrument.
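As a starting point for that discussion, here is a plain-Java sketch of what a Prometheus-style delete could look like on a synchronous instrument (`DeletableCounter` and its `remove()` are hypothetical; no such API exists in OTel today):

```java
import java.util.HashMap;
import java.util.Map;

class DeletableCounter {
    // attribute set (string key, for simplicity) -> cumulative sum
    private final Map<String, Long> cumulative = new HashMap<>();

    void add(String attributes, long value) {
        cumulative.merge(attributes, value, Long::sum);
    }

    // Prometheus-style Delete(): drop a single stream (one attribute set)
    // so it is no longer exported, without touching the other streams.
    void remove(String attributes) {
        cumulative.remove(attributes);
    }

    Map<String, Long> collect() {
        return new HashMap<>(cumulative);
    }
}
```

Note that this deletes one stream rather than the whole instrument, matching the Prometheus semantics discussed earlier in the thread.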

@jmacd
Contributor

jmacd commented Jan 21, 2025

@carlosalberto the way I understand this issue, we can see this as an API, an SDK, or a data model problem.

The API problem, maybe, is that there's no equivalent of Prometheus Delete(), which is an action meaning to "delete a single attribute set" from residence in the SDK.

The SDK problem, maybe, is that we have SDKs that can be configured with delta temporality but do not automatically drop memory for unused attribute sets. This is an optimization demonstrated in https://github.com/lightstep/otel-launcher-go/blob/main/lightstep/sdk/metric/README.md; however, I'm not sure which or how many OTel SDKs perform this way. @alxbl wrote about this, and I would say OTel could add automatic forgetting of unused attribute sets to its SDK specification for delta temporality, but this will not address the API-level question faced for cumulative measurements.

Currently, the .NET implementation of asynchronous counter/gauges supports removing metric points from a specific metric, but this causes the last value emitted for that metric point to be re-exported ad infinitum (or just become a stale entry in the metric point list when using delta export mode) by the OTEL SDK implementation.

The data model problem is linked in @dashpole's reply above.

@dashpole
Contributor

Currently, the .NET implementation of asynchronous counter/gauges supports removing metric points from a specific metric, but this causes the last value emitted for that metric point to be re-exported ad infinitum (or just become a stale entry in the metric point list when using delta export mode) by the OTEL SDK implementation.

This seems to go against this:

The implementation SHOULD NOT produce aggregated metric data for a
previously-observed attribute set which is not observed during a successful
callback. See [MetricReader](#metricreader) for more details on the persistence
of metrics across successive collections.

If that is accurate, I would open an issue in https://github.com/open-telemetry/opentelemetry-dotnet.

@alxbl
Member

alxbl commented Feb 4, 2025

I was away from work for a while; I'm back now. I'll review the thread and open the issue in the .NET implementation. First, though, I want to validate that the behavior is still like that in the latest release (my comment was written back in 1.4.0).

EDIT: I am able to repro on the 1.11.1 build of the SDK and there's already an issue open in the dotnet SDK, I've referred back to here.

Projects
Status: Spec - Accepted