opentelemetry: support publishing metrics #2185
Conversation
Currently, there is no way to publish metrics via tracing. Update the tracing-opentelemetry crate to publish metrics for event fields that contain specific prefixes in the name. Right now, we lazily instantiate and store one metrics object per-callsite, but a future improvement that we should add to tracing itself is the ability to store data per-callsite, so that we don't have to do a HashMap lookup each time we want to publish a metric.
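To sketch the shape of that approach (illustrative names only, not the PR's actual code): a `tracing` field visitor inspects each event field's name for a metric prefix and forwards the value to a lazily populated instrument store. The prefix spelling shown here is the one originally proposed; the convention is revisited later in the thread.

```rust
use std::fmt;
use tracing::field::{Field, Visit};

// Illustrative stand-in for the per-callsite instrument store described above;
// the real code keeps lazily created OpenTelemetry instruments in a map.
struct Instruments;

impl Instruments {
    fn record_monotonic_counter(&self, _name: &'static str, _value: u64) {
        // look up (or lazily create) the counter for this name, then add `_value`
    }
}

// A field visitor that treats specially prefixed event fields as metrics.
struct MetricVisitor<'a> {
    instruments: &'a Instruments,
}

impl<'a> Visit for MetricVisitor<'a> {
    fn record_u64(&mut self, field: &Field, value: u64) {
        // Field names come from callsite metadata, so they are `&'static str`.
        if field.name().starts_with("MONOTONIC_COUNTER_") {
            self.instruments.record_monotonic_counter(field.name(), value);
        }
        // ...similar arms handle the other recognized prefixes
    }

    fn record_debug(&mut self, _field: &Field, _value: &dyn fmt::Debug) {
        // fields without a metric prefix are ignored by this layer
    }
}
```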
Great to get started thinking about how otel-metrics fits into tracing. As the metrics impl is still in progress, you may have an easier time maintaining a separate subscriber/layer that you can update without worrying as much about breaking changes / stability guarantees. E.g. open-telemetry/opentelemetry-rust#819 is a fairly substantial change to the otel metrics API, and configuring metric views / multiple metric exporters is still to be built out. Another option could be a feature flag that signals the instability / warns about using it currently, but it may be more trouble than it's worth until the metrics side stabilizes a bit more.
This patch moves all the metrics-related code out to its own layer, and also adds integration tests. Unfortunately, the assertions don't currently fail the tests, because the panics are captured by the closure. I tried using an `Arc<Mutex<_>>` to store the metrics data and assert from the test functions themselves, but for some reason the closure runs after the assertions in that case, and the tests fail because of that race condition. That is, the closure which should update the `Arc<Mutex<_>>` does so too late in the execution of the test.
Thanks @jtescher for the comment. I thought about what you said and I liked the suggestion of putting this code in a separate layer, so I pushed a new commit that moves it out to its own layer. Tomorrow I will write some docs, and then it should be in a mergeable state.
I ran into an issue with my integration tests where, if I put the assertions into the test function, the assertions seem to run before the closure that records the metrics gets a chance to run.
This patch adds documentation showing users how to use the new subscriber, and also handles the case where we receive an `i64` for a MonotonicCounter.
This morning I tried to improve the tests again, but ran into the same issue as before. Other than that, I think this is ready to merge (if the CI passes) @carllerche
This patch just fixes lints that show up during the PR's CI, and adds a couple more unit tests to make sure that `u64` and `i64` are handled properly by the Subscriber.
Rustdocs now show how to instantiate and register an `OpenTelemetryMetricsSubscriber` so that it can be used.
This patch reduces the number of allocations in the `MetricVisitor` by making use of the fact that the metadata for each callsite has a static name. This means that we don't need to convert to a `String` and can instead directly use the `&'static str` in the `HashMap`s.
Got some good feedback from @hlbarber (thanks!), which I've now incorporated into the PR.
This commit makes some changes that should significantly reduce the performance overhead of the OpenTelemetry metrics layer. In particular:

* The current code will allocate a `String` with the metric name _every_ time a metric value is recorded, even if this value already exists. This is in order to use the `HashMap::entry` API. However, the performance cost of allocating a `String` and copying the metric name's bytes into that string is almost certainly worse than performing the hashmap lookup a second time, and that overhead occurs *every* time a metric is recorded. This commit changes the code for recording metrics to perform hashmap lookups by reference. This way, in the common case (where the metric already exists), we won't allocate. The allocation only occurs when a new metric is added to the map, which is infrequent.

* The current code uses a `RwLock` to protect the map of metrics. However, because the current code uses the `HashMap::entry` API, *every* metric update must acquire a write lock, since it may insert a new metric. This essentially reduces the `RwLock` to a `Mutex`: since every time a value is recorded, we must acquire a write lock, we are forcing global synchronization on every update, the way a `Mutex` would. However, an OpenTelemetry metric can have its value updated through an `&self` reference (presumably metrics are represented as atomic values?). This means that the write lock is not necessary when a metric has already been recorded once, and multiple metric updates can occur without causing all threads to synchronize. This commit changes the code for updating metrics so that the read lock is acquired when attempting to look up a metric. If that metric exists, it is updated through the read lock, allowing multiple metrics to be updated without blocking every thread. In the less common case where a new metric is added, the write lock is acquired to update the hashmap.

* Currently, a *single* `RwLock` guards the *entire* `Instruments` structure. This is unfortunate. Any given metric event will only touch one of the hashmaps for different metric types, so two distinct types of metric *should* be able to be updated at the same time. The big lock prevents this, as a global write lock is acquired that prevents *any* type of metric from being updated. This commit changes the code to use more granular locks, with one around each metric type's `HashMap`. This way, updating (for example) an `f64` counter does not prevent a `u64` value recorder from being updated at the same time, even when both metrics are inserted into the map for the first time.
i noticed some potential performance issues with the current implementation that seem fairly significant. i've opened a PR against this branch (bryangarza#1) that rewrites some of this code to reduce the overhead of string allocation and global locking.
opentelemetry: remove per-update allocation & global lock
I cancelled the 5 CI tests that were running because it seems that they are now hanging. Tried this out locally as well, but I need to dig into it further.
This patch makes the read lock go out of scope before we block on obtaining the write lock. Without this change, trying to acquire the write lock hangs forever.
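Putting the last few commits together, the locking scheme looks roughly like this; a sketch under the assumptions described above (OpenTelemetry instruments record values through `&self`), with an illustrative helper name rather than the PR's exact code:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// One map (and one lock) per instrument type, so updating e.g. a u64 counter
// never contends with an f64 value recorder being inserted.
type MetricsMap<T> = RwLock<HashMap<&'static str, T>>;

// Update a metric if it already exists (read lock only); otherwise insert it
// under the write lock.
fn update_or_insert<T>(
    map: &MetricsMap<T>,
    name: &'static str,
    update: impl Fn(&T),
    insert: impl FnOnce() -> T,
) {
    {
        let read = map.read().unwrap();
        if let Some(metric) = read.get(name) {
            // Fast path: record through the shared reference, no write lock needed.
            update(metric);
            return;
        }
    } // the read guard is dropped here, *before* we block on the write lock

    // Slow path: first time this metric is seen at this callsite.
    let mut write = map.write().unwrap();
    let metric = write.entry(name).or_insert_with(insert);
    update(metric);
}
```

In the real layer, the `update` and `insert` closures would call the corresponding OpenTelemetry instrument methods (e.g. adding to a counter, or building a new instrument from the `Meter`).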
This patch updates the `MetricsMap` type to use static strings, which reduces the number of allocations needed for metrics processing.
LGTM, thanks
I have basically one remaining concern about this PR, which is that some aspects of the current implementation of the metrics code make it more or less impossible to implement metrics that follow OpenTelemetry's metric naming conventions.

In particular, the metric names in this PR are:
- required to start with a prefix, which is part of the name of the metric that's output to OpenTelemetry
- uppercase, for some reason?

I think that using prefixes to indicate the type of the metric is fine for now (although I'm definitely open to nicer ways of specifying metric types in the future), but I think we should change the code for recognizing metrics based on field name prefixes so that the metric that's actually emitted to the OpenTelemetry collector doesn't include a prefix like `MONOTONIC_COUNTER` in the actual name of the metric. We could do this by using `str::strip_prefix` in the visitor implementation, rather than `starts_with`, and using the field name without the prefix as the hashmap key.
In addition, I think we may want to change the prefixes we recognize. Both `tracing` and OpenTelemetry support dots in field names, and semantically, a dot can be used as a namespacing operator. I think we should probably use prefixes like `monotonic_counter.` rather than `MONOTONIC_COUNTER_` to more clearly separate the metric type from the rest of the metric name.
So, in summary, I think we should:
- make the metric name prefixes lowercase
- make the metric name prefixes end in a `.`
- strip the prefix when matching metric names before using them as hashmap keys
- change the metric names in the various examples and tests to be all lowercase
and then I'll be very happy to merge this PR!
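In other words, something along these lines (a sketch of the suggestion, with hypothetical helper and enum names, not code from the PR):

```rust
// Strip the type prefix so the name exported to OpenTelemetry doesn't include
// it, and use the stripped name as the hashmap key.
enum MetricKind {
    MonotonicCounter,
    // ...other instrument kinds
}

fn parse_metric_field(field_name: &'static str) -> Option<(MetricKind, &'static str)> {
    if let Some(name) = field_name.strip_prefix("monotonic_counter.") {
        // e.g. `monotonic_counter.http_requests` exports a metric named `http_requests`
        return Some((MetricKind::MonotonicCounter, name));
    }
    // ...similar arms for the other recognized prefixes
    None
}
```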
```rust
pub fn new(push_controller: PushController) -> Self {
    let meter = push_controller
        .provider()
        .meter(INSTRUMENTATION_LIBRARY_NAME, Some(CARGO_PKG_VERSION));
```
does this add additional dimensions to the emitted metrics?
yeah it's something that opentelemetry includes during metrics exporting
This patch changes the metric name prefixes from upper snake case to lowercase with a `.` at the end, to align with the OpenTelemetry metric naming conventions.
@hawkw, thanks -- done in one of my recent commits
Co-authored-by: David Barsky <[email protected]>
This patch renames OpenTelemetryMetricsSubscriber to MetricsSubscriber, to avoid the redundant word.
… into opentelemetry-metrics
Co-authored-by: David Barsky <[email protected]>
- Remove example of what *not* to do
- Change wording of a rustdoc that was too verbose
I addressed the PR comments in a new commit: 1e26171
In the upstream `opentelemetry` crate, the `trace` and `metrics` features are gated by separate feature flags. This allows users who are only using OpenTelemetry for tracing, or who are only using it for metrics, to pick and choose what they depend on. Currently, the release version of `tracing-opentelemetry` only provides tracing functionality, and therefore, it only depends on `opentelemetry` with the `trace` feature enabled. However, the metrics support added in #2185 adds a dependency on the `opentelemetry/metrics` feature. This is currently always enabled. We should probably follow the same approach as upstream `opentelemetry`, and allow enabling/disabling metrics and tracing separately. This branch adds a `metrics` feature to `tracing-opentelemetry`, and makes the `MetricsSubscriber` from #2185 gated on the `metrics` feature. This feature flag is on by default, like the upstream `opentelemetry/metrics` feature, but it can be disabled using `default-features = false`. We should probably do something similar for the tracing components of the crate, and make them gated on a `trace` feature flag, but adding a feature flag to released APIs is not semver-compatible, so we should save that until the next breaking release.
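For a downstream crate, opting out would then look roughly like this (a hypothetical `Cargo.toml` snippet; the version number is only a placeholder):

```toml
[dependencies]
# Disabling default features drops the `metrics` feature (and with it the
# `opentelemetry/metrics` dependency); the version shown is a placeholder.
tracing-opentelemetry = { version = "0.18", default-features = false }
```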
Motivation
Currently, there is no way to publish metrics via tracing.
Solution
Rendered docs
Update the tracing-opentelemetry crate to publish metrics for event fields that contain specific prefixes in the name.
Right now, we lazily instantiate and store one metrics object per-callsite, but a future improvement that we should add to tracing itself is the ability to store data per-callsite, so that we don't have to do a HashMap lookup each time we want to publish a metric.
Example
It's this simple:
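Below is a minimal sketch of the intended usage (illustrative, not the PR's exact example), assuming the `PushController`-based constructor shown earlier and the lowercase prefix convention; the exact import paths and type names should be treated as assumptions.

```rust
use opentelemetry::sdk::metrics::PushController;
use tracing_subscriber::prelude::*;

// Build the metrics subscriber from an exporter-specific `PushController`
// and install it alongside the rest of the `tracing-subscriber` stack.
fn init_metrics(push_controller: PushController) {
    let metrics = tracing_opentelemetry::MetricsSubscriber::new(push_controller);
    tracing_subscriber::registry().with(metrics).init();
}

fn handle_request() {
    // A field whose name carries a recognized prefix is published as a metric:
    // this adds 1 to a monotonic counter named `requests_served`.
    tracing::info!(monotonic_counter.requests_served = 1);
}
```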