-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new JVM runtime environment metrics #44
Add new JVM runtime environment metrics #44
Conversation
specification/metrics/semantic_conventions/runtime-environment-metrics.md
Outdated
Show resolved
Hide resolved
specification/metrics/semantic_conventions/runtime-environment-metrics.md
Outdated
Show resolved
Hide resolved
This metric is obtained from [`jdk.JavaMonitorWait`](https://sap.github.io/SapMachine/jfrevents/21.html#javamonitorwait) and [`jdk.JavaMonitorEnter`](https://sap.github.io/SapMachine/jfrevents/21.html#javamonitorenter) JFR events. | ||
|
||
This metric SHOULD be specified with | ||
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we recommending empty buckets? Do we not care about percentiles here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be better to have a set of buckets, but I'm not sure what to best recommend. A thread could be blocked/waiting for any length of time. There is also a single bucket precedent for GC pause time. Maybe exponential buckets would make sense.
@jack-berg what was your reasoning for suggesting a single bucket:
I recommend we downgrade this to a summary by default by specifying an empty set of explicit bucket bounds using advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reasoning is as follows:
- We need a recommended set of default buckets since the SDK defaults won't be useful.
- Coming up with sensible defaults is hard, requiring real world data for some distribution of apps.
- A single bucket histogram, or effectively a summary, probably gives most people (> 50%) the data they want. E.g. you can look at the max time a monitor is held, see how much time monitors are waiting as a rate. The distribution of the monitor wait times is a nice to have and more niche. Let those people who need it opt into it with views, but don't burden everybody by default with the extra data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't buy the reasonable defaults argument in all cases. I agree a summary is useful, but I still think there's likely SOME baseline buckets that could be provided.
|
||
This metric SHOULD be specified with | ||
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advice) | ||
of `[]` (single bucket histogram capturing count, sum, min, max). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we define some buckets to understand better if our bytes are falling within standard packet sizes?
I'm not sure I understand this metric so well. What would count
be here? Is count
the number of packets or the number of times JFR reports data (i.e. relativley useless to a user).
If the later, why isn't this just a Counter of some sort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the comment above, I'm not sure how to suggest the best bucket bounds. count
would be number of times JFR network io events are emitted (every time network IO occurs). So I think count
would still be informative. It lets the user know roughly how busy network IO is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd encourage thinking about setting baseline buckets to common network packet sizes here. I think the value of even 1-2 buckets would dramatically increase the overall value of this metric.
Hell, even understanding if the bytes written tend to be "larger than fits on a packet" or "smaller than fits on a packet" is a useful count to track.
@@ -36,6 +36,10 @@ semantic conventions when instrumenting runtime environments. | |||
* [Metric: `process.runtime.jvm.buffer.usage`](#metric-processruntimejvmbufferusage) | |||
* [Metric: `process.runtime.jvm.buffer.limit`](#metric-processruntimejvmbufferlimit) | |||
* [Metric: `process.runtime.jvm.buffer.count`](#metric-processruntimejvmbuffercount) | |||
* [Metric: `process.runtime.jvm.cpu.monitor.duration`](#metric-processruntimejvmcpumonitorduration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll want to rebase and we should decide whether any of these should be included in the effort for stable semantic conventions, or be experimental for now and followup stabilizing them later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I brought it up here open-telemetry/opentelemetry-specification#3419 but it seemed like the plan was to exclude them from initial stability. I think these metrics would be quite useful to have, but I guess are not worth delaying stabilization for since they only apply to Java 14+. I've moved things to process-runtime-jvm-metrics-experimental.yaml
specification/metrics/semantic_conventions/runtime-environment-metrics.md
Outdated
Show resolved
Hide resolved
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `class` | string | Class of the monitor. | `java.lang.Object` | Opt-In | | ||
| `state` | string | Action taken at monitor. | `blocked`; `wait` | Recommended | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should note that we we may need to add a namespace to this dependent on the result of #51.
attributes: | ||
- ref: thread.id | ||
requirement_level: opt_in | ||
- id: network.direction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we extend the general network conventions to include the direction?
- id: network-connection-and-carrier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I think that makes sense. I've added a new attribute in the general network conventions (connection.direction
) and referenced it from this metric
5f033a9
to
70edf6e
Compare
@roberttoyonaga if you have a chance, it would be great to get the merge conflicts resolved here |
214a44d
to
0ab671f
Compare
@trask done! |
…-metrics.md Co-authored-by: jack-berg <[email protected]>
Co-authored-by: jack-berg <[email protected]>
…-metrics.md Co-authored-by: jack-berg <[email protected]>
7c4776d
to
9730964
Compare
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Hi @roberttoyonaga ! We changed how the CHANGELOG.md is managed. Please take a look at https://github.com/open-telemetry/semantic-conventions/blob/main/CONTRIBUTING.md#adding-a-changelog-entry to see what needs to be done. Sorry for the disruption. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
This is the same PR as in the old semantic conventions location here: open-telemetry/opentelemetry-specification#3352 (review)
This PR adds process.runtime.jvm.cpu.monitor.time, process.runtime.jvm.network.io, process.runtime.jvm.network.time, and process.runtime.jvm.cpu.context_switch metrics to the runtime environment metrics.
Metric gathering implementations for these new metrics already exist in a basic form in https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry-jfr/library
Once the details around these new metrics are decided, the implementations can be updated.
JFR streaming would be used to gather these metrics. This feature has only been available since JDK 14 so these metrics would only be supported for JDK17+.
Please see original discussion in this open-telemetry/opentelemetry-java-instrumentation#7886 (comment) and at the Java + Instrumentation SIG.
Related issues #1222