
Add new JVM runtime environment metrics #44

Closed

Conversation

roberttoyonaga (Contributor)

This is the same PR as in the old semantic conventions location here: open-telemetry/opentelemetry-specification#3352 (review)

This PR adds process.runtime.jvm.cpu.monitor.time, process.runtime.jvm.network.io, process.runtime.jvm.network.time, and process.runtime.jvm.cpu.context_switch metrics to the runtime environment metrics.

Metric gathering implementations for these new metrics already exist in a basic form in https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry-jfr/library
Once the details around these new metrics are decided, the implementations can be updated.

JFR streaming would be used to gather these metrics. JFR streaming has only been available since JDK 14, so these metrics would only be supported on JDK 17+.
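To make the JFR-streaming approach concrete, here is a minimal sketch of how the proposed monitor metric could be fed from the `jdk.JavaMonitorEnter` event using the JDK 14+ `RecordingStream` API. The metric name (`process.runtime.jvm.cpu.monitor.duration`, per the renamed entry in the diff further down) and the `class`/`state` attributes follow this PR; the wiring through the OpenTelemetry Java metrics API is an illustrative assumption, not the actual runtime-telemetry-jfr implementation.

```java
import java.time.Duration;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;
import jdk.jfr.consumer.RecordingStream;

public final class MonitorDurationMetrics {
  public static void main(String[] args) {
    Meter meter = GlobalOpenTelemetry.getMeter("jfr-runtime-telemetry");
    DoubleHistogram monitorDuration =
        meter.histogramBuilder("process.runtime.jvm.cpu.monitor.duration")
            .setUnit("s")
            .setDescription("Time threads spend blocked on or waiting for monitors")
            .build();

    try (RecordingStream rs = new RecordingStream()) {
      // jdk.JavaMonitorEnter fires when a thread blocks entering a synchronized
      // section; jdk.JavaMonitorWait could be wired up the same way with state=wait.
      rs.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(1));
      rs.onEvent("jdk.JavaMonitorEnter", event -> {
        Attributes attrs = Attributes.of(
            AttributeKey.stringKey("state"), "blocked",
            AttributeKey.stringKey("class"), event.getClass("monitorClass").getName());
        // Record the blocked duration in seconds for this single monitor-enter event.
        monitorDuration.record(event.getDuration().toMillis() / 1000.0, attrs);
      });
      rs.start(); // blocks this thread; a real library would use startAsync()
    }
  }
}
```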

Please see the original discussion in open-telemetry/opentelemetry-java-instrumentation#7886 (comment) and at the Java + Instrumentation SIG.

Related issue: #1222

This metric is obtained from [`jdk.JavaMonitorWait`](https://sap.github.io/SapMachine/jfrevents/21.html#javamonitorwait) and [`jdk.JavaMonitorEnter`](https://sap.github.io/SapMachine/jfrevents/21.html#javamonitorenter) JFR events.

This metric SHOULD be specified with
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advice)
of `[]` (single bucket histogram capturing count, sum, min, max).
Contributor

Why are we recommending empty buckets? Do we not care about percentiles here?

Contributor Author

I agree it would be better to have a set of buckets, but I'm not sure what to best recommend. A thread could be blocked/waiting for any length of time. There is also a single bucket precedent for GC pause time. Maybe exponential buckets would make sense.

@jack-berg what was your reasoning for suggesting a single bucket:

> I recommend we downgrade this to a summary by default by specifying an empty set of explicit bucket bounds using advice.

Member

Reasoning is as follows:

  • We need a recommended set of default buckets since the SDK defaults won't be useful.
  • Coming up with sensible defaults is hard, requiring real world data for some distribution of apps.
  • A single bucket histogram, or effectively a summary, probably gives most people (> 50%) the data they want. E.g. you can look at the max time a monitor is held, see how much time monitors are waiting as a rate. The distribution of the monitor wait times is a nice to have and more niche. Let those people who need it opt into it with views, but don't burden everybody by default with the extra data.
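For reference, opting back into a real distribution would look roughly like the following with the OpenTelemetry Java SDK's View API. The instrument name assumes this PR's proposal, and the bucket boundaries are purely illustrative (the exponential variant is the option floated earlier in the thread).

```java
import java.util.List;

import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

public final class MonitorDurationViews {
  public static SdkMeterProvider withExplicitBuckets() {
    // (a MetricReader/exporter would also be registered in a real setup)
    return SdkMeterProvider.builder()
        .registerView(
            InstrumentSelector.builder()
                .setName("process.runtime.jvm.cpu.monitor.duration")
                .build(),
            View.builder()
                // Example boundaries in seconds, for illustration only.
                .setAggregation(Aggregation.explicitBucketHistogram(
                    List.of(0.001, 0.01, 0.1, 1.0, 10.0)))
                .build())
        .build();
  }

  public static SdkMeterProvider withExponentialBuckets() {
    return SdkMeterProvider.builder()
        .registerView(
            InstrumentSelector.builder()
                .setName("process.runtime.jvm.cpu.monitor.duration")
                .build(),
            View.builder()
                // Base-2 exponential buckets, as suggested as an alternative above.
                .setAggregation(Aggregation.base2ExponentialBucketHistogram())
                .build())
        .build();
  }
}
```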

Contributor

I don't buy the reasonable-defaults argument in all cases. I agree a summary is useful, but I still think there are likely SOME baseline buckets that could be provided.


This metric SHOULD be specified with
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advice)
of `[]` (single bucket histogram capturing count, sum, min, max).
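For concreteness, the `[]` advice corresponds to something like the following in recent versions of the OpenTelemetry Java metrics API (treat the exact `setExplicitBucketBoundariesAdvice` method name as an assumption if you are on an older release):

```java
import java.util.Collections;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;

public final class NetworkIoInstrument {
  static final DoubleHistogram NETWORK_IO =
      GlobalOpenTelemetry.getMeter("jfr-runtime-telemetry")
          .histogramBuilder("process.runtime.jvm.network.io")
          .setUnit("By")
          // Empty boundary advice -> a single-bucket histogram that still
          // exports count, sum, min, and max, but no distribution.
          .setExplicitBucketBoundariesAdvice(Collections.emptyList())
          .build();
}
```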
Contributor

Couldn't we define some buckets to understand better if our bytes are falling within standard packet sizes?

I'm not sure I understand this metric very well. What would count be here? Is count the number of packets, or the number of times JFR reports data (i.e. relatively useless to a user)?

If the latter, why isn't this just a Counter of some sort?

Contributor Author

Similar to the comment above, I'm not sure how to suggest the best bucket bounds. `count` would be the number of times JFR network I/O events are emitted (every time network I/O occurs), so I think count would still be informative: it lets the user know roughly how busy network I/O is.
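A sketch of what that means in practice, reusing the histogram from the previous snippet: each `jdk.SocketRead` JFR event becomes one histogram recording, so `count` is the number of socket reads and `sum` is total bytes. The direction attribute name is an assumption here, since it is still under discussion later in this thread.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import jdk.jfr.consumer.RecordingStream;

public final class NetworkIoRecorder {
  public static void main(String[] args) throws InterruptedException {
    // Attribute name still being settled in this PR (network.direction vs.
    // connection.direction); "network.direction" is used here for illustration.
    Attributes readAttrs =
        Attributes.of(AttributeKey.stringKey("network.direction"), "read");

    RecordingStream rs = new RecordingStream();
    rs.enable("jdk.SocketRead");
    // One record() call per JFR event: histogram count = number of socket reads,
    // histogram sum = total bytes read.
    rs.onEvent("jdk.SocketRead", event ->
        NetworkIoInstrument.NETWORK_IO.record(
            (double) event.getLong("bytesRead"), readAttrs));
    rs.startAsync();
    rs.awaitTermination(); // keep the daemon stream running
  }
}
```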

Contributor

I'd encourage thinking about setting baseline buckets to common network packet sizes here. I think even 1-2 buckets would dramatically increase the overall value of this metric.

Hell, even understanding whether the bytes written tend to be "larger than fits in a packet" or "smaller than fits in a packet" is a useful count to track.
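If baseline buckets were wanted, one hedged option would be a small set of packet-size-oriented boundaries supplied through the same advice mechanism. The boundaries below (small datagram, standard 1500-byte Ethernet MTU, jumbo frame, 64 KiB) are purely illustrative, not something proposed in this PR:

```java
import java.util.List;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;

public final class NetworkIoWithPacketBuckets {
  // Hypothetical baseline boundaries in bytes: small datagram, standard
  // 1500-byte Ethernet MTU, jumbo frame, and 64 KiB; anything larger falls
  // into the implicit overflow bucket.
  static final List<Double> PACKET_SIZE_BOUNDARIES =
      List.of(512.0, 1500.0, 9000.0, 65536.0);

  static final DoubleHistogram NETWORK_IO =
      GlobalOpenTelemetry.getMeter("jfr-runtime-telemetry")
          .histogramBuilder("process.runtime.jvm.network.io")
          .setUnit("By")
          .setExplicitBucketBoundariesAdvice(PACKET_SIZE_BOUNDARIES)
          .build();
}
```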

@@ -36,6 +36,10 @@ semantic conventions when instrumenting runtime environments.
* [Metric: `process.runtime.jvm.buffer.usage`](#metric-processruntimejvmbufferusage)
* [Metric: `process.runtime.jvm.buffer.limit`](#metric-processruntimejvmbufferlimit)
* [Metric: `process.runtime.jvm.buffer.count`](#metric-processruntimejvmbuffercount)
* [Metric: `process.runtime.jvm.cpu.monitor.duration`](#metric-processruntimejvmcpumonitorduration)
Member

You'll want to rebase, and we should decide whether any of these should be included in the effort for stable semantic conventions, or be experimental for now with a follow-up to stabilize them later.

roberttoyonaga (Contributor Author) commented Jun 16, 2023

I brought it up in open-telemetry/opentelemetry-specification#3419, but it seemed like the plan was to exclude them from initial stability. I think these metrics would be quite useful to have, but I guess they are not worth delaying stabilization for, since they only apply to Java 14+. I've moved things to process-runtime-jvm-metrics-experimental.yaml.

| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `class` | string | Class of the monitor. | `java.lang.Object` | Opt-In |
| `state` | string | Action taken at monitor. | `blocked`; `wait` | Recommended |
Member

Note that we may need to add a namespace to this, depending on the result of #51.

attributes:
- ref: thread.id
requirement_level: opt_in
- id: network.direction
Member

Should we extend the general network conventions to include the direction?

- id: network-connection-and-carrier

Contributor Author

Hmm, I think that makes sense. I've added a new attribute in the general network conventions (`connection.direction`) and referenced it from this metric.

@roberttoyonaga force-pushed the add-jfr-metrics branch 2 times, most recently from 5f033a9 to 70edf6e on June 16, 2023 16:42
trask (Member) commented Aug 14, 2023

@roberttoyonaga if you have a chance, it would be great to get the merge conflicts resolved here

@roberttoyonaga requested a review from a team on August 14, 2023 15:16
roberttoyonaga (Contributor Author)

> @roberttoyonaga if you have a chance, it would be great to get the merge conflicts resolved here

@trask done!


github-actions bot commented Feb 3, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label on Feb 3, 2024
joaopgrassi (Member)

Hi @roberttoyonaga !

We changed how the CHANGELOG.md is managed. Please take a look at https://github.com/open-telemetry/semantic-conventions/blob/main/CONTRIBUTING.md#adding-a-changelog-entry to see what needs to be done. Sorry for the disruption.

github-actions bot removed the Stale label on Feb 19, 2024

github-actions bot commented Mar 5, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label on Mar 5, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this on Mar 13, 2024