Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Runtime instrumentation #207
Runtime instrumentation #207
Changes from 4 commits
0ce19d6
b37d5c8
70e56c0
10887a2
3033276
7c101bd
e5b550b
9a806fa
e55e8eb
0a0eb3b
93b5323
e01714e
cc270ca
fc8360a
28ad40e
7293304
542a3d5
9056811
da17698
69ec265
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the names for the metrics in this PR should be prefixed with the
process.runtime.dotnet.
namespace. In this case something likeprocess.runtime.dotnet.assembly_count
(using an_
rather than-
since I think that's more commonly what I've seen in other semantic conventions).The spec has some general guidance here for runtime metrics: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md#runtime-environment-specific-metrics---processruntimeenvironment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Names are now prefixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twenzel this is doing exactly same as #177 as far as "configure the polling schedule" - except its hardcoded to 1 second. This PR is making this specific to runtime metrics as opposed to any available event counters including what other libraries/ customers might already have in place. @reyang is this what you had in mind for #207 - to restrict to runtime metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is dedicated to the runtime metrics and is using the API directly to retrieve the values.
Unfortunately sometimes the API is internal, so we've to fallback to the EventCounter.
The advantage of a this instrumentation is that it can use the direct API as soon as the runtime make it public. It will perform better as retrieving the values from event counters with less overhead.
The EventCounter instrumentation from #177 is a general purpose implementation to retrieve any existing EventCounter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally we want this "Runtime Instrumentation" to use the OpenTelemetry Metrics API rather than relying on the EventCounter. The main reason is efficiency and accuracy - especially considering Prometheus scenario that using an asynchronous instrument will make sure we only invoke the callback when Prometheus is scaping metrics, rather than having to poll the EventCounter based on some predetermined schedule.
If there are cases where we cannot use OpenTelemetry Metrics API (e.g. there is no proper API - such as ObservableUpDownCounter) or we cannot access the internal functionality of .NET runtime (e.g. if a certain EventCounter is implemented by calling some non-public API), we should discuss and see if we can find better approaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we call out the list of APIs, and we should work with the .NET Runtime team to see if it makes sense to make these public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hananiel the goal for #204 is to have a pure OpenTelemetry Metrics API based instrumentation library for .NET runtime - that does not rely on EventCounter at all (as a consequence, provides better efficiency, sampling accuracy and probably better mode - e.g. different gen of GC could be modeled as metric dimensions/attributes rather than each modeled as a separate metric).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cijothomas and I had a discussion with the .NET runtime team (@maryamariyan, @noahfalk and @tarekgh), here goes a quick summary:
RuntimeEventSourceHelper.GetCpuUsage()
) were introduced into the runtime to reduce dependencies, there is likely going to be public APIs in other .NET runtime assemblies that meet the requirements here.@noahfalk to share more insights.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GC.GetGenerationSize() could be replaced with GC.GetGCMemoryInfo().GenerationInfo[i].SizeAfterBytes
GetCpuUsage() is a bit messy design-wise because the computation depends on how long it has been since the API was last called. It works OK if there is only one consumer of the counter, but imagine what happens if there are two exporters and each of them invokes the callback. If we imagine each of them is polling on the same interval (lets say once per minute), then one of them will be reporting the average CPU usage over the last minute and the other one will be reporting the average CPU usage over the last ~1ms since the previous exporter invoked the callback. That tiny sampling window could give very skewed results. We might need to redefine the metric or reconsider how it should be computed. The underlying calculation should be something like:
At the beginning and end of an interval evaluate System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime, subtract the end from the beginning, and divide by (time_interval * Environment.ProcessorCount).
For the other three APIs I created a runtime issue to track adding them: dotnet/runtime#65989
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reyang @cijothomas - Talking with Maoni about the percent-time-in-gc counter a similar issue arises that I see with the cpu % counter. It seems like what we'd want to do is represent these things as ObservableCounters which measure the number of seconds where something was true:
CPU = seconds/interval where CPU was in use
GC = seconds/interval where GC had all threads paused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take an example to make sure I understand the proposal:
Option 1:
The consumer can configure the exporter to use delta temporality, so proper rate can be calculated downstream.
Option 2:
It seems Option 2 won't work, as the data being recorded won't fit any existing instrument (unless we want to pick something like async Gauge).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep exactly, option 1 is what I was suggesting we should start doing and option 2 is the problematic path that the cpu-usage EventCounter is doing right now.
Option (2) works ok with an ObservableGauge only if there is guaranteed exactly one observer/exporter, allowing the callback to infer the interval by timestamping when it was last called. As soon as there is a 2nd observer then the callback can't distinguish if the call at T2 is a new observer trying to measure the interval [0, T2] or the same observer that called at T1 now trying to measure [T1, T2].
At risk of getting a little into the weeds, the current runtime implementation of GetLastGCPercentTimeInGC() doesn't capture a timestamp of when it was last called, but the way it avoids doing that produces a measurement people are likely to find confusing. In your example above, imagine that the various GCs occured at:
GC 1: 1.0-1.2 (0.1 sec paused)
GC 2: 2.5-3.9 (0.9 sec paused)
GC 3: 4.0-4.1 (0.1 sec paused)
GC 4: 15.0-16.2 (1 sec paused)
You can see GCs can also do some work while the threads aren't paused. Adding up the pause times gives the same 0.1, 1.1 and 2.1 measurements as above when observed at T=1.5, 6.5, and 16.5.
If you measure at some time T, GetLastGCPercentTimeInGC() finds the most recent GC X that finished before time T and calculates GC_x_pause/ ( GC_x_end - GC_x-1_end). For example if you query at T=6.5 it determines GC 3 was the most recent completed GC before that point which ended at 4.1. Before that GC 2 ended at 3.9. That gives 0.1 pause time/(4.1-3.9) = 50% time in GC. Users were probably hoping to see (1.1-0.1)/(6.5-1.5)=20% time in GC and will be surprised not to see that. So the current GC stats don't suffer non-determinism when subjected to multiple observers, but only by using a surprising calculation that isn't necessarily a good approximation of what users really wanted.