Aligning metrics with OTel semantic conventions #4482
Can you also align the resilience tags?
Surely I will, thanks for pointing that out.
@lmolkova - are you interested in taking a look at these?
  /// </remarks>
  /// <seealso cref="System.Diagnostics.Metrics.Instrument"/>
  public static class ResourceUtilizationCounters
  {
      /// <summary>
-     /// Gets the CPU consumption of the running application in percentages.
+     /// Gets the CPU consumption of the running application in range <c>[0, 1]</c>.
Stepping back, I have a question about all utilizations measured as percentages.
Related discussions and PRs:
- Runtime instrumentation: Can time-in-gc metric be published ? open-telemetry/opentelemetry-dotnet-contrib#1163 (comment)
- [Instrumentation.Process] Added the metrics for CpuTime and CpuUtilization. open-telemetry/opentelemetry-dotnet-contrib#625 (comment)
- [Instrumentation.Process] Added Cpu related metrics and addressed comments. open-telemetry/opentelemetry-dotnet-contrib#612 (comment)
FYI @noahfalk IIRC for garbage collector metrics we decided to use absolute value (e.g. total time spent in GC since the beginning of the process lifecycle), and encourage users to derive utilization based on their actual need (so the user can choose whatever sliding window and sampling frequency).
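To make the "derive utilization from an absolute value" idea concrete, here is a minimal sketch, not code from this PR. It samples the cumulative CPU time of the current process twice and divides the delta by the CPU time that was available over the window. The helper name `DeriveUtilization` and the one-second window are assumptions made for illustration; the caller is free to pick any window and sampling frequency.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of any library discussed here): derives CPU
// utilization in [0, 1] from two readings of a cumulative CPU-time counter
// taken a known wall-clock interval apart.
static double DeriveUtilization(TimeSpan cpuStart, TimeSpan cpuEnd, TimeSpan elapsed, int cores)
{
    if (elapsed <= TimeSpan.Zero || cores <= 0)
    {
        return 0.0;
    }

    double used = (cpuEnd - cpuStart).TotalMilliseconds;
    double available = elapsed.TotalMilliseconds * cores;
    return Math.Clamp(used / available, 0.0, 1.0);
}

var process = Process.GetCurrentProcess();
TimeSpan cpuStart = process.TotalProcessorTime;
var stopwatch = Stopwatch.StartNew();

// The window length is the consumer's choice; one second is an arbitrary example.
Thread.Sleep(TimeSpan.FromSeconds(1));

process.Refresh(); // re-read the cached process counters
TimeSpan cpuEnd = process.TotalProcessorTime;

double utilization = DeriveUtilization(cpuStart, cpuEnd, stopwatch.Elapsed, Environment.ProcessorCount);
Console.WriteLine($"CPU utilization over the window: {utilization:P1}");
```

Because the raw counter is cumulative, two consumers can compute utilization over different windows from the same stream of samples, which is the flexibility the absolute-value approach is meant to preserve.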
As I recall this component has a sampling mechanism and sliding window built into it and the app developer can configure some of those parameters or accept defaults. So when this component reports an instantaneous utilization measurement of 73%, it really means that over the last X minutes some aggregation of Y CPU usage samples had an average CPU utilization of 73% for the app developer's chosen X and Y. It's not the pattern I'd expect most metrics to follow, as there is extra implementation complexity and it can't accurately be re-aggregated to compute other sliding windows later. However, because the app developer configures a sampling window of known fixed size, it doesn't suffer from the issues that those other metrics had, where they inferred a sample window from the polling frequency or from some unpredictable process behavior.
A while back I encouraged the engineers who were working on this to simplify and not maintain their own sliding window sampling reservoir, but I feel like what is there now is still well defined and likely useful as long as the sampling window is configured well up-front. I didn't object to this even though I expect metrics of this style will be an outlier in the long run. Having something like this shouldn't block us from later creating an alternate metric that emits absolute measures of CPU and memory usage with more aggregation flexibility.
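As a rough illustration of the pattern described above (a fixed-size window of samples whose aggregate is reported as the "current" value), here is a minimal sketch. It is not the ResourceMonitoring implementation; the `SlidingWindow` type, its capacity, and the use of a plain mean are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Feed four utilization samples into a window that keeps only the last three.
var window = new SlidingWindow(capacity: 3);
foreach (double sample in new[] { 0.90, 0.60, 0.30, 0.60 })
{
    window.Add(sample);
}

Console.WriteLine($"Samples in window: {window.Count}, average: {window.Average:F2}");

// Hypothetical type: keeps the most recent `capacity` samples and reports their mean.
sealed class SlidingWindow
{
    private readonly Queue<double> _samples = new();
    private readonly int _capacity;

    public SlidingWindow(int capacity)
    {
        if (capacity <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(capacity));
        }

        _capacity = capacity;
    }

    public int Count => _samples.Count;

    public void Add(double sample)
    {
        _samples.Enqueue(sample);
        if (_samples.Count > _capacity)
        {
            _samples.Dequeue(); // evict the oldest sample
        }
    }

    // Mean of the retained samples; 0 when the window is empty.
    public double Average
    {
        get
        {
            if (_samples.Count == 0)
            {
                return 0.0;
            }

            double sum = 0.0;
            foreach (double s in _samples)
            {
                sum += s;
            }

            return sum / _samples.Count;
        }
    }
}
```

Once only the average is emitted, the individual samples are gone, which is why a value produced this way cannot be re-aggregated into a different window later.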
@reyang do you feel like that addresses your concerns or you are still worried here?
I'm not too concerned about having alternative metrics that customers can conveniently use - as long as there is clarity:
- What exactly does this metric mean. Refer to https://github.com/dotnet/extensions/pull/4482/files#r1347628433.
- What's the recommendation if there are multiple ways to get similar things - e.g. pros/cons, pitfalls, best practices. For example, https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process.
> - What's the recommendation if there are multiple ways to get similar things - e.g. pros/cons, pitfalls, best practices. For example, https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process.
@noahfalk given OpenTelemetry.Instrumentation.Process is still in Preview, I think we should make a decision about the direction - e.g. all CPU/Memory metrics will be exposed by Microsoft.Extensions.Diagnostics.ResourceMonitoring, and we will deprecate OpenTelemetry.Instrumentation.Process and point customers to Microsoft.Extensions.Diagnostics.ResourceMonitoring. Meanwhile we'll make sure Microsoft.Extensions.Diagnostics.ResourceMonitoring works for all supported versions of the runtime + operating systems. Does this make sense, and do you feel we can make a call before .NET 8 GA?
@Yun-Ting FYI since you've worked on OpenTelemetry.Instrumentation.Process.
My thought would be:
In .NET 8 - Point devs to these ResourceMonitoring metrics. Either deprecate OTel.Instrumentation.Process or adjust guidance so that we suggest it only as a fallback for devs that need the metric in a specific form not already handled by ResourceMonitoring.
In the future (.NET 9?) - Implement the core CPU + memory metrics currently in OTel.Instrumentation.Process inside System.Diagnostics.DiagnosticSource. I think metrics like process.cpu.time or process.memory.virtual are so commonly desired that we shouldn't be asking devs to add extra library dependencies or new background polling services (serviceCollection.AddResourceMonitoring()) in order to enable them. At this point ResourceMonitoring would serve two roles going forward: (1) support for networking metrics, which are a little more involved to configure; (2) back-compat for projects already using the other process metrics defined here and wanting to continue doing so.
What does everyone think?
> My thought would be: In .NET 8 - Point devs to these ResourceMonitoring metrics.
Just confirming - point devs that are using any supported version of .NET (core/framework), not just limited to folks who use .NET 8?
Yeah, I meant as of November 2023 (or whenever this assembly releases as stable), all .NET devs should be able to use ResourceMonitoring to satisfy their need for CPU/Virtual Memory metrics.
NOTE: After I wrote that I saw that this library has a weirdly inconsistent approach to how it measures memory. I'm presuming that gets resolved. If it didn't then I wouldn't feel comfortable recommending it.
src/Libraries/Microsoft.Extensions.Diagnostics.HealthChecks.Common/Metric.cs
  public static partial HealthCheckReportCounter CreateHealthCheckReportCounter(Meter meter);

- [Counter("name", "status", Name = @"R9\\HealthCheck\\UnhealthyHealthCheck")]
+ [Counter("health_check.name", "health.status", Name = "health_check.unhealthy_checks")]
Fyi @timmydo - A little while back you were looking for a counter on health checks. Perhaps this functionality matches what you wanted?
Co-authored-by: Noah Falk <[email protected]>
…/MetricCollector.cs
src/Libraries/Microsoft.Extensions.Resilience/Resilience/Internal/ResilienceTagNames.cs
- public const string FailureReason = "failure-reason";
+ public const string FailureReason = "resilience.failure.reason";
Based on what I see here https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Resilience/Resilience/Internal/ResilienceMetricsEnricher.cs, we can use the error.type attribute here (and the full exception name).
One note here. These are just additional tags on top of the ones built into Polly:
https://www.pollydocs.org/advanced/telemetry.html#metrics
One of the tags is exception.type, as defined by:
https://opentelemetry.io/docs/specs/otel/trace/exceptions/
There is an overlap here between exception.type and error.type, and I am wondering how we should handle this.
Just for context:
This is how HttpResponseMessage is translated into FailureResultContext:
Line 156 in 1247364
return FailureResultContext.Create(
And this is how exceptions are translated:
Line 37 in 1247364
context.Tags.Add(new(ResilienceTagNames.FailureSource, e.Source));
Can we trim the FailureResultContext, or maybe just add the following tags?
- error.summary: will be the status code, such as 500, for HttpResponseMessage, or the exception summary for exceptions.
- exception.source: applicable only for exceptions (or even don't add anything here)
Thank you for the context!
To the original comment: if Polly already reports exception.type, is there any reason to report resilience.failure.reason at all, since it's always an exception type?
> Can we trim the FailureResultContext or maybe just add the following tags?
> error.summary: will be status code such as 500 for HttpResponseMessage or exception summary for exceptions.
> exception.source: applicable only for exceptions (or even don't add anything here)
I'd avoid adding new tags to OTel-defined namespaces - this is against OTel recommendations: https://github.com/open-telemetry/opentelemetry-specification/blob/563958cb2bd8529990f19fdce7a5f3643bf63091/specification/common/attribute-naming.md?plain=1#L128
But I believe that if the exception summary has low cardinality, you can put it into the error.type attribute, which is intended to capture all kinds of error codes, and fall back to the (full) exception type.
We can also introduce a dotnet.exception.source attribute and (hopefully) reuse it in other parts of .NET and 3rd party libraries.
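The fallback order suggested here could be sketched roughly as follows. This is an illustration only, not the enricher from this repository: the helper name `ResolveErrorType`, its parameters, and the choice to treat status codes of 400 and above as errors are all assumptions.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical helper: picks a low-cardinality value for the error.type tag.
// Order of preference (an assumption for this sketch):
//   1. an HTTP status code that indicates failure,
//   2. a low-cardinality exception summary, if one is available,
//   3. the full exception type name as the fallback.
// Returns null when there is nothing to report.
static string? ResolveErrorType(int? httpStatusCode, string? summary, Exception? exception)
{
    if (httpStatusCode is int code && code >= 400)
    {
        return code.ToString(CultureInfo.InvariantCulture);
    }

    if (!string.IsNullOrEmpty(summary))
    {
        return summary;
    }

    return exception?.GetType().FullName;
}

Console.WriteLine(ResolveErrorType(500, null, null));                               // 500
Console.WriteLine(ResolveErrorType(null, "TimeoutRejected", new TimeoutException())); // TimeoutRejected
Console.WriteLine(ResolveErrorType(null, null, new InvalidOperationException()));    // System.InvalidOperationException
```

Keeping every branch low-cardinality matters because each distinct tag value creates a separate time series in the metrics backend.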
I agree with having just:
- error.type - to represent both the exception summary and the failure reason for results
- dotnet.exception.source
And at initial release we won't add tags for the following properties?
- FailureResultContext.Source
- FailureResultContext.AdditionalInformation
extensions/src/Libraries/Microsoft.Extensions.Resilience/Resilience/FailureResultContext.cs
Line 13 in 261b354
public readonly struct FailureResultContext
Maybe we can slim the FailureResultContext down then, or maybe drop it completely. This is because Polly v8 supports registering custom enrichers, so we can just register an HTTP-based enricher with Polly directly, without additional layers.
cc @geeknoid
PR with the changes is up:
@xakep139 Is this ready to go in?
…ing/Windows/WindowsCounters.cs Co-authored-by: Liudmila Molkova <[email protected]>
The changes look good to me, but I believe we still need to confirm a few things:
- do we expect the exception summarizer to produce low-cardinality summaries
- do we still need the failure reason attribute
- what exactly the CPU utilization metric reports, and whether it needs state attributes (as the OTel one has)
- public const string DependencyName = "dep-name";
+ public const string DependencyName = "dotnet.resilience.dependency.name";
- public const string DependencyName = "dotnet.resilience.dependency.name";
+ public const string DependencyName = "dotnet.dependency.name";
The idea behind adding these tags to resilience metrics was that we can correlate resilience events with the HTTP metrics. In that sense, the name of this tag should be the same as the tag used to track RequestName and DependencyName for HTTP scenarios.
Do we know how RequestMetadata.DependencyName and RequestMetadata.RequestName are represented in the OTel world?
In the current HttpClient metrics there are no such dimensions: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/built-in-metrics-system-net#instrument-httpclientrequestduration
OTel doesn't specify this either: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-metrics.md#metric-httpclientrequestduration
I agree we need to unify these dimensions/attributes if they will represent the same things in HttpClient metering (hopefully we'll add these bits eventually) and resilience libraries.
Yup, correct: these are not strictly tied to resilience, so it would be better not to include it in the name, to avoid breaking changes in the future.
Ideally, this would be the same name as for HTTP metrics.
@geeknoid can you please clarify the expected cardinality that comes from the exception summarizer? Regarding CPU utilization, we agreed with folks (Matej, Noah and you) that we don't cover it in that PR.
Correct, this is also based on a recommendation by @lmolkova
@xakep139 Yes, the point of the summarizer is specifically to produce low cardinality values devoid of privacy-sensitive information.
We agreed with @martintmk that I will apply the naming changes he made in #4600 to this PR; specifically, we keep these attributes:
Everything else is removed from Resilience tag names.
I believe you take the request and dependency terms from Application Insights, correct? There is no direct translation of a span name to a metric tag; I'd say you'd have a metric with the same name as the span. If this is not an option, then consider
…ing/ResourceMonitoringOptions.Linux.cs
Excellent work!
Fixes #4432
The final names are:
Microsoft.Extensions.Diagnostics.ResourceMonitoring:
- process.cpu.utilization, unit 1, values: 0.0 - 1.0
- dotnet.process.memory.virtual.utilization, unit 1, values: 0.0 - 1.0
- system.network.connections, unit {connection}, tags/attributes:
  - network.transport = tcp
  - network.type = [ipv4, ipv6]
  - system.network.state = [close, close_wait, closing, delete, established, fin_wait_1, fin_wait_2, last_ack, listen, syn_recv, syn_sent, time_wait]
Microsoft.AspNetCore.HeaderParsing:
- aspnetcore.header_parsing.parse_errors, tags/attributes:
  - error.type
  - aspnetcore.header_parsing.header.name
- aspnetcore.header_parsing.cache_accesses, tags/attributes:
  - aspnetcore.header_parsing.header.name
  - aspnetcore.header_parsing.cache_access.type
Microsoft.Extensions.Diagnostics.HealthChecks:
- dotnet.health_check.reports, tags/attributes:
  - dotnet.health_check.status of type HealthStatus, [Degraded, Healthy, Unhealthy]
- dotnet.health_check.unhealthy_checks, tags/attributes:
  - dotnet.health_check.name
  - dotnet.health_check.status of type HealthStatus, [Degraded, Healthy, Unhealthy]

- error.type
- request.dependency.name - to correlate with HttpClient metrics
- request.name - to correlate with HttpClient metrics
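For readers who want to see what one of these names looks like on the wire, the following sketch records a measurement with the health-check tag names listed above and observes it with a `MeterListener` from `System.Diagnostics.Metrics`. It does not use the library's generated counters; the meter name `Example.HealthChecks` and the tag values are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter name; the real library registers its own meter.
using var meter = new Meter("Example.HealthChecks");
Counter<long> unhealthy = meter.CreateCounter<long>("dotnet.health_check.unhealthy_checks");

// Subscribe only to instruments published by the example meter.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Example.HealthChecks")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Print each measurement together with its tags.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    var parts = new List<string>();
    foreach (KeyValuePair<string, object?> tag in tags)
    {
        parts.Add($"{tag.Key}={tag.Value}");
    }

    Console.WriteLine($"{instrument.Name} +{value} [{string.Join(", ", parts)}]");
});
listener.Start();

// Record one unhealthy check using the tag names from the list above.
unhealthy.Add(1,
    new KeyValuePair<string, object?>("dotnet.health_check.name", "database"),
    new KeyValuePair<string, object?>("dotnet.health_check.status", "Unhealthy"));
```

The same listener pattern works for any of the instruments above once the corresponding meter name is known.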