Writing system metrics conventions into the specification #818
Comments
Yes, I need to add this when copying over the OTEP. @jmacd can you assign this to me? |
@james-bebbington also requested a process count metric open-telemetry/oteps#119 (comment):
Following up on #819, I think all of the The metrics API points out the similarity/confusion between
For most of those metrics, I'd imagine a distribution (summary) would be more valuable than an aggregate sum across machines. |
Now I'm not sure if the same thing applies to the |
Usually when I look at We've sometimes referred to this as a structural difference, between ValueRecorder and SumObserver. For a SumObserver (Adding Structure), we know that the differences between M and N and between (M+Offset) and (N+Offset) are equivalent, because Additive measurements are linear (e.g., difference between 1000s and 2000s of CPU time is the same as between 2000 and 3000). This applies to OTOH, a
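To illustrate the linearity point for additive measurements, here is a minimal sketch in plain Python. The function name `differences` and the sample readings are hypothetical and not part of any OpenTelemetry API; the sketch only shows that differences between successive cumulative readings are unchanged when every reading is shifted by the same offset.

```python
def differences(readings):
    """Return the change between each pair of successive readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# Hypothetical cumulative CPU-time readings, in seconds.
cpu_seconds = [1000.0, 2000.0, 3000.0]

# The same readings shifted by a constant offset (an assumption for
# illustration, e.g. a counter that started from a different baseline).
offset = 5000.0
shifted = [reading + offset for reading in cpu_seconds]

# Additive measurements are linear: the offset cancels out of every difference.
same_differences = differences(cpu_seconds) == differences(shifted)
```

Because the offset cancels, a rate derived from these differences does not depend on where the counter started.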
I looked over the UpDownSumObserver metrics there, and:
I'm not sure if I follow how utilization metrics wouldn't be linear in the same way, since they are just value ratios. E.g. "there was a 10% increase in X". I do agree that they should be ValueObserver though, since they don't really represent a sum.
@bogdandrutu, I'm not sure if I understood what makes this a sum. To measure it, you would just be observing the number of connections once each interval and export that value, where is the summation? |
@jmacd after the SIG meeting, I also want to clarify why OTEP 119 calls it
The same thing goes for metrics like I am happy to simplify it for the spec if anyone has a strong opinion against this convention. I'm not sure if these distinctions are even valuable. |
The number of connections is a "sum": the entity that calculates it performs a "sum" of +/- 1 each time a connection is opened or closed.
In other words if you were to instrument directly the calls of Open/Close connection you would use an |
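The argument above can be sketched in plain Python. `ConnectionPool` and `observe` are hypothetical names chosen for illustration, not OpenTelemetry APIs; the sketch assumes the observed connection count is maintained by adding +1 on open and -1 on close, so the value read once per interval equals the sum of all those changes.

```python
class ConnectionPool:
    """Hypothetical pool that tracks open connections as a running sum."""

    def __init__(self):
        self.open_count = 0   # running sum of +1 / -1 changes
        self.changes = []     # every individual change, kept for comparison

    def open(self):
        self.open_count += 1
        self.changes.append(+1)

    def close(self):
        self.open_count -= 1
        self.changes.append(-1)


def observe(pool):
    """Read the current count once, as an observer callback would."""
    return pool.open_count


pool = ConnectionPool()
for _ in range(5):
    pool.open()
for _ in range(2):
    pool.close()

observed = observe(pool)           # value seen by the periodic observation
summed = sum(pool.changes)         # value obtained by adding every change
```

Both `observed` and `summed` describe the same quantity, which is why the observed value can be treated as a sum even though no addition happens at observation time.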
@aabmass, regarding the naming when measuring time, I think that I'm unsure about the effects of using a sum aggregation on |
@aabmass I've come to understand this issue and definitely agree that
@kjordy the thing we are trying to define here somehow avoids talking about cumulative vs. delta measurements, and that's an area where the language can probably be improved in several places. The Sum aggregation for a series of delta measurements is computed with addition, whereas for a series of cumulative measurements the Sum aggregation is simply the point values.

@aabmass I've struggled with how to explain how to choose between Adding instruments, where measurements are "things we add", and Grouping instruments, where measurements are "things we average". One of the best resources I've found to explain these differences is "On the Theory of Scales of Measurement" (S. S. Stevens, 1946), which is old, findable, and worth reading. The reason @bogdandrutu states that connections should be sums is that each unit is the same countable amount: they are "interval" measurements in the terminology of that paper, whereas utilization is a "ratio" measurement. Each unit in a sum of connections is the same as another, whereas we can't say that about ratios.

Everyone: I'd like to improve the API specification on this matter and welcome suggestions & PRs. 😀

@aabmass I think we're at 💯 on moving OTEP 119 into the specification, where
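As a rough illustration of the delta vs. cumulative distinction, here is a minimal Python sketch. The function names and sample data are hypothetical, and reading "simply point values" as "take the latest cumulative point" is an assumption made for this example, not wording from the specification.

```python
from itertools import accumulate


def sum_from_deltas(deltas):
    """Sum aggregation over delta measurements: add them up."""
    return sum(deltas)


def sum_from_cumulative(points):
    """Sum aggregation over cumulative measurements: each point already
    holds the running total, so the latest point is the current sum."""
    return points[-1]


# Hypothetical delta measurements reported over four intervals.
deltas = [3, 1, 4, 2]

# The same series expressed as cumulative measurements.
cumulative = list(accumulate(deltas))

delta_total = sum_from_deltas(deltas)
cumulative_total = sum_from_cumulative(cumulative)
```

Both totals describe the same quantity; only the work needed to obtain it differs between the two representations.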
What are you trying to achieve?
Trying to finalize host and runtime metrics instrumentation plugins for OTel-Go.
In theory this instrumentation should emit metrics like those the collector would if it were reporting on the same host or container. I've been reviewing the hostmetrics receiver and notice it has "process.cpu.time" while OTEP 119 uses "system.cpu.time" in its examples. (CC: @aabmass)
I would specify that "process.cpu.time" reflects the process itself and "system.cpu.time" reflects the host system. This is consistent with the current OTel collector hostmetrics. Does that sound right?
The guidelines in OTEP 119 should be copied into the specification on metrics semantic conventions.
Additional context.
open-telemetry/oteps#119 discusses conventional metric naming guidelines.
https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver/hostmetricsreceiver/internal/scraper/processscraper is the hostmetrics code that generates CPU timing metrics.