Writing system metrics conventions into the specification #818

jmacd · 2020-08-17T20:23:15Z

What are you trying to achieve?

Trying to finalize host and runtime metrics instrumentation plugins for OTel-Go.

In theory this instrumentation should emit metrics like those the collector would if it were reporting on the same host or container. I've been reviewing the hostmetrics receiver and notice it has "process.cpu.time" while OTEP 119 uses "system.cpu.time" in its examples. (CC: @aabmass)

I would specify that "process.cpu.time" reflects the process itself and "system.cpu.time" reflects the host system. This is consistent with the current OTel collector hostmetrics, does it sound right?

The guidelines OTEP 119 should be copied into the specification on metrics semantic conventions.

Additional context.

open-telemetry/oteps#119 discusses conventional metric naming guidelines.

https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver/hostmetricsreceiver/internal/scraper/processscraper is the hostmetrics code that generates CPU timing metrics.

aabmass · 2020-08-17T22:16:54Z

I would specify that "process.cpu.time" reflects the process itself and "system.cpu.time" reflects the host system. This is consistent with the current OTel collector hostmetrics, does it sound right?

Yes, I need to add this when copying over the OTEP. @jmacd can you assign this to me?

aabmass · 2020-08-17T22:19:59Z

@james-bebbington also requested a process count metric open-telemetry/oteps#119 (comment):

I believe this PR is ready to be merged but when writing this up for the specs repo, it would be good to add a convention for process counts (with "state" = running / inactive)

aabmass · 2020-08-26T22:10:24Z

Following up on #819, I think all of the UpDownSumObserver I put in OTEP 119 should be changed to ValueObserver.

The metrics API points out the similarity/confusion between UpDownSumObserver and ValueObserver with the example of measuring queue size:

Therefore, considering the choice between ValueObserver and UpDownSumObserver, the recommendation is to choose the instrument with the more-appropriate default aggregation. If you are observing a queue size across a group of machines and the only thing you want to know is the aggregate queue size, use SumObserver because it produces a sum, not a distribution. If you are observing a queue size across a group of machines and you are interested in knowing the distribution of queue sizes across those machines, use ValueObserver.

For most of those metrics, I'd imagine a distribution (summary) would be more valuable than an aggregate sum across machines.

aabmass · 2020-08-26T22:13:46Z

Now I'm not sure if the same thing applies to the SumObserver instruments. E.g. for system.cpu.time, would you rather see the summed cpu time or a distribution of cpu time?

jmacd · 2020-08-27T22:57:50Z

Usually when I look at system.cpu.usage (the OTEP 119 name, a.k.a. system.cpu.time the OTel-Collector name), I want to see the rate in terms of CPUs per second. That way, the number can be compared with the available CPUs. A distribution of CPU usage deltas would have to be divided by each measurement's interval (i.e., end_time - start_time), I think, to make it useful as a distribution. That is, in a sense, what SumObserver and UpDownSumObserver signify: that you can compute a meaningful rate.

We've sometimes referred to this as a structural difference, between ValueRecorder and SumObserver. For a SumObserver (Adding Structure), we know that the differences between M and N and between (M+Offset) and (N+Offset) are equivalent, because Additive measurements are linear (e.g., difference between 1000s and 2000s of CPU time is the same as between 2000 and 3000). This applies to .usage metrics, which I still think should be SumObserver when monotonic.

OTOH, a .utilization metric does not have the same property. The differences between M and N and between (M+Offset) and (N+Offset) are NOT necessarily equivalent (e.g., difference between 0.1 and 0.2 not the same as between 0.7 and 0.8).

Following up on #819, I think all of the UpDownSumObserver I put in OTEP 119 should be changed to ValueObserver.

I looked over the UpDownSumObserver metrics there, and:

Agree that all .utilization metrics should become ValueObserver
Still think .usage metrics should use SumObserver or UpDownSumObserver
No strong opinion for system.network.connections.

aabmass · 2020-08-28T00:48:10Z

OTOH, a .utilization metric does not have the same property. The differences between M and N and between (M+Offset) and (N+Offset) are NOT necessarily equivalent (e.g., difference between 0.1 and 0.2 not the same as between 0.7 and 0.8).

I'm not sure if I follow how utilization metrics wouldn't be linear in the same way, since they are just value ratios. E.g. "there was a 10% increase in X". I do agree that they should be ValueObserver though, since they don't really represent a sum.

No strong opinion for system.network.connections.

@bogdandrutu, I'm not sure if I understood what makes this a sum. To measure it, you would just be observing the number of connections once each interval and export that value, where is the summation?

aabmass · 2020-08-28T00:48:34Z

@jmacd after the SIG meeting, I also want to clarify why OTEP 119 calls it system.cpu.time and not usage. Here's what I put in the Semantic Conventions section of the otep:

usage - an instrument that measures an amount used out of a known total amount should be called entity.usage. For example, system.filesystem.usage for the amount of disk spaced used. A measure of the amount of an unlimited resource consumed is differentiated from usage. This may be time, data, etc.
...

time - an instrument that measures passage of time should be called entity.time. For example, system.cpu.time with varying values of label state for idle, user, etc.

The same thing goes for metrics like system.disk.operations, where you can't "use" a certain number of operations available, its unbounded. The other complication is for something like disk, you have system.disk.{io,operations,time} -- which one would you want to call "usage"?

I am happy to simplify it for the spec if anyone has a strong opinion against this convention. I'm not sure if these distinctions are even valuable.

bogdandrutu · 2020-08-31T21:10:50Z

@bogdandrutu, I'm not sure if I understood what makes this a sum. To measure it, you would just be observing the number of connections once each interval and export that value, where is the summation?

The number of connections is a "sum" the entity that calculates that does a "sum" +/- 1 when a connection is open/close.

bogdandrutu · 2020-08-31T21:22:47Z

In other words if you were to instrument directly the calls of Open/Close connection you would use an UpDownCounter, not a ValueRecorder

kjordy · 2020-09-01T13:50:04Z

@aabmass, regarding the naming when measuring time, I think that .time is clearer in denoting that it's a measurement of the number of seconds passed whereas .utilization is the ratio compared to the total time. It’s useful to see CPU as a rate in terms of CPUs per second (same thing jmacd said above), and I think CPU usage and utilization are both commonly used to name that rate metric. The Semantic Conventions section of the otep defines utilization as the ratio so I think it would be consistent for metrics that measure time to use .time and .utilization.

I'm unsure about the effects of using a sum aggregation on .usage. The metrics proto says these use a Sum type which "represents the type of a numeric int scalar metric that is calculated as a sum of all reported measurements over a time interval." But for things like filesystem.usage, I would want to see the output as a gauge. Is my interpretation of the conversation from the instrument to the metric type correct?

jmacd · 2020-09-03T06:11:58Z

I also want to clarify why OTEP 119 calls it system.cpu.time and not usage

@aabmass I've come to understand this issue and definitely agree that .time is right when measuring a sum of duration measurements. I somehow had misunderstood and thought OTEP 119 suggested .usage and the collector was in disagreement by using .time.

"represents the type of a numeric int scalar metric that is calculated as a sum of all reported measurements over a time interval."

@kjordy the thing we are trying to define here somehow avoids talking about cumulative vs. delta measurements, and that's an area where language can probably be improved in several places. The Sum aggregation for a series of delta measurements is computed with addition, whereas the Sum aggregation for a series of cumulative measurements are simply point values.

@aabmass I've struggled with how to explain this idea of how to choose between Adding instruments, where measurements are "things we add", and Grouping instruments, where measurements are "things we average". One of the best resources I've found to explain these differences is "On the Theory of Scales of Measurement" (S. S. Stevens, 1946), which is old, findable, and worth reading.

The reason why @bogdandrutu states that connections should be sums, is that they are things where each unit is the same countable amount--they are "interval" measurements in the terminology of that paper, whereas utilization is a "ratio" measurement in the terminology of that paper. Each unit in a sum of connections is the same as another, whereas we can't say that about ratios.

Everyone: I'd like to improve the API specification on this matter, welcome suggestions & PRs. 😀

@aabmass I think we're at 💯 on moving OTEP 119 into the specification, where .utilization is a ValueObserver and .time is SumObserver, and .usage is usually UpDownSumObserver.

jmacd added the spec:metrics Related to the specification/metrics directory label Aug 17, 2020

jmacd mentioned this issue Aug 18, 2020

Host metrics instrumentation open-telemetry/opentelemetry-go-contrib#231

Merged

jmacd added priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Aug 19, 2020

jmacd assigned aabmass Aug 20, 2020

aabmass mentioned this issue Aug 26, 2020

Update System Metrics open-telemetry/opentelemetry-python#1019

Merged

aabmass mentioned this issue Sep 9, 2020

System metrics semantic conventions #937

Merged

jmacd closed this as completed in #937 Oct 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Writing system metrics conventions into the specification #818

Writing system metrics conventions into the specification #818

jmacd commented Aug 17, 2020

aabmass commented Aug 17, 2020 •

edited

Loading

aabmass commented Aug 17, 2020

aabmass commented Aug 26, 2020

aabmass commented Aug 26, 2020

jmacd commented Aug 27, 2020

aabmass commented Aug 28, 2020

aabmass commented Aug 28, 2020

bogdandrutu commented Aug 31, 2020

bogdandrutu commented Aug 31, 2020

kjordy commented Sep 1, 2020

jmacd commented Sep 3, 2020

Writing system metrics conventions into the specification #818

Writing system metrics conventions into the specification #818

Comments

jmacd commented Aug 17, 2020

aabmass commented Aug 17, 2020 • edited Loading

aabmass commented Aug 17, 2020

aabmass commented Aug 26, 2020

aabmass commented Aug 26, 2020

jmacd commented Aug 27, 2020

aabmass commented Aug 28, 2020

aabmass commented Aug 28, 2020

bogdandrutu commented Aug 31, 2020

bogdandrutu commented Aug 31, 2020

kjordy commented Sep 1, 2020

jmacd commented Sep 3, 2020

aabmass commented Aug 17, 2020 •

edited

Loading