Add `process.uptime` and `system.uptime` metrics to semantic conventions #2824

andrzej-stencel · 2022-09-23T11:05:51Z

Fixes open-telemetry/semantic-conventions#648

Changes

Adds new metrics: process.uptime and system.uptime to the semantic conventions.

Related issue: open-telemetry/opentelemetry-collector-contrib#14130

reyang · 2022-09-23T14:55:31Z

specification/metrics/semantic_conventions/process-metrics.md

+| `process.threads`               | UpDownCounter                                      | {threads} | Process threads count.                                                                                                              |                                                                                                                                                                                                 |
+| `process.open_file_descriptors` | UpDownCounter                                      | {count}   | Number of file descriptors in use by the process.                                                                                   |                                                                                                                                                                                                 |
+| `process.context_switches`      | Counter                                            | {count}   | Number of times the process has been context switched.                                                                              | `type` SHOULD be one of: `involuntary`, `voluntary`                                                                                                                                             |
+| `process.uptime`                | Counter                                            | s         | Number of seconds that the process has been running.                                                                                |                                                                                                                                                                                                 |


Should this be a counter or gauge?

I believe it should be a gauge, the value represents the uptime of the system at the given time of recording.
Any further aggregation or calculations hold no additional statistical meaning.

~~Agreed, gauge is more appropriate~~

I think we should be debating whether this is a Counter or an UpDownCounter. I will put my rationale in the main thread.

Do we agree on "what is uptime"? I suspect we are not on the same page 🤣

https://en.wikipedia.org/wiki/Uptime#Using_uptime

If, let's say, we have a Chrome browser running 10 tabs (which might give us 11 processes), do we expect the uptime to be added across the 11 processes, and what does that mean?

If the browser is closed, then reopened, what does the uptime mean?

The Linux `uptime` command seems to focus on "how long has it been since the operating system started" (https://en.wikipedia.org/wiki/Uptime#Linux), and the uptime would reset if the system restarted.

If we take the semantics from https://en.wikipedia.org/wiki/Uptime#Records:

> A Cisco router has been reported to have been running continuously for 21 years.

then making it a counter sounds wrong?

jamesmoessis

Just a few minor comments. Thanks for raising this PR!

jamesmoessis · 2022-09-26T07:11:49Z

specification/metrics/semantic_conventions/process-metrics.md

+| `process.threads`               | UpDownCounter                                      | {threads} | Process threads count.                                                                                                              |                                                                                                                                                                                                 |
+| `process.open_file_descriptors` | UpDownCounter                                      | {count}   | Number of file descriptors in use by the process.                                                                                   |                                                                                                                                                                                                 |
+| `process.context_switches`      | Counter                                            | {count}   | Number of times the process has been context switched.                                                                              | `type` SHOULD be one of: `involuntary`, `voluntary`                                                                                                                                             |
+| `process.uptime`                | Counter                                            | s         | Number of seconds that the process has been running.                                                                                |                                                                                                                                                                                                 |


~~Agreed, gauge is more appropriate~~

jamesmoessis · 2022-09-26T07:16:51Z

specification/metrics/semantic_conventions/system-metrics.md

@@ -29,6 +30,14 @@ instruments not explicitly defined in the specification.

 ## Metric Instruments

+### `system.` - General system metrics


A question I asked myself - would it make sense to have a namespace for system metadata like this? Perhaps system.info.*. It would mean you could group other system information together like system.info.boottime, system.info.uptime and any others. Personally I'm not sure, but it's something to think about if we are looking to add other system metadata to the semconv.

jmacd · 2022-09-26T17:36:06Z

Counter is appropriate because the value is monotonic. We are explicitly interested in detecting resets via this metric, which suggests that UpDownCounter is appropriate.

Note the expression `rate(uptime)` is definitely meaningful and useful; it is semantically identical to (but operationally different from) the Prometheus `up` metric. However, if we let uptime be a Counter and the user prefers Delta aggregation temporality, the exported data (i.e., uptime in delta temporality) substantially loses utility but is (IMO) technically still correct. This suggests, again, UpDownCounter to avoid degraded utility due to Delta temporality.

I think uptime should not be a Gauge because Gauge metric series do not include start timestamps, which is a key aspect of detecting overlapping series -- IMO critical if we are to derive an up-like metric from uptime.

As for statistics, the sum (thus, the rate) can meaningfully be aggregated. Consider 10 processes running at a point in time--the rate(sum(uptime)) is 10 and equals sum(up) in a Prometheus setting.

This works for spatial aggregation: the `rate(uptime)` usefully equals the number of processes that are up, and the sum meaningfully equals their total uptime. This works with subdivided metrics: I can label uptime with an exclusive state attribute (e.g., "idle", "running", "shutdown", ...), and now the sum and the rate can be grouped by state.

This also works for temporal aggregation: if a process has been started and stopped and restarted over a period of time, we can divide `rate(uptime)` by the elapsed time to derive the process's fractional uptime. You can divide `process.cpu.time` / `process.uptime` to calculate the average CPUs used by one or many processes.

Note that I'm writing rate(uptime) informally. The important detail hiding here is that I want to query this like a Counter in the sense that when a process disappears and restarts, the "reset" which results in a constant offset in the timeseries does not impact the rate. Thus for most purposes, we should think of uptime as a Counter we explicitly do not reset, thus an UpDownCounter.
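To make the aggregation argument above concrete, here is a minimal Python sketch. It is not code from this PR or from any OpenTelemetry SDK: the `increase` and `rate` helpers and all sample values are hypothetical, and the reset handling (treating a drop as a restart from zero) is an assumption modeled loosely on how Prometheus-style `rate()` treats counters.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Total growth of a cumulative series, treating any drop as a reset.

    After a reset, the new value is counted as growth from zero, so a
    restart does not show up as a negative step.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples: Sequence[float], interval_s: float) -> float:
    """Per-second growth over the window covered by evenly spaced samples."""
    elapsed = interval_s * (len(samples) - 1)
    return increase(samples) / elapsed


# Hypothetical uptime readings (seconds), sampled every 10 s.
p1 = [100.0, 110.0, 120.0, 130.0]  # up for the whole window
p2 = [5.0, 15.0, 25.0, 35.0]       # up for the whole window
p3 = [50.0, 60.0, 4.0, 14.0]       # restarted between the 2nd and 3rd sample

# Spatial aggregation: the summed rate counts the processes that are up.
processes_up = rate(p1, 10) + rate(p2, 10)

# Temporal aggregation: a restarted process has a rate below 1.
fractional_uptime = rate(p3, 10)

# Hypothetical cumulative CPU seconds for p1 over the same window.
cpu_p1 = [40.0, 45.0, 50.0, 55.0]
average_cpus = increase(cpu_p1) / increase(p1)
```

With these made-up samples, each always-up process contributes a rate of 1, so the summed rate matches the process count, while the restarted process contributes less than 1.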

jamesmoessis · 2022-09-28T01:43:34Z

@jmacd you put a good case forward for it being an UpDownCounter, and after reading your points I think I agree.

It seems that an Asynchronous UpDownCounter would result in meaningful aggregations for these metrics, as per your examples. Note that it must be asynchronous because the absolute value is reported, and the API doesn't allow synchronous counters to report absolute values. These implementation details are not referenced in the semantic convention, but they are worth mentioning here.

jsuereth · 2022-10-04T13:04:59Z

I have a few complaints, mostly based on naming.

* `up` and `uptime` metrics can be problematic in practice. Particularly if we use UpDownCounter in Prometheus, we may be creating an alerting nightmare for folks interacting with Gauge alignment issues.

* Having a metric which captures `process.running_time` seems reasonable to me. However, just because the process is running doesn't mean it's "up". For that you need additional health checks, like some kind of synthetic monitoring or other "alive" signal from the process. Given the description though, I would _not_ label this `process.uptime`, particularly because "Up" and "health" metrics should really be derived from raw signals, of which this appears to be one.

andrzej-stencel · 2022-10-05T09:19:59Z

> I have a few complaints, mostly based on naming.
>
> * `up` and `uptime` metrics can be problematic in practice. Particularly if we use UpDownCounter in Prometheus, we may be creating an alerting nightmare for folks interacting with Gauge alignment issues.
>
> * Having a metric which captures `process.running_time` seems reasonable to me. However, just because the process is running doesn't mean it's "up". For that you need additional health checks, like some kind of synthetic monitoring or other "alive" signal from the process. Given the description though, I would _not_ label this `process.uptime`, particularly because "Up" and "health" metrics should really be derived from raw signals, of which this appears to be one.

I see your point. I agree with regard to the `up` metric, but I'm not sure about `uptime`. How about `system.uptime` - does this sound OK? The uptime name is commonly used to describe the time a system has been running, regardless of whether it was responsive or not.

Josh, would your preference be to name the system metric `system.uptime` and the process metric `process.running_time`? Or to name both `running_time`?

reyang · 2022-10-05T16:43:58Z

> I see your point. I agree with regard to the `up` metric, but I'm not sure about `uptime`. How about `system.uptime` - does this sound OK? The uptime name is commonly used to describe the time a system has been running, regardless of whether it was responsive or not.
>
> Josh, would your preference be to name the system metric `system.uptime` and the process metric `process.running_time`? Or to name both `running_time`?

I think there are three things we should clearly distinguish.

Imagine a server that started at 10AM, ran until 11AM, was then put to sleep/hibernate, woke up at 1PM and ran until it was shut down at 2PM. Then the server started again at 3PM and has been running until now (4PM).

I guess the uptime would be 1 hour, the "runaway time" would be 5 hours, and the "running time" would be 3 hours. It is totally fine if we want to use different terms for these concepts; I just want to point out that these are different things, and so far from this PR it's hard to understand which of them exactly we want.
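The three figures in this example can be checked with a short Python sketch. The data layout, the helper names and the calendar date are all made up for illustration; only the clock times come from the scenario above.

```python
from datetime import datetime, timedelta


def at(hour: int) -> datetime:
    """Hypothetical helper: a timestamp on an arbitrary day at the given hour."""
    return datetime(2022, 10, 5, hour)


# Each boot session: (boot time, shutdown time or "now", suspend intervals).
sessions = [
    (at(10), at(14), [(at(11), at(13))]),  # 10AM-2PM, asleep 11AM-1PM
    (at(15), at(16), []),                  # 3PM until now (4PM)
]


def since_boot(session) -> timedelta:
    """Wall-clock time between boot and shutdown, including suspend."""
    start, end, _ = session
    return end - start


def awake(session) -> timedelta:
    """Time between boot and shutdown with suspend intervals subtracted."""
    _, _, suspends = session
    asleep = sum((end - start for start, end in suspends), timedelta())
    return since_boot(session) - asleep


# Time since the most recent boot.
uptime = since_boot(sessions[-1])

# Total booted time across all sessions, including suspend
# (the quantity called "runaway time" in the comment above).
booted_total = sum((since_boot(s) for s in sessions), timedelta())

# Total time actually awake across all sessions.
running_total = sum((awake(s) for s in sessions), timedelta())
```

Under these assumptions the three values come out as 1 hour, 5 hours and 3 hours respectively, matching the numbers in the comment.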

andrzej-stencel · 2022-10-06T06:44:04Z

> Imagine a server that started at 10AM, ran until 11AM, was then put to sleep/hibernate, woke up at 1PM and ran until it was shut down at 2PM. Then the server started again at 3PM and has been running until now (4PM).
>
> I guess the uptime would be 1 hour, the "runaway time" would be 5 hours, and the "running time" would be 3 hours.

Thanks Reiley for laying it out clearly. Yes, you could say these are three different pieces of information.

Where this pull request started is this proposal in the contrib repo: open-telemetry/opentelemetry-collector-contrib#14130. It talks specifically about `system.uptime` (and not `process.uptime`) in the context of existing functionality in other metrics collectors. Collectd/SignalFX has the uptime plugin and Telegraf has the system input; both report the system uptime (at least for Linux) as defined in `/proc/uptime`, i.e. the uptime of the system (including time spent in suspend), which I suppose in the example above would be 5 hours?
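For reference, a minimal Python sketch of reading that value on Linux. The function names are hypothetical and this is not how collectd, Telegraf or the collector implement it; it only assumes the documented `/proc/uptime` layout of two space-separated numbers in seconds (system uptime including suspend, then idle time).

```python
from pathlib import Path
from typing import Tuple


def parse_proc_uptime(text: str) -> Tuple[float, float]:
    """Split the contents of /proc/uptime into (uptime_s, idle_s)."""
    fields = text.split()
    return float(fields[0]), float(fields[1])


def read_system_uptime(path: str = "/proc/uptime") -> float:
    """Seconds since boot as reported by the kernel (Linux only)."""
    uptime_s, _idle_s = parse_proc_uptime(Path(path).read_text())
    return uptime_s
```

Note that the first field counts from the most recent boot, so it restarts from zero after a reboot.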

reyang · 2022-10-06T17:10:24Z

> > Imagine a server that started at 10AM, ran until 11AM, was then put to sleep/hibernate, woke up at 1PM and ran until it was shut down at 2PM. Then the server started again at 3PM and has been running until now (4PM).
> > I guess the uptime would be 1 hour, the "runaway time" would be 5 hours, and the "running time" would be 3 hours.
>
> Thanks Reiley for laying it out clearly. Yes, you could say these are three different pieces of information.
>
> Where this pull request started is this proposal in the contrib repo: open-telemetry/opentelemetry-collector-contrib#14130. It talks specifically about `system.uptime` (and not `process.uptime`) in the context of existing functionality in other metrics collectors. Collectd/SignalFX has the uptime plugin and Telegraf has the system input; both report the system uptime (at least for Linux) as defined in `/proc/uptime`, i.e. the uptime of the system (including time spent in suspend), which I suppose in the example above would be 5 hours?

@astencel-sumo thanks! I think we should clarify it in the spec to avoid confusion and misinterpretation. Wikipedia seems to have a different explanation, e.g. https://en.wikipedia.org/wiki/Uptime#Determining_system_uptime

If it is 5 hours, it looks like a good fit for Counter; if it is 1 hour, it looks like a good fit for Gauge.

reyang

The semantics and the wording "has been running" are murky. I think we should be crystal clear about #2824 (comment).

github-actions · 2022-10-27T03:47:50Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

andrzej-stencel · 2022-11-07T12:17:27Z

I don't currently have an answer on how to resolve this. Given that this PR does not block my work and that I have other more important PRs that I want to focus on, I'm going to close this PR. For anybody interested in continuing this work, feel free to use this work in any way.

andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from 60189f8 to d09d87a Compare September 23, 2022 11:06

andrzej-stencel marked this pull request as ready for review September 23, 2022 11:07

andrzej-stencel requested review from a team September 23, 2022 11:07

github-actions bot assigned jsuereth Sep 23, 2022

This was referenced Sep 23, 2022

Add process.start_time resource attribute to semantic conventions #2825

Closed

[receiver/hostmetrics] Add process.uptime metric open-telemetry/opentelemetry-collector-contrib#14460

Closed

reyang reviewed Sep 23, 2022

reyang mentioned this pull request Sep 23, 2022

Support Elastic Common Schema in OpenTelemetry open-telemetry/oteps#199

Closed

jamesmoessis approved these changes Sep 26, 2022

andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from d09d87a to facafef Compare September 30, 2022 14:00

andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from 78e91f5 to ad4ce5e Compare October 5, 2022 09:03

andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from ad4ce5e to c4bfb93 Compare October 13, 2022 13:09

reyang requested changes Oct 19, 2022

github-actions bot added the Stale label Oct 27, 2022

Add process.uptime and system.uptime metrics to semantic conventions

97994e9

andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from c4bfb93 to 97994e9 Compare October 31, 2022 14:05

andrzej-stencel closed this Nov 7, 2022

jmacd mentioned this pull request Nov 16, 2022

How to create an "up" metric #2923

Open

makeavish mentioned this pull request May 29, 2023

[receiver/mysql]: add mysql.uptime metric open-telemetry/opentelemetry-collector-contrib#14747

Merged

andrzej-stencel mentioned this pull request May 29, 2024

Additional system attributes open-telemetry/opentelemetry-collector-contrib#31627

Closed

mjwolf mentioned this pull request Jul 22, 2024

Add additional process fields from ECS open-telemetry/semantic-conventions#993

Merged

andrzej-stencel commented Sep 23, 2022

reyang Sep 23, 2022

MovieStoreGuy Sep 26, 2022

jamesmoessis Sep 26, 2022 (edited)

jmacd Sep 26, 2022

reyang Sep 28, 2022 (edited)

reyang Sep 28, 2022

jamesmoessis left a comment

jamesmoessis Sep 26, 2022 (edited)

jamesmoessis Sep 26, 2022

jmacd commented Sep 26, 2022 (edited)

jamesmoessis commented Sep 28, 2022

jsuereth commented Oct 4, 2022

andrzej-stencel commented Oct 5, 2022

reyang commented Oct 5, 2022

andrzej-stencel commented Oct 6, 2022 (edited)

reyang commented Oct 6, 2022

reyang left a comment

github-actions bot commented Oct 27, 2022

andrzej-stencel commented Nov 7, 2022
