How to create an "up" metric #2923

tomasmota · 2022-11-07T13:50:59Z

Sorry if this is out of place here; I tried asking in Slack but did not get a satisfactory answer.

What are you trying to achieve?
I want to create an up metric for my services.
When using bare prometheus, due to the pull mechanism, it is easy to know that if the scrape failed, the service is down. With otel, since you set a metric-expiration (which is nice to have for other metrics), the service might no longer be sending the up metric but it is still reported as being there with a value of 1, so we won't know that the service is down until it expires.
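To make the problem concrete, here is a toy model of a last-value cache with an expiration window. It is only a sketch: `ExpiringCache`, `record`, and `scrape` are hypothetical names and this is not how the collector's exporter is implemented. It just shows why a constant `up=1` keeps being reported after the sender has gone away.

```python
from dataclasses import dataclass


@dataclass
class CachedPoint:
    value: float
    received_at: float  # time the last sample arrived, in seconds


class ExpiringCache:
    """Toy last-value cache: a sample is exposed until it is older than `expiration`."""

    def __init__(self, expiration: float) -> None:
        self.expiration = expiration
        self._points: dict[str, CachedPoint] = {}

    def record(self, name: str, value: float, now: float) -> None:
        self._points[name] = CachedPoint(value, now)

    def scrape(self, now: float) -> dict[str, float]:
        # Drop samples that outlived the expiration window, expose the rest unchanged.
        self._points = {
            name: point
            for name, point in self._points.items()
            if now - point.received_at <= self.expiration
        }
        return {name: point.value for name, point in self._points.items()}


cache = ExpiringCache(expiration=300.0)
cache.record("up", 1.0, now=0.0)   # the service reports once, then crashes
print(cache.scrape(now=60.0))      # {'up': 1.0}  (still looks healthy)
print(cache.scrape(now=301.0))     # {}           (the series only disappears now)
```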

Is there some way to achieve this that I'm overlooking? Something I considered as a workaround is having an uptime metric, or current time metric, that we would expect to keep increasing if the service is up.

tomasmota · 2022-11-08T06:40:54Z

Forgot to mention, this is using the prometheus exporter

dashpole · 2022-11-11T15:01:14Z

The prometheus up metric is basically a healthcheck-as-a-metric. You could achieve almost the same behavior by having something external to the application periodically health-check the application, and report a metric based on the result. It won't be tied to your other metrics in the same way the prometheus up metric is, meaning this metric being in the healthy state doesn't imply your other metrics have been successfully collected in the way it does for prometheus. But it does give you an external health signal. The httpcheck receiver in the collector might be able to fulfil that role.
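As a rough illustration of the external health-check idea, the sketch below probes an HTTP endpoint and turns the result into a 0/1 value that could then be reported as a metric. The function name `probe` and the injectable `opener` parameter are assumptions made for this example; it is a generic stand-in, not the httpcheck receiver's behavior.

```python
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 2.0,
          opener: Callable = urllib.request.urlopen) -> int:
    """Return 1 if the endpoint answers with a 2xx status, otherwise 0."""
    try:
        with opener(url, timeout=timeout) as response:
            return 1 if 200 <= response.status < 300 else 0
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error status, or a malformed URL.
        return 0
```

Because the probe runs outside the application, a value of 1 only says the endpoint answered; it says nothing about whether the application's other metrics were collected.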

Alternatively, you could introduce a metric in the application whose value is the current time, and alert when the value of the metric is older than a threshold.
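The alerting side of the current-time approach can be sketched as a simple age check. The function name and the `max_age_s` parameter are hypothetical; the threshold would have to exceed the reporting interval plus any pipeline delay.

```python
import time
from typing import Optional


def heartbeat_is_stale(heartbeat_ts: float, max_age_s: float,
                       now: Optional[float] = None) -> bool:
    """True when the last reported timestamp is older than `max_age_s` seconds."""
    now = time.time() if now is None else now
    return (now - heartbeat_ts) > max_age_s


# The service last reported its clock at t=1000; we evaluate at t=1200.
print(heartbeat_is_stale(1000.0, max_age_s=120.0, now=1200.0))  # True
print(heartbeat_is_stale(1000.0, max_age_s=300.0, now=1200.0))  # False
```

Unlike a constant `up=1`, the reported value itself goes stale when the service stops, so the check still works while the last sample is being held by an expiration window.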

jsuereth · 2022-11-11T15:01:22Z

If you're using a prometheus exporter + pull-based system, then an up metric should be created BY prometheus and makes a lot of sense.

up metrics aren't really the same in a push-based world. Typically, you'd have a heartbeat metric rather than an up, and you need to interact with them in subtly different ways. up means the endpoint was alive, the CPU responded, and so on. With a heartbeat, you need to expect it every N seconds, and delays/shifts can signal problems, but with a lot of noise.
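One way to picture the "expect it every N seconds" check is to look at the spacing between heartbeat arrivals. This is a minimal sketch; `late_heartbeats` and `tolerance_s` are names invented here, and the tolerance is the knob that trades noise against detection delay.

```python
def late_heartbeats(arrivals: list[float], interval_s: float,
                    tolerance_s: float = 0.0) -> list[tuple[float, float]]:
    """Return (previous, current) arrival pairs spaced further apart than allowed."""
    limit = interval_s + tolerance_s
    return [
        (prev, cur)
        for prev, cur in zip(arrivals, arrivals[1:])
        if cur - prev > limit
    ]


# Heartbeats expected every 10 s, with 2 s of slack for network jitter.
print(late_heartbeats([0.0, 10.0, 20.0, 45.0, 55.0], interval_s=10.0, tolerance_s=2.0))
# [(20.0, 45.0)]
```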

Generally, I think the notion of "health" is important, but I wouldn't use the exact same solution for pull-based health + push-based health.

I know @jmacd has done a lot of thinking here. Josh, let us know if you think this idea has legs or if we should go a different direction.

tomasmota · 2022-11-11T22:11:34Z

I appreciate the feedback and suggestions. I might very well go with the current time solution, as there doesn't seem to be any other way to emulate a "liveness" metric in this setup.

As you can see from the diagram, @jsuereth, it is a mix of push and pull, which is why it is hard to give the "up" metric responsibility to Prometheus. Is there a standardized way of implementing such a heartbeat like you suggest? Or would using something like current time or seconds since start-up be as good as any other solution?

graph TD
    A[Service A] -->|send otlp| B(local otel collector)
    T[Service B] -->|send otlp| F(local otel collector)
    B[local otel collector] -->|send otlp| C(gateway otel collector)
    F[local otel collector] -->|send otlp| C(gateway otel collector)
    D(prometheus) -->|scrape| C(gateway otel collector)

jsuereth · 2022-11-14T17:10:08Z

Totally understand this concern. You want a unified view of "up" in the world of mixed push/pull metrics.

My own thoughts here are that we should have something matching your diagram in how we monitor "up" for services, e.g. a way of monitoring like this:

graph TD
    A[Pull-based Up metric] --> B(Derived uptime metric)
    T[Push-based Heartbeat metric] --> B(Derived uptime metric)

That is, a "derived" metric would be one that can query/join across other metrics to give a cohesive view.
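A derived metric of this kind could look roughly like the following. This is only a sketch under an assumed rule (healthy if either available signal says so); the thread does not define how the two inputs should be combined, and `derived_up` and its parameters are hypothetical.

```python
from typing import Optional


def derived_up(pull_up: Optional[int], last_heartbeat_ts: Optional[float],
               now: float, max_heartbeat_age_s: float) -> int:
    """Combine a pull-based up value and a push-based heartbeat into one 0/1 signal.

    Assumption: a service counts as up if either signal that exists reports healthy.
    """
    if pull_up == 1:
        return 1
    if (last_heartbeat_ts is not None
            and now - last_heartbeat_ts <= max_heartbeat_age_s):
        return 1
    return 0


# Push-only service with a fresh heartbeat, then the same service after it went quiet.
print(derived_up(None, last_heartbeat_ts=990.0, now=1000.0, max_heartbeat_age_s=60.0))  # 1
print(derived_up(None, last_heartbeat_ts=900.0, now=1000.0, max_heartbeat_age_s=60.0))  # 0
```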

In any case, I think this topic actually deserves an "expert group" to think through and propose a good working convention for OTEL here. It's worth collecting some metric experts, in addition to figuring out where Prometheus stands on this issue.

cc @jmacd @reyang @gouthamve for some attention.

reyang · 2022-11-14T17:35:27Z

@tomas-mota what does "up" mean? A concrete scenario might help as I bet everyone will have their different version of understanding of "up".

tomasmota · 2022-11-14T17:52:25Z

My requirement here is simply "I want a metric that I can reliably query in order to know whether or not the service is running". As previously explained, a simple up=1 metric does not work here because of the expiration time in the collector.

reyang · 2022-11-14T18:02:32Z

> My requirement here is simply "I want a metric that I can reliably query in order to know whether or not the service is running". As previously explained, a simple up=1 metric does not work here because of the expiration time in the collector.

I don't fully understand the ask here, maybe you only consider one single instance (do you have a service with multiple instances)? Anyways I think one could send the local timestamp as a gauge.

tomasmota · 2022-11-14T18:11:33Z

Exactly, I was just trying to figure out if there would be an "otel" way of stating that the service is up. Using the timestamp is also totally fine for me; I just wonder, like @jsuereth, if there should be a more out-of-the-box way of doing this, so it is clear for other people.

jsuereth · 2022-11-15T16:41:40Z

@jmacd is going to take this to the Prometheus WG and attempt to make progress specifically on how to handle "up" metrics from OTLP into Prometheus.

jmacd · 2022-11-15T16:44:38Z

Related OTEP:
open-telemetry/oteps#185 (@jsuereth)

Related issues:

#1078 (on topic, closed in favor of the OTEP above).

Tangentially-related:
#2711
#1273

jsuereth · 2022-11-15T17:04:52Z

Thanks for reminding me of that proposal. I still think the general shape of the proposal is right, if we had the right names attached to it :)

jmacd · 2022-11-16T17:08:00Z

#2825
#2824

tomasmota · 2022-11-17T15:09:54Z

Awesome, thanks for taking this on @jsuereth ! Should I close this issue then?

tomasmota added the spec:metrics label Nov 7, 2022

github-actions bot assigned jsuereth Nov 7, 2022

jsuereth added the enhancement label Nov 11, 2022

jsuereth assigned jmacd and unassigned jsuereth Nov 15, 2022

djaglowski mentioned this issue Dec 27, 2022

[receiver/mongodb]: Add uptime/health metrics open-telemetry/opentelemetry-collector-contrib#17022

Merged

jmacd mentioned this issue Jan 6, 2023

Specify MeterProvider configurable cardinality limits #2960

Merged

tomasmota mentioned this issue Apr 20, 2023

Semantic conventions for Uptime Monitoring open-telemetry/oteps#185

Closed
