My team and I have come across a Consul issue related to agents failing, which eventually caused other problems. In response, we've made changes to collect and alert, via Prometheus, on various Consul metrics. For failed agents, we can use the consul.serf.member.failed metric (consul_serf_member_failed in Prometheus), from the Cluster Health metrics. Unfortunately, this metric isn't exposed after Consul startup when observing http://localhost:8500/v1/agent/metrics?format=prometheus. It seems that it only appears once the value is first incremented, along with the following metrics:
This is problematic for a few reasons, and it goes against best practices for exposing metrics to Prometheus (see here also for a good explanation). My main issue in this case is that I'd like to trigger an alert based on an increase of consul.serf.member.failed, similar to what I do with other metrics that are typed as counters in Prometheus.
For example, say we have the prometheus_retention_time flag set to 5m. An agent has been up for 5 minutes without any issues, and then the Consul agent is unexpectedly killed for some reason. In this case, the metric goes from being non-existent to being present with a value of 1:
....
consul.serf.member.failed 1
....
If I wanted to be able to detect a positive change in failed agents over the last 5 minutes, the increase function wouldn't work:
increase(consul_serf_member_failed[5m]) >= 1
In this case there is no actual increase in the value, as it starts off at 1, even though we implicitly know its appearance means there has been an increase. (The following Prometheus GitHub issue describes exactly this problem.) As an alternative, some might suggest a condition that simply checks whether the value is greater than 0:
consul_serf_member_failed > 0
although this is problematic when dealing with counters, as an increase may have occurred at some point in time and the issue may have been resolved since then. In that case, an alert with this condition would fire continuously until the counter is reset. This is especially problematic if you have prometheus_retention_time set to a high value (e.g. 24h), as a single failed member in that time period would cause an alert for the remainder of the day.
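To make the problem with the increase-based approach concrete, here is a rough Python model (not PromQL itself, and ignoring counter resets and extrapolation) of what an increase computed over a sample window sees when the metric only appears after the first failure versus when it is initialized to zero at startup:

```python
def naive_increase(samples):
    """Rough model of an increase over a window: difference between the
    last and first sample (ignores counter resets and extrapolation)."""
    if len(samples) < 2:
        return 0  # a single sample carries no rate information
    return samples[-1] - samples[0]

# Metric only appears after the first failure: every visible sample is 1.
appears_late = [1, 1, 1, 1]
# Metric initialized to 0 at startup: the jump to 1 is visible.
initialized = [0, 0, 1, 1]

print(naive_increase(appears_late))  # 0 -> the failure is invisible
print(naive_increase(initialized))   # 1 -> the failure is detected
```

The failure event itself is identical in both cases; only the presence of the initial zero sample makes it observable as an increase.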
As a solution, I propose that all available metrics, at least those available in Prometheus format, be exposed at agent startup and initialized with a zero value.
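The proposed behavior can be sketched as a minimal registry that pre-registers every known counter at zero, so each one is present in the exposition from the very first scrape. This is purely illustrative (the class, its method names, and the sibling metric names listed are assumptions, not Consul's actual implementation):

```python
class MetricsRegistry:
    """Minimal sketch: pre-register known counters at 0 so they appear
    in the Prometheus exposition before they are ever incremented."""

    # Hypothetical list of counters known at startup.
    KNOWN_COUNTERS = [
        'consul_serf_member_failed',
        'consul_serf_member_join',
        'consul_serf_member_left',
    ]

    def __init__(self):
        # Proposed behavior: initialize every counter with a zero value.
        self.counters = {name: 0 for name in self.KNOWN_COUNTERS}

    def incr(self, name, value=1):
        self.counters[name] = self.counters.get(name, 0) + value

    def exposition(self):
        # One "name value" sample line per counter, Prometheus text style.
        return '\n'.join(f'{k} {v}' for k, v in sorted(self.counters.items()))

reg = MetricsRegistry()
print('consul_serf_member_failed 0' in reg.exposition())  # True, before any failure
```

With this in place, the first scrape after a failure would show the counter moving from 0 to 1, which an increase-over-window query can detect.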
Reproduction Steps
Start up a Consul cluster with 3 agents
Watch the output of curl "http://localhost:8500/v1/agent/metrics?format=prometheus" every 1s
Kill one of the Consul agents
Observe that the consul.serf.member.failed metric has now appeared with a value of 1
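To automate the "observe" step above, a small helper (hypothetical, for illustration) can check each scrape of the metrics endpoint for the sample; here it is demonstrated against inline exposition text rather than a live agent, so it runs standalone:

```python
def metric_present(exposition, name):
    """True if a sample line for `name` (optionally with labels) exists
    in Prometheus text-format exposition output."""
    for line in exposition.splitlines():
        if line.startswith('#') or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        sample = line.split()[0]
        if sample == name or sample.startswith(name + '{'):
            return True
    return False

# Before the kill, the sample is simply absent from the scrape:
before = "# TYPE consul_serf_member_join counter\nconsul_serf_member_join 3\n"
# After the kill, it appears with a value of 1:
after = before + "consul_serf_member_failed 1\n"

print(metric_present(before, 'consul_serf_member_failed'))  # False
print(metric_present(after, 'consul_serf_member_failed'))   # True
```

In practice you would feed it the body of each curl response from http://localhost:8500/v1/agent/metrics?format=prometheus once per second.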
Consul info for both Client and Server
v1.4.4+
Hey @hartfordfive, thanks for spending the time to write out the feature request. We're aware of this behavior and recognize that it goes against Prometheus' best practices. Much of Consul's telemetry predates Prometheus' pull model and so we don't yet have a mechanism to initialize metrics (something you don't need to do with a wire protocol). We're actively working on this and you can expect it to be addressed in an upcoming release.