My team and I have come across a Consul issue related to agents failing, which eventually caused other problems. In response, we've made changes to collect and alert, via Prometheus, on various Consul metrics. For failed agents, we can use the consul.serf.member.failed metric (consul_serf_member_failed in Prometheus), from the Cluster Health metrics. Unfortunately, this metric isn't exposed after Consul startup when observing http://localhost:8500/v1/agent/metrics?format=prometheus. It seems that it only appears once the value is first incremented, along with the following metrics:
This is problematic for a few reasons, and it goes against best practices for exposing metrics to Prometheus (see here also for a good explanation). My main issue in this case is that I'd like to trigger an alert based on an increase of consul.serf.member.failed, similar to what I do with other metrics that are typed as counters in Prometheus.
For example, say we have the prometheus_retention_time flag set to 5m. An agent has been up for 5 minutes without any issues, and then the Consul agent is unexpectedly killed for some reason. In this case, the metric goes from being non-existent to being present with a value of 1:
....
consul.serf.member.failed 1
....
If I wanted to be able to detect a positive change in failed agents over the last 5 minutes, the increase function wouldn't work:
increase(consul_serf_member_failed[5m]) >= 1
In this case there is no actual increase in the value, as it starts off at 1, even though we implicitly know its appearance means there has been an increase. (The following Prometheus GitHub issue describes exactly this problem.) As an alternative, some might suggest a condition that simply checks whether the value is greater than 0:
consul_serf_member_failed > 0
although this is problematic when dealing with counters, as an increase may have occurred at some point in time and the issue may have been resolved since then. In that case, an alert with this condition would fire continuously until the counter is reset. This is especially problematic if you have prometheus_retention_time set to a high value (e.g. 24h), as a single failed member in that time period would cause an alert for the remainder of the day.
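To make the problem with the increase-based approach concrete, here is a rough Python model (not PromQL itself, and ignoring counter resets and extrapolation) of what an increase computed over a sample window sees when the metric only appears after the first failure versus when it is initialized to zero at startup:

```python
def naive_increase(samples):
    """Rough model of an increase over a window: difference between the
    last and first sample (ignores counter resets and extrapolation)."""
    if len(samples) < 2:
        return 0  # a single sample carries no rate information
    return samples[-1] - samples[0]

# Metric only appears after the first failure: every visible sample is 1.
appears_late = [1, 1, 1, 1]
# Metric initialized to 0 at startup: the jump to 1 is visible.
initialized = [0, 0, 1, 1]

print(naive_increase(appears_late))  # 0 -> the failure is invisible
print(naive_increase(initialized))   # 1 -> the failure is detected
```

The failure event itself is identical in both cases; only the presence of the initial zero sample makes it observable as an increase.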
As a solution, I propose that all available metrics, at least those available in Prometheus format, be exposed at agent startup and initialized with a zero value.
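The proposed behavior can be sketched as a minimal registry that pre-registers every known counter at zero, so each one is present in the exposition from the very first scrape. This is purely illustrative (the class, its method names, and the sibling metric names listed are assumptions, not Consul's actual implementation):

```python
class MetricsRegistry:
    """Minimal sketch: pre-register known counters at 0 so they appear
    in the Prometheus exposition before they are ever incremented."""

    # Hypothetical list of counters known at startup.
    KNOWN_COUNTERS = [
        'consul_serf_member_failed',
        'consul_serf_member_join',
        'consul_serf_member_left',
    ]

    def __init__(self):
        # Proposed behavior: initialize every counter with a zero value.
        self.counters = {name: 0 for name in self.KNOWN_COUNTERS}

    def incr(self, name, value=1):
        self.counters[name] = self.counters.get(name, 0) + value

    def exposition(self):
        # One "name value" sample line per counter, Prometheus text style.
        return '\n'.join(f'{k} {v}' for k, v in sorted(self.counters.items()))

reg = MetricsRegistry()
print('consul_serf_member_failed 0' in reg.exposition())  # True, before any failure
```

With this in place, the first scrape after a failure would show the counter moving from 0 to 1, which an increase-over-window query can detect.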
Reproduction Steps
Start up a Consul cluster with 3 agents
Watch the output of curl "http://localhost:8500/v1/agent/metrics?format=prometheus" every 1s
Kill one of the Consul agents
Observe that the consul.serf.member.failed metric has now appeared with a value of 1
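To automate the "observe" step above, a small helper (hypothetical, for illustration) can check each scrape of the metrics endpoint for the sample; here it is demonstrated against inline exposition text rather than a live agent, so it runs standalone:

```python
def metric_present(exposition, name):
    """True if a sample line for `name` (optionally with labels) exists
    in Prometheus text-format exposition output."""
    for line in exposition.splitlines():
        if line.startswith('#') or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        sample = line.split()[0]
        if sample == name or sample.startswith(name + '{'):
            return True
    return False

# Before the kill, the sample is simply absent from the scrape:
before = "# TYPE consul_serf_member_join counter\nconsul_serf_member_join 3\n"
# After the kill, it appears with a value of 1:
after = before + "consul_serf_member_failed 1\n"

print(metric_present(before, 'consul_serf_member_failed'))  # False
print(metric_present(after, 'consul_serf_member_failed'))   # True
```

In practice you would feed it the body of each curl response from http://localhost:8500/v1/agent/metrics?format=prometheus once per second.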
Consul info for both Client and Server
v1.4.4+
Hey @hartfordfive, thanks for spending the time to write out the feature request. We're aware of this behavior and recognize that it goes against Prometheus' best practices. Much of Consul's telemetry predates Prometheus' pull model and so we don't yet have a mechanism to initialize metrics (something you don't need to do with a wire protocol). We're actively working on this and you can expect it to be addressed in an upcoming release.