
Telemetry metrics should always be present and zeroed at startup #8590

Closed
hartfordfive opened this issue Aug 31, 2020 · 2 comments
Labels
theme/telemetry Anything related to telemetry or observability

Comments

@hartfordfive
Overview

My team and I have come across a Consul issue where agents were failing, which eventually caused other problems. In response, we've made changes to collect and alert, via Prometheus, on various Consul metrics. For failed agents, we can use the consul.serf.member.failed metric (consul_serf_member_failed in Prometheus) from the Cluster Health metrics. Unfortunately, this metric doesn't appear after Consul startup when observing http://localhost:8500/v1/agent/metrics?format=prometheus. It seems to appear only once its value is first incremented, and the same applies to the following metrics:

consul.serf.member.failed
consul.serf.member.flap
consul.serf.member.join
consul.serf.member.left

This is problematic for a few reasons, and it goes against best practices for exposing metrics to Prometheus (see here also for a good explanation). My main issue in this case is that I'd like to trigger an alert based on an increase of consul.serf.member.failed, similar to what I do with other metrics that Prometheus also types as counters.

For example, say we have the prometheus_retention_time flag set to 5m. An agent has been up for 5 minutes without any issues, and then the consul agent is unexpectedly killed for some reason. In this case, the metric goes from being non-existent to being present with a value of 1:

....
consul.serf.member.failed 1
....

If I wanted to detect a positive change in failed agents over the last 5 minutes, the increase function wouldn't work (note that increase requires a range vector):

increase(consul_serf_member_failed[5m]) >= 1

This is because there is no actual increase in the value: it starts off at 1, even though we implicitly know its appearance means an increase has occurred. (The following Prometheus GitHub issue describes this exact problem.) As an alternative, some might suggest a condition that simply checks whether the value is greater than 0:

consul_serf_member_failed > 0

although this is problematic when dealing with counters: an increase may have occurred at some point and the underlying issue may have been resolved since then. An alert with this condition would then keep firing until the counter is reset. This is especially problematic if prometheus_retention_time is set to a high value (e.g. 24h), since a single failed member in that period would cause an alert for the remainder of the day.
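For illustration, here is a sketch of the kind of Prometheus alerting rule I'd like to be able to write (rule names, labels, and thresholds are my own examples, not from Consul's docs). This only behaves correctly if the counter is exposed at zero from startup, which is what this issue requests:

```yaml
groups:
  - name: consul
    rules:
      - alert: ConsulSerfMemberFailed
        # Fires only on an actual transition, then clears once the
        # counter stops increasing -- unlike a `> 0` condition, which
        # keeps firing until the counter is reset.
        expr: increase(consul_serf_member_failed[5m]) >= 1
        labels:
          severity: warning
        annotations:
          summary: "A Consul serf member was marked as failed in the last 5m"
```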

As a solution, I propose that all available metrics, or at least those available in Prometheus format, be exposed at agent startup and initialized with a zero value.
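To make the proposal concrete, here is a minimal sketch (not Consul's actual implementation) of a registry that pre-registers a known set of counters at zero, so the Prometheus text exposition always includes each metric even before its first increment. The class and metric list are illustrative:

```python
class CounterRegistry:
    # Hypothetical list of known counters, modeled on the serf metrics
    # named in this issue.
    KNOWN_COUNTERS = [
        "consul_serf_member_failed",
        "consul_serf_member_flap",
        "consul_serf_member_join",
        "consul_serf_member_left",
    ]

    def __init__(self):
        # Zero-initialize every known counter at startup, per the proposal.
        self.counters = {name: 0 for name in self.KNOWN_COUNTERS}

    def incr(self, name, delta=1):
        self.counters[name] = self.counters.get(name, 0) + delta

    def render_prometheus(self):
        # Prometheus text exposition format: one "<name> <value>" per line.
        lines = []
        for name, value in sorted(self.counters.items()):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "\n".join(lines)


registry = CounterRegistry()
# Even before any member has failed, the metric is present with value 0,
# so increase() sees the 0 -> 1 transition later.
print("consul_serf_member_failed 0" in registry.render_prometheus())  # True
```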

Reproduction Steps

  1. Start up a consul cluster with 3 agents
  2. Poll the output of curl "http://localhost:8500/v1/agent/metrics?format=prometheus" every 1s
  3. Kill one of the consul agents
  4. Observe that the consul.serf.member.failed metric has now appeared with a value of 1

Consul info for both Client and Server

v1.4.4+

@dnephin dnephin added the theme/telemetry Anything related to telemetry or observability label Aug 31, 2020

mkcp commented Aug 31, 2020

Hey @hartfordfive, thanks for spending the time to write out the feature request. We're aware of this behavior and recognize that it goes against Prometheus' best practices. Much of Consul's telemetry predates Prometheus' pull model and so we don't yet have a mechanism to initialize metrics (something you don't need to do with a wire protocol). We're actively working on this and you can expect it to be addressed in an upcoming release.


mkcp commented Oct 20, 2020

Closing this out as a duplicate of #5140
