Prometheus quantile metrics NaN #4254

Closed
jameshartig opened this issue Jun 20, 2018 · 5 comments
Labels
theme/telemetry Anything related to telemetry or observability type/bug Feature does not function as expected

Comments

@jameshartig (Contributor)

Overview of the Issue

Using the new /agent/metrics?format=prometheus endpoint in 1.1.0, we're seeing some of the quantile metrics reporting a value of NaN:

consul_prepared_query_execute{quantile="0.5"} NaN
consul_prepared_query_execute{quantile="0.9"} NaN
consul_prepared_query_execute{quantile="0.99"} NaN
consul_prepared_query_execute_sum 927.0271777287126
consul_prepared_query_execute_count 4119

Reproduction Steps

  1. Create a cluster
  2. Create a prepared query
  3. Query prepared query multiple times
  4. curl 127.0.0.1:8500/agent/metrics?format=prometheus
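
For concreteness, here's a rough Go sketch of steps 2-4 using Consul's official api client; the query and service names are placeholders, and the metrics path includes the /v1 prefix per the correction in the follow-up comment below:

package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Step 2: create a prepared query against the local agent (127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	queryID, _, err := client.PreparedQuery().Create(&api.PreparedQueryDefinition{
		Name:    "my-query",                       // placeholder query name
		Service: api.ServiceQuery{Service: "web"}, // placeholder service
	}, nil)
	if err != nil {
		panic(err)
	}

	// Step 3: execute the prepared query a few times so the timer has samples.
	for i := 0; i < 10; i++ {
		if _, _, err := client.PreparedQuery().Execute(queryID, nil); err != nil {
			panic(err)
		}
	}

	// Step 4: fetch the metrics in Prometheus exposition format.
	resp, err := http.Get("http://127.0.0.1:8500/v1/agent/metrics?format=prometheus")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}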

Consul info for both Client and Server

Server info

agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 0
build:
prerelease =
revision = 5174058
version = 1.1.0
consul:
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 10.142.15.195:8300
server = true
raft:
applied_index = 1374721
commit_index = 1374721
fsm_pending = 0
last_contact = 8.040834ms
last_log_index = 1374721
last_log_term = 6
last_snapshot_index = 1368148
last_snapshot_term = 6
latest_configuration = [{Suffrage:Voter ID:58c87947-99ec-81b6-0f43-0acd0e801823 Address:10.142.15.195:8300} {Suffrage:Voter ID:22f22641-84b4-418a-5533-e714e29d724b Address:10.142.15.193:8300} {Suffrage:Voter ID:b5f83794-8f13-af87-2c8a-6f2c791f5a70 Address:10.142.0.45:8300}]
latest_configuration_index = 1
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 6
runtime:
arch = amd64
cpu_count = 1
goroutines = 87
max_procs = 1
os = linux
version = go1.10.2
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 6
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 32
members = 5
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 17
members = 3
query_queue = 0
query_time = 1

Operating system and Environment details

Linux, but this seems platform independent.

Log Fragments

I don't see any relevant logs for these metrics.

@jameshartig (Contributor, Author)

jameshartig commented Jul 12, 2018

Whoops, the last step should be:
curl 127.0.0.1:8500/v1/agent/metrics?format=prometheus

Also, in 1.2.0 I'm not seeing consul_prepared_query_execute anymore, but the same thing happens with:

consul_http_GET_v1_query__{quantile="0.5"} NaN
consul_http_GET_v1_query__{quantile="0.9"} NaN
consul_http_GET_v1_query__{quantile="0.99"} NaN
consul_http_GET_v1_query___sum 1.3336689472198486
consul_http_GET_v1_query___count 1

The actual values show up there for ~10 seconds and then switch to NaN.

My config has:

  "telemetry": {
    "disable_hostname": true,
    "prometheus_retention_time": "5m"
  }

@pearkes pearkes added the type/bug Feature does not function as expected label Jul 26, 2018
@rusbob

rusbob commented Aug 29, 2018

Yep, I have the same issue.

Consul Exposed Metrics in Prometheus format

$ curl http://localhost:8500/v1/agent/metrics?format=prometheus

# HELP consul_leader_barrier consul_leader_barrier
# TYPE consul_leader_barrier summary
consul_leader_barrier{quantile="0.5"} NaN
consul_leader_barrier{quantile="0.9"} NaN
consul_leader_barrier{quantile="0.99"} NaN
consul_leader_barrier_sum 331.68619396165013
consul_leader_barrier_count 3585
# HELP consul_leader_reconcile consul_leader_reconcile
# TYPE consul_leader_reconcile summary
consul_leader_reconcile{quantile="0.5"} NaN
consul_leader_reconcile{quantile="0.9"} NaN
consul_leader_reconcile{quantile="0.99"} NaN
consul_leader_reconcile_sum 400.63788282871246
consul_leader_reconcile_count 3585
# HELP consul_leader_reconcileMember consul_leader_reconcileMember
# TYPE consul_leader_reconcileMember summary
consul_leader_reconcileMember{quantile="0.5"} NaN
consul_leader_reconcileMember{quantile="0.9"} NaN
consul_leader_reconcileMember{quantile="0.99"} NaN
consul_leader_reconcileMember_sum 235.5608109459281
consul_leader_reconcileMember_count 3585
# HELP consul_runtime_gc_pause_ns consul_runtime_gc_pause_ns
# TYPE consul_runtime_gc_pause_ns summary
consul_runtime_gc_pause_ns{quantile="0.5"} NaN
consul_runtime_gc_pause_ns{quantile="0.9"} NaN
consul_runtime_gc_pause_ns{quantile="0.99"} NaN
consul_runtime_gc_pause_ns_sum 4.27151397e+08
consul_runtime_gc_pause_ns_count 3410

System Information

$ curl http://localhost:8500/v1/agent/self
{
    "Config": {
        "Datacenter": "eu-central-1a",
        "NodeName": "consul-server",
        "NodeID": "abc",
        "Revision": "e716d1b5f",
        "Server": true,
        "Version": "1.2.2"
    },
    ...
}

Consul runs as a Docker container, using the consul:1.2.2 image.

Side Effect

The Consul target shows as down in the Prometheus UI:

Get http://localhost:8500/v1/agent/metrics?format=prometheus: dial tcp 127.0.0.1:8500: connect: connection refused

@ncabatoff

This appears to be the normal behaviour for summaries in Prometheus: the sum and count persist indefinitely, but the quantiles expire and afterwards only report NaN. The default MaxAge in the Prometheus client library is 10m, but for reasons not clear to me, the go-metrics library used in Consul (which wraps the Prometheus client library) sets it to 10s:

https://github.com/armon/go-metrics/blob/f0300d1749da6fa982027e449ec0c7a145510c3c/prometheus/prometheus.go#L160

This means that you'd better use a scrape interval of 10s or less if you want to be able to capture quantile timings.
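
As a rough illustration of that expiry behaviour (a minimal sketch against the standard prometheus/client_golang API; the metric name, objectives, and values are made up, not anything Consul uses), a summary's quantiles revert to NaN once MaxAge elapses without new observations, while the sum and count are retained:

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	// A summary whose quantile window expires after 10s, mirroring the
	// go-metrics default linked above. Name and objectives are illustrative.
	s := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "example_request_duration_seconds",
		Help:       "illustrative summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		MaxAge:     10 * time.Second,
	})
	s.Observe(1.33)

	dump := func(label string) {
		var m dto.Metric
		if err := s.Write(&m); err != nil { // snapshot the summary's current state
			panic(err)
		}
		fmt.Println(label)
		for _, q := range m.GetSummary().GetQuantile() {
			fmt.Printf("  quantile=%g value=%g\n", q.GetQuantile(), q.GetValue())
		}
		fmt.Printf("  sum=%g count=%d\n", m.GetSummary().GetSampleSum(), m.GetSummary().GetSampleCount())
	}

	dump("just after observing:") // quantiles hold real values
	time.Sleep(11 * time.Second)
	dump("after MaxAge with no new observations:") // quantiles report NaN; sum/count unchanged
}

That matches what the thread reports: _sum and _count keep accumulating, but each quantile series goes to NaN whenever no samples landed in the most recent window.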

@jsosulska jsosulska added the theme/telemetry Anything related to telemetry or observability label Jul 21, 2020
@mkcp (Contributor)

mkcp commented Oct 20, 2020

Closing this issue out as it's expected behavior.

@mkcp mkcp closed this as completed Oct 20, 2020
@WojciechKuk

This means that you'd better use a scrape interval of 10s or less if you want to be able to capture quantile timings.

But that's not a solution: there will still be "holes" in the data, and most of the requests Prometheus performs will be useless. 10s is generally too little time to collect enough data to calculate any quantiles.

If someone needs to find the p99, there should be at least 100 requests in the window (in this case that means > 600 rpm per instance).
