Prometheus quantile metrics NaN #4254

Closed
jameshartig opened this issue Jun 20, 2018 · 5 comments
Labels
theme/telemetry Anything related to telemetry or observability type/bug Feature does not function as expected

Comments

@jameshartig (Contributor)

Overview of the Issue

Using the new /agent/metrics?format=prometheus endpoint in 1.1.0, we're seeing some of the quantile metrics reporting a value of NaN:

consul_prepared_query_execute{quantile="0.5"} NaN
consul_prepared_query_execute{quantile="0.9"} NaN
consul_prepared_query_execute{quantile="0.99"} NaN
consul_prepared_query_execute_sum 927.0271777287126
consul_prepared_query_execute_count 4119

Reproduction Steps

  1. Create a cluster
  2. Create a prepared query
  3. Query prepared query multiple times
  4. curl 127.0.0.1:8500/agent/metrics?format=prometheus
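
For concreteness, here's a rough Go sketch of steps 2-4 using Consul's official api client; the query and service names are placeholders, and the metrics path includes the /v1 prefix per the correction in the follow-up comment below:

package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Step 2: create a prepared query against the local agent (127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	queryID, _, err := client.PreparedQuery().Create(&api.PreparedQueryDefinition{
		Name:    "my-query",                       // placeholder query name
		Service: api.ServiceQuery{Service: "web"}, // placeholder service
	}, nil)
	if err != nil {
		panic(err)
	}

	// Step 3: execute the prepared query a few times so the timer has samples.
	for i := 0; i < 10; i++ {
		if _, _, err := client.PreparedQuery().Execute(queryID, nil); err != nil {
			panic(err)
		}
	}

	// Step 4: fetch the metrics in Prometheus exposition format.
	resp, err := http.Get("http://127.0.0.1:8500/v1/agent/metrics?format=prometheus")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}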

Consul info for both Client and Server

Server info

agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 0
build:
prerelease =
revision = 5174058
version = 1.1.0
consul:
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 10.142.15.195:8300
server = true
raft:
applied_index = 1374721
commit_index = 1374721
fsm_pending = 0
last_contact = 8.040834ms
last_log_index = 1374721
last_log_term = 6
last_snapshot_index = 1368148
last_snapshot_term = 6
latest_configuration = [{Suffrage:Voter ID:58c87947-99ec-81b6-0f43-0acd0e801823 Address:10.142.15.195:8300} {Suffrage:Voter ID:22f22641-84b4-418a-5533-e714e29d724b Address:10.142.15.193:8300} {Suffrage:Voter ID:b5f83794-8f13-af87-2c8a-6f2c791f5a70 Address:10.142.0.45:8300}]
latest_configuration_index = 1
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 6
runtime:
arch = amd64
cpu_count = 1
goroutines = 87
max_procs = 1
os = linux
version = go1.10.2
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 6
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 32
members = 5
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 17
members = 3
query_queue = 0
query_time = 1

Operating system and Environment details

Linux, but this seems platform independent.

Log Fragments

I don't see any relevant logs for these metrics.

@jameshartig (Contributor, Author)

jameshartig commented Jul 12, 2018

Whoops, the last step should be:
curl 127.0.0.1:8500/v1/agent/metrics?format=prometheus

Also, in 1.2.0 I'm not seeing consul_prepared_query_execute anymore, but the same thing happens with:

consul_http_GET_v1_query__{quantile="0.5"} NaN
consul_http_GET_v1_query__{quantile="0.9"} NaN
consul_http_GET_v1_query__{quantile="0.99"} NaN
consul_http_GET_v1_query___sum 1.3336689472198486
consul_http_GET_v1_query___count 1

The actual values show up there for ~10 seconds and then switch to NaN.

My config has:

  "telemetry": {
    "disable_hostname": true,
    "prometheus_retention_time": "5m"
  }

@pearkes pearkes added the type/bug Feature does not function as expected label Jul 26, 2018
@rusbob

rusbob commented Aug 29, 2018

Yep, I have the same issue.

Consul Exposed Metrics in Prometheus format

$ curl http://localhost:8500/v1/agent/metrics?format=prometheus

# HELP consul_leader_barrier consul_leader_barrier
# TYPE consul_leader_barrier summary
consul_leader_barrier{quantile="0.5"} NaN
consul_leader_barrier{quantile="0.9"} NaN
consul_leader_barrier{quantile="0.99"} NaN
consul_leader_barrier_sum 331.68619396165013
consul_leader_barrier_count 3585
# HELP consul_leader_reconcile consul_leader_reconcile
# TYPE consul_leader_reconcile summary
consul_leader_reconcile{quantile="0.5"} NaN
consul_leader_reconcile{quantile="0.9"} NaN
consul_leader_reconcile{quantile="0.99"} NaN
consul_leader_reconcile_sum 400.63788282871246
consul_leader_reconcile_count 3585
# HELP consul_leader_reconcileMember consul_leader_reconcileMember
# TYPE consul_leader_reconcileMember summary
consul_leader_reconcileMember{quantile="0.5"} NaN
consul_leader_reconcileMember{quantile="0.9"} NaN
consul_leader_reconcileMember{quantile="0.99"} NaN
consul_leader_reconcileMember_sum 235.5608109459281
consul_leader_reconcileMember_count 3585
# HELP consul_runtime_gc_pause_ns consul_runtime_gc_pause_ns
# TYPE consul_runtime_gc_pause_ns summary
consul_runtime_gc_pause_ns{quantile="0.5"} NaN
consul_runtime_gc_pause_ns{quantile="0.9"} NaN
consul_runtime_gc_pause_ns{quantile="0.99"} NaN
consul_runtime_gc_pause_ns_sum 4.27151397e+08
consul_runtime_gc_pause_ns_count 3410

System Information

$ curl http://localhost:8500/v1/agent/self
{
    "Config": {
        "Datacenter": "eu-central-1a",
        "NodeName": "consul-server",
        "NodeID": "abc",
        "Revision": "e716d1b5f",
        "Server": true,
        "Version": "1.2.2"
    },
    ...
}

Consul runs as a Docker container, using the consul:1.2.2 image.

Side Effect

The Consul target shows as down in the Prometheus UI:

Get http://localhost:8500/v1/agent/metrics?format=prometheus: dial tcp 127.0.0.1:8500: connect: connection refused

@ncabatoff

This appears to be the normal behaviour for summaries in Prometheus: the sum and count persist indefinitely, but the quantiles expire and afterwards only report NaN. The default MaxAge in the Prometheus client library is 10m, but for reasons not clear to me, the go-metrics library used in Consul (which wraps the Prometheus client library) sets it to 10s:

https://github.com/armon/go-metrics/blob/f0300d1749da6fa982027e449ec0c7a145510c3c/prometheus/prometheus.go#L160

This means that you'd better use a scrape interval of 10s or less if you want to be able to capture quantile timings.
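
As a rough illustration of that expiry behaviour (a minimal sketch against the standard prometheus/client_golang API; the metric name, objectives, and values are made up, not anything Consul uses), a summary's quantiles revert to NaN once MaxAge elapses without new observations, while the sum and count are retained:

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	// A summary whose quantile window expires after 10s, mirroring the
	// go-metrics default linked above. Name and objectives are illustrative.
	s := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "example_request_duration_seconds",
		Help:       "illustrative summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		MaxAge:     10 * time.Second,
	})
	s.Observe(1.33)

	dump := func(label string) {
		var m dto.Metric
		if err := s.Write(&m); err != nil { // snapshot the summary's current state
			panic(err)
		}
		fmt.Println(label)
		for _, q := range m.GetSummary().GetQuantile() {
			fmt.Printf("  quantile=%g value=%g\n", q.GetQuantile(), q.GetValue())
		}
		fmt.Printf("  sum=%g count=%d\n", m.GetSummary().GetSampleSum(), m.GetSummary().GetSampleCount())
	}

	dump("just after observing:") // quantiles hold real values
	time.Sleep(11 * time.Second)
	dump("after MaxAge with no new observations:") // quantiles report NaN; sum/count unchanged
}

That matches what the thread reports: _sum and _count keep accumulating, but each quantile series goes to NaN whenever no samples landed in the most recent window.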

@jsosulska jsosulska added the theme/telemetry Anything related to telemetry or observability label Jul 21, 2020
@mkcp (Contributor)

mkcp commented Oct 20, 2020

Closing this issue out as it's expected behavior.

@mkcp mkcp closed this as completed Oct 20, 2020
@WojciechKuk

This means that you'd better use a scrape interval of 10s or less if you want to be able to capture quantile timings.

But that's not a solution: there will still be "holes" in the data, and most of the requests Prometheus performs will be useless. 10s is generally too little time to collect enough data to calculate any quantiles.

If someone needs to find the p99, there should be at least 100 requests in the window (in this case that means > 600 rpm per instance).
