Prometheus output reports Error: collected metric has label dimensions inconsistent with previously collected metrics in the same metric family #2822

Closed
sbadia opened this issue May 17, 2017 · 8 comments
Labels: bug (unexpected problem or unintended behavior), regression (something that used to work, but is now broken)

@sbadia

sbadia commented May 17, 2017

Bug report

Hello,

We have a regression between telegraf 1.2.1 and 1.3.0 (with the same configuration).

Relevant telegraf.conf:

[[outputs.prometheus_client]]
  listen = ":9126"
[agent]
  interval = "120s"
  debug = false
[[inputs.ntpq]]
  dns_lookup = false

System info:

  • Telegraf v1.3.0 (git: release-1.3 2bc5594b44145368823d7aa78bfb753ab51e9235)
  • Ubuntu 16.04.2 LTS

Steps to reproduce:

  1. Install telegraf 1.3.0, with the configuration above
  2. Run curl http://localhost:9126/metrics five or more times

Expected behavior:

Telegraf should expose the collected metrics through the /metrics endpoint.

Actual behavior:

Telegraf fails to display any metrics and instead returns the error message shown under "Additional info".

Additional info:

curl http://localhost:9126/metrics
An error has occurred during metrics collection:

5 error(s) occurred:
* collected metric ntpq_jitter label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_delay label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_poll label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:64 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_offset label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric ntpq_reach label:<name:"host" value:"node-0148" > label:<name:"refid" value:".LOCL." > label:<name:"remote" value:"127.127.1.0" > label:<name:"stratum" value:"10" > label:<name:"type" value:"l" > untyped:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family

Use case:

  • This is a regression between telegraf 1.2.1 and 1.3.0

Thanks in advance!

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label May 17, 2017
@danielnelson danielnelson added this to the 1.3.1 milestone May 17, 2017
@danielnelson
Contributor

Seems to be caused by the version change in github.com/prometheus/client_golang

@danielnelson
Contributor

This happens because the ntpq input generates points whose set of tag keys varies from point to point; in particular, the state_prefix tag key is not always present:

ntpq,refid=.POOL.,remote=3.debian.pool.n,stratum=16,type=p delay=0,jitter=0,offset=0,poll=64i,reach=0i 1495059325000000000
ntpq,refid=204.9.54.119,remote=209.242.224.117,state_prefix=-,stratum=2,type=u delay=66.056,jitter=0.681,offset=2.246,poll=1024i,reach=37i,when=298i 1495059325000000000
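
The difference between the two points can be checked mechanically. Below is a minimal Python sketch, not Telegraf code: the helper names `tag_keys` and `differing_keys` are hypothetical, and the parser assumes the tag section contains no escaped commas or spaces, which holds for the two sample lines above.

```python
def tag_keys(line):
    """Return the set of tag keys of one line-protocol point.

    Assumption: no escaped commas or spaces in the measurement/tag section.
    """
    head = line.split(" ", 1)[0]      # "measurement,tag=value,..."
    parts = head.split(",")[1:]       # drop the measurement name
    return {p.split("=", 1)[0] for p in parts}


def differing_keys(lines):
    """Return the tag keys that are not present on every point."""
    key_sets = [tag_keys(line) for line in lines]
    return set.union(*key_sets) - set.intersection(*key_sets)


# The two sample points from this comment.
POINTS = [
    "ntpq,refid=.POOL.,remote=3.debian.pool.n,stratum=16,type=p "
    "delay=0,jitter=0,offset=0,poll=64i,reach=0i 1495059325000000000",
    "ntpq,refid=204.9.54.119,remote=209.242.224.117,state_prefix=-,stratum=2,type=u "
    "delay=66.056,jitter=0.681,offset=2.246,poll=1024i,reach=37i,when=298i "
    "1495059325000000000",
]

print(differing_keys(POINTS))  # {'state_prefix'}
```

Any non-empty result means the points would map to Prometheus metrics with differing label names inside one metric family, which is the condition the error message complains about.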

This can be verified by excluding the tag in the output configuration:

[[outputs.prometheus_client]]
  tagexclude = ["state_prefix"]
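
To see why the exclusion makes the error go away, here is a small Python sketch of its effect on the two sample points. It is only an illustration under stated assumptions: `parse_tags`, `exclude_tags`, and `consistent` are hypothetical helpers, the parser ignores line-protocol escaping, and the real filtering is done by Telegraf itself, not by code like this.

```python
def parse_tags(line):
    """Parse the tag section of a line-protocol point into a dict.

    Assumption: no escaped commas or spaces in the measurement/tag section.
    """
    head = line.split(" ", 1)[0]
    return dict(p.split("=", 1) for p in head.split(",")[1:])


def exclude_tags(tags, excluded):
    """Drop the named tag keys, mimicking the intent of tagexclude."""
    return {k: v for k, v in tags.items() if k not in excluded}


def consistent(tag_dicts):
    """True if every point carries exactly the same set of tag keys."""
    return len({frozenset(t) for t in tag_dicts}) <= 1


POINTS = [
    "ntpq,refid=.POOL.,remote=3.debian.pool.n,stratum=16,type=p "
    "delay=0,jitter=0,offset=0,poll=64i,reach=0i 1495059325000000000",
    "ntpq,refid=204.9.54.119,remote=209.242.224.117,state_prefix=-,stratum=2,type=u "
    "delay=66.056,jitter=0.681,offset=2.246,poll=1024i,reach=37i,when=298i "
    "1495059325000000000",
]

before = [parse_tags(p) for p in POINTS]
after = [exclude_tags(t, {"state_prefix"}) for t in before]

print(consistent(before), consistent(after))  # False True
```

Once state_prefix is removed, both points expose the same label names (refid, remote, stratum, type), so they can share one metric family.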

@danielnelson danielnelson changed the title from "Error: collected metric has label dimensions inconsistent with previously collected metrics in the same metric family" to "Prometheus output reports Error: collected metric has label dimensions inconsistent with previously collected metrics in the same metric family" May 25, 2017
@danielnelson danielnelson added the regression something that used to work, but is now broken label May 25, 2017
@danielnelson danielnelson modified the milestones: 1.3.2, 1.3.1 May 31, 2017
@freeseacher
Contributor

@danielnelson danielnelson modified the milestone: 1.3.2, 1.3.1 12 hours ago
:(((

@danielnelson
Contributor

@freeseacher Please take a look at #2857 and comment if that fix will work for you.

@freeseacher
Contributor

@danielnelson, yep, that fixes the issue for me.
Telegraf v26055d5 (git: fix-prometheus-output-labels 26055d5)
has been running for about an hour on ~40 servers without that bug.

@freeseacher
Contributor

@danielnelson, any updates?

@danielnelson
Contributor

I'm still getting reports that the fix is not sufficient; I'm trying to get an improved version out today.

@danielnelson
Contributor

Merged the fix; targeted for 1.3.2.
