Prometheus metric data job label name conflict #5345

Closed
chenjpu opened this issue Feb 21, 2019 · 16 comments · Fixed by #5850

Comments

@chenjpu

chenjpu commented Feb 21, 2019

Nomad version

Nomad v0.9.0-beta2

Operating system and Environment details

CentOS Linux release 7.6.1810 (Core)

Issue

Prometheus metric data job label name conflict.
The prometheus server has a default job label

Nomad Server logs (if appropriate)

nomad: 2019-02-17T03:09:15.387+0800 [INFO ] http.prometheus_handler: error gathering metrics: 35 error(s) occurred:
nomad: * collected metric nomad_nomad_job_summary_queued label:<name:"job" value:"security:1.4.1-RC2" > label:<name:"task_group" value:"security" > gauge:<value:1 > has label dimensions inconsistent with previously collected metrics in the same metric family
nomad: * collected metric nomad_nomad_job_summary_queued label:<name:"job" value:"loki" > label:<name:"task_group" value:"loki" > gauge:<value:0 > has label dimensions inconsistent with previously collected metrics in the same metric family

@endocrimes
Contributor

@chenjpu 👋 - Is the default job label one that is a Prometheus default or is it one you've added in your configuration?

@chenjpu
Author

chenjpu commented Feb 21, 2019

It's the Prometheus default.

@endocrimes
Contributor

Interesting - I thought Prometheus namespaced all of its default labels?

AFAIK you can use relabel_configs to rename collected metrics, though? (My Prometheus knowledge is pretty high-level.) https://github.com/prometheus/prometheus/blob/c7d83b2b6a08048e1bfa046f9fd63125ae327e02/config/testdata/conf.good.yml#L56-L60
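
For anyone following along, here is a minimal sketch of that relabel approach (the scrape job name, agent address, and the nomad_job label name are illustrative, not prescriptive):

# prometheus.yml (sketch): scrape Nomad and move its "job" label out of the way
scrape_configs:
  - job_name: 'nomad'
    metrics_path: '/v1/metrics'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['localhost:4646']   # assumed agent address
    metric_relabel_configs:
      # With honor_labels left at its default (false), the job label reported
      # by Nomad reaches relabeling as "exported_job"; copy it into a
      # non-conflicting label.
      - source_labels: [exported_job]
        target_label: nomad_job

Note that this only changes how the labels land in Prometheus; it does not affect the "error gathering metrics" message logged by Nomad itself.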

@chenjpu
Author

chenjpu commented Feb 22, 2019

I have set the honor_labels parameter, but the error log still appears:
has label dimensions inconsistent with previously collected metrics in the same metric family

I found that other projects have had similar problems (prometheus/influxdb_exporter#23).

Besides, the above error is not present on Nomad 0.8.7 :)
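
For reference, honor_labels is a per-scrape-job setting that only controls how Prometheus resolves the clash between its own "job" target label and the "job" label in the scraped data; it cannot suppress the "error gathering metrics" message, which is produced inside Nomad's own /v1/metrics handler while it gathers the metrics. A sketch of where it goes, reusing the assumed scrape job from the earlier example:

scrape_configs:
  - job_name: 'nomad'
    honor_labels: true   # keep Nomad's "job" value instead of renaming it to exported_job
    # metrics_path, params, and targets as in the earlier sketch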

@chenjpu
Author

chenjpu commented Feb 22, 2019

client_golang (0.9.0 / 2018-10-15) mentions that inconsistent label dimensions are now allowed.

@perrymanuk

I ran into this as well, but just relabeled the job label coming from Nomad to job_name.
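
One way to express that rename in the scrape config, assuming honor_labels is left at its default so the Nomad-reported label arrives as exported_job:

metric_relabel_configs:
  - source_labels: [exported_job]
    target_label: job_name   # label name used in this comment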

@chenjpu
Author

chenjpu commented Mar 25, 2019

I did a simple test: after I upgraded client_golang (0.9.0 / 2018-10-15), the problem was solved.

@peimanja

We are running Nomad 0.9.1 and still see this issue. Nomad logs are flooded with a similar error.

@braxton9460

I just upgraded to Nomad 0.9.1 today from 0.8.4 and found that I only get this error in the environment where we run periodic/batch jobs. In our other environments, where we only have service-type jobs, we do not encounter this error or the resulting problems with Prometheus metrics collection.
I did not get this error before the upgrade.
I am happy to provide logs or more information if it would be useful.

@awkaplan

awkaplan commented May 20, 2019

I'm also observing this issue when upgrading 0.8.3 -> 0.9.1. Some additional details:

  • This only appears to affect the nomad_nomad_job_summary_* metrics.
  • Temporarily setting the prometheus_metrics configuration to false does not resolve the issue.

@awkaplan

An update on this issue:
If left running with prometheus_metrics = true, the cluster leader will eventually kill any running allocations on the cluster. Disabling prometheus_metrics and restarting all masters causes the allocations to restart and the jobs to recover.

@bewiwi

bewiwi commented May 28, 2019

Hi,
I think the problem is in the method iterateJobSummaryMetrics()
https://github.com/hashicorp/nomad/blob/master/nomad/leader.go#L648

Depending on the task type, different labels are injected, but the Prometheus library does not seem to accept this: all metrics with the same name should have the same label set.

As an example, the labels for a service task are:

label:<name:"alloc_id" value:"24439375-f7cd-0207-4c2c-492bbc9b5ee4" >
label:<name:"job" value:"top_secret" >
label:<name:"task" value:"worker" >
label:<name:"task_group" value:"worker" >

And the sync task labels are:

label:<name:"parent_id" value:"top_secret_sync" >
label:<name:"periodic_id" value:"1559057100" >
label:<name:"task" value:"logger" >
label:<name:"task_group" value:"sync" >

For me, two solutions are possible:

  • add all labels to all metrics, plus a new "job_type" label (with values such as periodic or sync), or
  • add a suffix/prefix to the metric name depending on the job type

What do you think?

@henyxia

henyxia commented Jun 4, 2019

Up! Will we see this fix in the next release? :)

@stremovsky

Hello,

The problem is still present. When pushing data to the Prometheus Pushgateway, the job label gets rewritten.

Only renaming "job" to "job_name" helps us.
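
For the Pushgateway path specifically, the commonly documented pattern is to scrape it with honor_labels: true so that the job/instance labels of the pushed metrics are preserved rather than rewritten; a sketch, with an assumed Pushgateway address:

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep the "job" label of the pushed metrics
    static_configs:
      - targets: ['pushgateway:9091']   # assumed address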

@tgross
Member

tgross commented Nov 7, 2019

Hi @stremovsky. Sorry to hear that. You're on a version of Nomad that's 0.9.5 or later?

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022