
An error has occurred while serving metrics #3

Open

Atisom opened this issue Nov 17, 2024 · 2 comments

Dear Team,

We use cgroup_exporter on several compute nodes, but sometimes we get this error message:

# curl localhost:9306/metrics
An error has occurred while serving metrics:

91 error(s) occurred:
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_user_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:395.08}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_system_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:16.58}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_total_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:411.858161192}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpus" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:64}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_rss_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_cache_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
...
# /opt/jobstats/cgroup_exporter --collect.fullslurm --config.paths /slurm
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:431 level=info msg="Starting cgroup_exporter" version="(version=, branch=, revision=64248e974a586d6fa75e0d1efc9e90c1b06785b8-modified)"
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:432 level=info msg="Build context" build_context="(go=go1.20.6, platform=linux/amd64, user=, date=, tags=unknown)"
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:433 level=info msg="Starting Server" address=:9306
# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 (Ootpa)
# slurmd -V
slurm 22.05.7
# systemd-cgtop
Control Group                                                                                                                  Tasks   %CPU   Memory  Input/s Output/s
/                                                                                                                               2019 3265.5   102.9G        -        -
/slurm                                                                                                                             - 3176.5    91.2G        -        -
/slurm/uid_13***                                                                                                                   - 3176.5    71.4G        -        -
/slurm/uid_13***/job_87*****                                                                                                       - 3176.6    71.4G        -        -
/slurm/uid_13***                                                                                                                   -      -     6.2G        -        -
/slurm/uid_13***/job_87*****                                                                                                       -      -   103.5M        -        -
# ls /sys/fs/cgroup/cpuacct/slurm
cgroup.clone_children  cpuacct.stat   cpuacct.usage_all     cpuacct.usage_percpu_sys   cpuacct.usage_sys   cpu.cfs_period_us  cpu.rt_period_us   cpu.shares  notify_on_release  uid_13***
cgroup.procs           cpuacct.usage  cpuacct.usage_percpu  cpuacct.usage_percpu_user  cpuacct.usage_user  cpu.cfs_quota_us   cpu.rt_runtime_us  cpu.stat    tasks              uid_13***

I tried restarting the cgroup_exporter and slurmd services, but that didn't solve the problem. After I rebooted the whole compute node, the issue was resolved. Do you have any idea what might be causing this?
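For context on the error text itself: the "was collected before with the same name and label values" message comes from the Prometheus client registry, which refuses to serve /metrics when a collector emits two samples with an identical name and label set. The sketch below is not the exporter's actual code; it is a minimal, hypothetical collector (listening on :9307, metric name borrowed from the report) that emits cgroup_uid twice with an empty jobid label, which reproduces the same "An error has occurred while serving metrics" response.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dupCollector deliberately emits the same metric twice with identical
// label values, mimicking two cgroup paths collapsing onto jobid="".
type dupCollector struct {
	desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ {
		// Both samples carry jobid="", so the registry flags a duplicate.
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{
		desc: prometheus.NewDesc("cgroup_uid", "sketch of a duplicate cgroup_uid sample", []string{"jobid"}, nil),
	})
	// Default HandlerOpts return HTTP 500 with the duplicate-metric errors,
	// matching what curl shows in the report above.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Println("listening on :9307")
	log.Fatal(http.ListenAndServe(":9307", nil))
}

This suggests the empty jobid labels in the report are the symptom to chase: something is producing more than one sample with jobid="".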

Atisom (Author) commented Nov 17, 2024

When I remove the '--collect.fullslurm' flag, it works again. Maybe it cannot measure some kinds of jobs?

plazonic (Owner) commented

Hello,

If it happens again, can you please try two things: the first is to look at the contents of the /slurm cgroup dirs (maybe find /sys/fs/cgroup/*/slurm -type d), and the other is to restart cgroup_exporter with --log.level=debug.

I suspect that there is something under /slurm that the regexp:

^/slurm/uid_([0-9]+)/job_([0-9]+)(/step_([^/]+)(/task_([[0-9]+))?)?$

is not matching, and we should figure out what that is and fix it (or skip it if it is not useful).

Thanks
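A quick way to see which cgroup directories fall through that pattern is to test candidate paths against it directly. The sketch below compiles the regexp exactly as quoted above and checks a few hypothetical paths (not taken from this report); any path that does not match is the kind of entry that could leave the jobid label empty and should show up in the debug log.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Pattern copied verbatim from the comment above.
	re := regexp.MustCompile(`^/slurm/uid_([0-9]+)/job_([0-9]+)(/step_([^/]+)(/task_([[0-9]+))?)?$`)

	// Hypothetical sample paths for illustration only.
	paths := []string{
		"/slurm/uid_1300/job_87000",            // matches: jobid captured
		"/slurm/uid_1300/job_87000/step_batch", // matches: step captured too
		"/slurm/uid_1300",                      // no match: no job_ component
		"/slurm/system",                        // no match: not a uid_ directory
	}

	for _, p := range paths {
		if m := re.FindStringSubmatch(p); m != nil {
			// Capture group 2 is the jobid, group 4 the step name.
			fmt.Printf("%-42s jobid=%q step=%q\n", p, m[2], m[4])
		} else {
			fmt.Printf("%-42s no match (jobid would stay empty)\n", p)
		}
	}
}

Running the same check against the real output of find /sys/fs/cgroup/*/slurm -type d on an affected node would show exactly which directories the exporter cannot classify.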
