
An error has occurred while serving metrics #3

Open

Atisom opened this issue Nov 17, 2024 · 2 comments

Dear Team,

We use cgroup_exporter on several compute nodes, but sometimes we get this error message:

# curl localhost:9306/metrics
An error has occurred while serving metrics:

91 error(s) occurred:
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_user_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:395.08}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_system_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:16.58}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_total_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:411.858161192}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpus" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:64}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_rss_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_cache_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
...
# /opt/jobstats/cgroup_exporter --collect.fullslurm --config.paths /slurm
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:431 level=info msg="Starting cgroup_exporter" version="(version=, branch=, revision=64248e974a586d6fa75e0d1efc9e90c1b06785b8-modified)"
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:432 level=info msg="Build context" build_context="(go=go1.20.6, platform=linux/amd64, user=, date=, tags=unknown)"
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:433 level=info msg="Starting Server" address=:9306
# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 (Ootpa)
# slurmd -V
slurm 22.05.7
# systemd-cgtop
Control Group                                                                                                                  Tasks   %CPU   Memory  Input/s Output/s
/                                                                                                                               2019 3265.5   102.9G        -        -
/slurm                                                                                                                             - 3176.5    91.2G        -        -
/slurm/uid_13***                                                                                                                   - 3176.5    71.4G        -        -
/slurm/uid_13***/job_87*****                                                                                                       - 3176.6    71.4G        -        -
/slurm/uid_13***                                                                                                                   -      -     6.2G        -        -
/slurm/uid_13***/job_87*****                                                                                                       -      -   103.5M        -        -
# ls /sys/fs/cgroup/cpuacct/slurm
cgroup.clone_children  cpuacct.stat   cpuacct.usage_all     cpuacct.usage_percpu_sys   cpuacct.usage_sys   cpu.cfs_period_us  cpu.rt_period_us   cpu.shares  notify_on_release  uid_13***
cgroup.procs           cpuacct.usage  cpuacct.usage_percpu  cpuacct.usage_percpu_user  cpuacct.usage_user  cpu.cfs_quota_us   cpu.rt_runtime_us  cpu.stat    tasks              uid_13***

I tried restarting the cgroup_exporter and slurmd services, but that didn't solve the problem. After I rebooted the whole compute node, the issue was resolved. Do you have any idea what might be causing this?
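For context on the error text itself: the "was collected before with the same name and label values" message comes from the Prometheus client registry, which refuses to serve /metrics when a collector emits two samples with an identical name and label set. The sketch below is not the exporter's actual code; it is a minimal, hypothetical collector (listening on :9307, metric name borrowed from the report) that emits cgroup_uid twice with an empty jobid label, which reproduces the same "An error has occurred while serving metrics" response.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dupCollector deliberately emits the same metric twice with identical
// label values, mimicking two cgroup paths collapsing onto jobid="".
type dupCollector struct {
	desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ {
		// Both samples carry jobid="", so the registry flags a duplicate.
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{
		desc: prometheus.NewDesc("cgroup_uid", "sketch of a duplicate cgroup_uid sample", []string{"jobid"}, nil),
	})
	// Default HandlerOpts return HTTP 500 with the duplicate-metric errors,
	// matching what curl shows in the report above.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Println("listening on :9307")
	log.Fatal(http.ListenAndServe(":9307", nil))
}

This suggests the empty jobid labels in the report are the symptom to chase: something is producing more than one sample with jobid="".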

Atisom (Author) commented Nov 17, 2024

When I remove the '--collect.fullslurm' flag, it works again. Maybe it cannot measure some kinds of jobs?

plazonic (Owner) commented

Hello,

If it happens again, can you please try two things: the first is to look at the contents of the /slurm cgroup dirs (maybe find /sys/fs/cgroup/*/slurm -type d), and the other is to restart cgroup_exporter with --log.level=debug.

I suspect that there is something under /slurm that the regexp:

^/slurm/uid_([0-9]+)/job_([0-9]+)(/step_([^/]+)(/task_([[0-9]+))?)?$

is not matching, and we should figure out what that is and fix it (or skip it if it is not useful).

Thanks
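A quick way to see which cgroup directories fall through that pattern is to test candidate paths against it directly. The sketch below compiles the regexp exactly as quoted above and checks a few hypothetical paths (not taken from this report); any path that does not match is the kind of entry that could leave the jobid label empty and should show up in the debug log.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Pattern copied verbatim from the comment above.
	re := regexp.MustCompile(`^/slurm/uid_([0-9]+)/job_([0-9]+)(/step_([^/]+)(/task_([[0-9]+))?)?$`)

	// Hypothetical sample paths for illustration only.
	paths := []string{
		"/slurm/uid_1300/job_87000",            // matches: jobid captured
		"/slurm/uid_1300/job_87000/step_batch", // matches: step captured too
		"/slurm/uid_1300",                      // no match: no job_ component
		"/slurm/system",                        // no match: not a uid_ directory
	}

	for _, p := range paths {
		if m := re.FindStringSubmatch(p); m != nil {
			// Capture group 2 is the jobid, group 4 the step name.
			fmt.Printf("%-42s jobid=%q step=%q\n", p, m[2], m[4])
		} else {
			fmt.Printf("%-42s no match (jobid would stay empty)\n", p)
		}
	}
}

Running the same check against the real output of find /sys/fs/cgroup/*/slurm -type d on an affected node would show exactly which directories the exporter cannot classify.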
