Dear Team,
We use cgroup_exporter on several compute nodes, but sometimes we get this error message:
# curl localhost:9306/metrics
An error has occurred while serving metrics:
91 error(s) occurred:
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_user_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:395.08}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_system_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:16.58}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_total_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:411.858161192}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpus" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:64}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_rss_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_cache_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
...
I tried restarting the cgroup_exporter and slurmd services, but that didn't solve the problem. After I rebooted the whole compute node, the issue was resolved. Do you have any idea what might cause this?
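For context on what this error means: the Prometheus Go client rejects a scrape in which two metrics share the same name and the same label values, so many cgroup paths all resolving to jobid="" trip exactly this check. Below is a minimal sketch with github.com/prometheus/client_golang that reproduces the message; the dupCollector type and the paths in it are illustrative assumptions, not the exporter's actual code:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector emits the same metric twice with identical label values,
// mimicking an exporter that derives jobid="" from two different cgroup
// paths. The registry rejects this when gathering.
type dupCollector struct {
	desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	// Hypothetical: two cgroup paths that both fail jobid extraction
	// end up here with the same empty label value.
	for range []string{"/slurm/unexpected-a", "/slurm/unexpected-b"} {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{
		desc: prometheus.NewDesc("cgroup_uid", "Uid number of job", []string{"jobid"}, nil),
	})
	if _, err := reg.Gather(); err != nil {
		// Prints: collected metric "cgroup_uid" ... was collected
		// before with the same name and label values
		fmt.Println(err)
	}
}
```

This also explains why restarting the services doesn't help: as long as the offending cgroup directories still exist on the node, every scrape regenerates the same duplicate series.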
If it happens again, can you please try two things: first, look at the contents of the /slurm cgroup directories (e.g. find /sys/fs/cgroup/*/slurm -type d), and second, restart cgroup_exporter with --log.level=debug.
I suspect that there is something under /slurm that the regexp matches unexpectedly.
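To illustrate that suspicion (with a hypothetical stand-in pattern, not the exporter's actual regexp): any directory under /slurm that the collection walk visits but the jobid capture does not match would yield an empty jobid label, and two or more such directories produce exactly the duplicate metrics shown above.

```go
package main

import (
	"fmt"
	"regexp"
)

// jobPattern is an illustrative stand-in for the exporter's jobid regexp,
// not the actual pattern from cgroup_exporter.
var jobPattern = regexp.MustCompile(`^/slurm/uid_([0-9]+)/job_([0-9]+)`)

func jobID(cgroupPath string) string {
	m := jobPattern.FindStringSubmatch(cgroupPath)
	if m == nil {
		// A path like /slurm/system or a stale parent directory yields
		// no match, so the jobid label comes back empty. Two such paths
		// give two metrics with identical (empty) label values.
		return ""
	}
	return m[2]
}

func main() {
	for _, p := range []string{
		"/slurm/uid_1000/job_12345", // normal job cgroup -> jobid "12345"
		"/slurm/system",             // unexpected dir    -> jobid ""
		"/slurm/uid_1000",           // parent dir        -> jobid ""
	} {
		fmt.Printf("%-28s jobid=%q\n", p, jobID(p))
	}
}
```

If the find output shows directories like these when the error is occurring, the debug log should name the paths the exporter is picking up, which would confirm the theory.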