Running as systemd service with port change does not work #95

Open
dchau-wfr opened this issue Feb 25, 2023 · 8 comments

Comments

@dchau-wfr

dchau-wfr commented Feb 25, 2023

  1. I copied the original service file and added the option -listen-address 0.0.0.0:9101. The service starts, but the metrics page does not work.
  2. Running /usr/bin/prometheus-slurm-exporter -listen-address 0.0.0.0:9101 manually does work.

Deric

@lahwaacz
Contributor

What is the error message?

@dchau-wfr
Author

dchau-wfr commented Feb 25, 2023

  1. When I run it as a service:
    `● prometheus-slurm-exporter.service - Prometheus SLURM Exporter
    Loaded: loaded (/etc/systemd/system/prometheus-slurm-exporter.service; disabled; vendor preset: enabled)
    Active: active (running) since Sat 2023-02-25 07:59:38 UTC; 9s ago
    Main PID: 2496943 (prometheus-slur)
    Tasks: 6 (limit: 230312)
    Memory: 2.5M
    CGroup: /system.slice/prometheus-slurm-exporter.service
    └─2496943 /usr/bin/prometheus-slurm-exporter -listen-address 0.0.0.0:9101

Feb 25 07:59:38 nm-203-18 systemd[1]: Started Prometheus SLURM Exporter.
Feb 25 07:59:38 nm-203-18 prometheus-slurm-exporter[2496943]: time="2023-02-25T07:59:38Z" level=info msg="Starting Server: 0.0.0.0:9101" source="m>
Feb 25 07:59:38 nm-203-18 prometheus-slurm-exporter[2496943]: time="2023-02-25T07:59:38Z" level=info msg="GPUs Accounting: false" source="main.go:>
Feb 25 07:59:42 nm-203-18 systemd[1]: prometheus-slurm-exporter.service: Current command vanished from the unit file, execution of the command lis>
root@nm-203-18:/etc/systemd/system# curl 127.0.0.1:9101/metrics
(curl: (52) Empty reply from server)`

  2. When I ran it manually with the same command as in the service file (`/usr/bin/prometheus-slurm-exporter -listen-address=0.0.0.0:9101`):
    root@nm-203-18:~# curl 10.5.70.7:9101/metrics

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
...........

@dchau-wfr
Author

Service file contents:
`root@nm-203-18:/etc/systemd/system# cat prometheus-slurm-exporter.service
[Unit]
Description=Prometheus SLURM Exporter

[Service]
ExecStart=/usr/bin/prometheus-slurm-exporter -listen-address=0.0.0.0:9101
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target`

@lahwaacz
Contributor

I can't reproduce this; it works just fine for me. I doubt that the error `prometheus-slurm-exporter.service: Current command vanished from the unit file, execution of the command lis>` is specific to the behavior of the exporter. What happens when you remove the -listen-address flag and use the default port?

@dchau-wfr
Author

dchau-wfr commented Feb 25, 2023 via email

@dchau-wfr
Author

dchau-wfr commented Feb 27, 2023

I just tried it on another machine and reproduced the issue. This is with slurm 22.05.6.

I am also getting exit code 1 after it runs for some time. This might be related to the sdiag issues I've seen in other bugs.

@qww-ygg

qww-ygg commented May 18, 2023

Is there a way to solve this?

@dchau-wfr
Author

The issue is that when the Slurm exporter is run as a service, it does not have the Slurm environment loaded. We had to load the Slurm environment by calling a custom slurmrc.
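
For anyone hitting the same thing, here is a minimal sketch of what such a wrapper could look like. The thread does not show the actual slurmrc or where it lives, so the path `/etc/slurm/slurmrc`, the function names, and the list of Slurm commands checked are all assumptions for illustration. The idea is to source the environment file first, then confirm that the Slurm client tools are visible on `PATH` before starting the exporter.

```shell
#!/bin/sh
# Sketch only: helper functions for a wrapper script used in ExecStart=.
# The slurmrc path and the command list below are assumptions, not taken
# from the exporter's documentation.

# Source an environment file into the current shell.
# Returns non-zero if the file is not readable, instead of aborting.
load_env_file() {
    [ -r "$1" ] || return 1
    . "$1"
}

# Print every command from the argument list that is NOT found on PATH,
# one per line. Prints nothing if all of them are available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# A real wrapper would end roughly like this (left commented out here):
#   load_env_file /etc/slurm/slurmrc            # hypothetical path
#   missing_commands sinfo squeue sdiag >&2     # report anything not on PATH
#   exec /usr/bin/prometheus-slurm-exporter -listen-address=0.0.0.0:9101
```

The unit file's `ExecStart=` would then point at the wrapper instead of the exporter binary. Using `exec` at the end keeps the exporter as the service's main process, so systemd still tracks and restarts it as before.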
