Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The restic command seems to hang #33

Closed
zoechi opened this issue Sep 6, 2024 · 4 comments
Closed

The restic command seems to hang #33

zoechi opened this issue Sep 6, 2024 · 4 comments

Comments

@zoechi
Copy link

zoechi commented Sep 6, 2024

I back to Hetzner using the sftp.command (#31).
So for restic-exporter to work I mount the storage to a directory.

journalctl -xfu prometheus-restic-exporter

shows this with no further output

Sep 06 08:01:01 myserver systemd[1]: Started prometheus-restic-exporter.service.
░░ Subject: A start job for unit prometheus-restic-exporter.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit prometheus-restic-exporter.service has finished successfully.
░░
░░ The job identifier is 65000.
Sep 06 08:01:01 myserver prometheus-restic-exporter-start[273523]: 2024-09-06 08:01:01 INFO     Starting Restic Prometheus Exporter
Sep 06 08:01:01 myserver prometheus-restic-exporter-start[273523]: 2024-09-06 08:01:01 INFO     It could take a while if the repository is remote
ps aux | grep restic

shows

root      273754  2.3  1.1 4296735584 278484 ?   Sl   08:01   0:14 /nix/store/7mzcrjbpsp8nm2i103lfbhkglwcva4mg-restic-0.17.0/bin/restic -r /mnt/restic_storagebox/backups/myserver -p /run/credentials/prometheus-restic-exporter.service/RESTIC_PASSWORD_FILE --no-lock stats --json 74fbf530ad47c94aebd884fa06a99f520290225e2142bc5853731c5733c14563

When I execute the command manually after about 15sec I get

{"total_size":249796868116,"total_file_count":1103048,"snapshots_count":1}

Any idea why the command might be stuck when run from the exporter
and why the exporter never starts listening on the configured port?

[Service]
Environment="LISTEN_ADDRESS=127.0.0.1"
Environment="LISTEN_PORT=9753"
User=root
...

Any suggestions about how to debug?

@zoechi
Copy link
Author

zoechi commented Sep 14, 2024

It took 15 hours from starting the restic stats command until it starts serving on the configured address+port.
I looked a bit at the code and it looks like a pure Python subprocess.run issue, but I have no experience with Python.
Even when it runs such a long time it seems to complete successfully in the end otherwise an exception would be thrown.

Sep 13 22:07:35 smarthome prometheus-restic-exporter-start[1022]: 2024-09-13 22:07:35 INFO     Starting Restic Prometheus Exporter
Sep 13 22:07:35 smarthome prometheus-restic-exporter-start[1022]: 2024-09-13 22:07:35 INFO     It could take a while if the repository is remote
Sep 14 13:07:27 smarthome prometheus-restic-exporter-start[1022]: 2024-09-14 13:07:27 INFO     Serving at http://0.0.0.0:9753
Sep 14 13:07:27 smarthome prometheus-restic-exporter-start[1022]: 2024-09-14 13:07:27 INFO     Refreshing stats every 3600 seconds

Perhaps some more log output about what it does could help in case it does more that I think (multiple calls).
From what I observed it's always the same process and not one call per snapshot or whatever. The hash is always the same when I check the process list.

@zoechi
Copy link
Author

zoechi commented Sep 14, 2024

I think now that this is a SystemD issue which limits resources for services.
I'm going to close but will update when I figure out how to fix this.

@zoechi zoechi closed this as completed Sep 14, 2024
@zoechi
Copy link
Author

zoechi commented Sep 15, 2024

I found restic/restic#4591
I ensured the RESTIC_CACHE_DIR env var is set to the same directory as when restic backups are executed. When I run the command manually, the cache directory's changed timestamp is updated, when it's run from the exporter in the systemd service it is not updated.
So it might be the missing cache dir but it's not clear yet why.

@zoechi
Copy link
Author

zoechi commented Sep 16, 2024

The cause was that the cache directory was read only.
This was caused by NixOS setting ProtectSystem=strict that is set for all Prometheus exporter's systemd services.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant