ACME does not react helpfully upon failure #116652

Closed
turion opened this issue Mar 17, 2021 · 3 comments
Labels
0.kind: bug Something is broken

@turion
Contributor

turion commented Mar 17, 2021

On my webserver, ACME services like this one have been failing for quite some time:

# systemctl status acme-nextcloud.manuelbaerenz.de.service
● acme-nextcloud.manuelbaerenz.de.service - Renew ACME certificate for nextcloud.manuelbaerenz.de
     Loaded: loaded (/nix/store/g7laham2i950fsrn7iz9h90gxcr4b1ik-unit-acme-nextcloud.manuelbaerenz.de.service/acme-nextcloud.manuelbaerenz.de.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-03-17 18:09:04 UTC; 6s ago
TriggeredBy: ● acme-nextcloud.manuelbaerenz.de.timer
    Process: 11735 ExecStart=/nix/store/knmsshynd5ynwqa6b8hkgvrcpgbhv3kk-unit-script-acme-nextcloud.manuelbaerenz.de-start/bin/acme-nextcloud.manuelbaerenz.de-start (code=exited, status=1/>
   Main PID: 11735 (code=exited, status=1/FAILURE)
         IP: 4.5K in, 1018B out
        CPU: 140ms

Mar 17 18:09:02 manuelbaerenz acme-nextcloud.manuelbaerenz.de-start[11738]: ++ ls -1 accounts
Mar 17 18:09:02 manuelbaerenz acme-nextcloud.manuelbaerenz.de-start[11735]: + '[' -e certificates/nextcloud.manuelbaerenz.de.key -a -e certificates/nextcloud.manuelbaerenz.de.crt -a -n acm>
Mar 17 18:09:02 manuelbaerenz acme-nextcloud.manuelbaerenz.de-start[11735]: + '[' -e certificates/domainhash.txt ']'
Mar 17 18:09:02 manuelbaerenz acme-nextcloud.manuelbaerenz.de-start[11735]: + cmp -s domainhash.txt certificates/domainhash.txt
Mar 17 18:09:02 manuelbaerenz acme-nextcloud.manuelbaerenz.de-start[11735]: + lego --accept-tos --path . -d nextcloud.manuelbaerenz.de --email [email protected] --key-type ec256>
Mar 17 18:09:04 manuelbaerenz acme-nextcloud.manuelbaerenz.de-start[11740]: 2021/03/17 18:09:04 Account [email protected] is not registered. Use 'run' to register a new account.
Mar 17 18:09:04 manuelbaerenz systemd[1]: acme-nextcloud.manuelbaerenz.de.service: Main process exited, code=exited, status=1/FAILURE
Mar 17 18:09:04 manuelbaerenz systemd[1]: acme-nextcloud.manuelbaerenz.de.service: Failed with result 'exit-code'.
Mar 17 18:09:04 manuelbaerenz systemd[1]: Failed to start Renew ACME certificate for nextcloud.manuelbaerenz.de.
Mar 17 18:09:04 manuelbaerenz systemd[1]: acme-nextcloud.manuelbaerenz.de.service: Consumed 140ms CPU time, received 4.5K IP traffic, sent 1018B IP traffic.

I inspected /var/lib/acme/.lego, and all folders seemed fine. I had successfully used Let's Encrypt in the past, but I believe in the recent move to lego, something broke, or some state got corrupted.

I could fix it by removing the .lego folder and running systemctl restart acme-nextcloud.manuelbaerenz.de.service. But I believe it would have been better if the failure of this service had either:

  1. automatically triggered a complete lego run
  2. given the user some clearer error message on how to manually trigger a complete run, for example with a separate systemd service executing the run script, that can be called manually.
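
The workaround described above can be sketched as a small shell function. This is only an illustration: `reset_lego_state` and the `DRY_RUN` switch are hypothetical names, the `.lego` path is the one reported in this issue, and deleting it presumably discards lego's account data for every certificate on the host, not just this one.

```shell
#!/bin/sh
# Sketch of the workaround from this issue: wipe lego's state directory,
# then re-run the renewal unit. reset_lego_state and DRY_RUN are
# hypothetical names. With DRY_RUN=1 (the default here) the commands are
# only printed, nothing is deleted.
reset_lego_state() {
    domain="$1"
    lego_dir="${2:-/var/lib/acme/.lego}"
    if [ -z "$domain" ]; then
        echo "usage: reset_lego_state DOMAIN [LEGO_DIR]" >&2
        return 2
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "rm -rf -- $lego_dir"
        echo "systemctl restart acme-$domain.service"
    else
        rm -rf -- "$lego_dir" && systemctl restart "acme-$domain.service"
    fi
}
```

Calling `reset_lego_state nextcloud.manuelbaerenz.de` prints the two commands for review; setting `DRY_RUN=0` and running as root would execute them.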

CC

@aanderse
@andrew-d
@arianvp
@Emily
@flokli
@m1cr0man

Metadata

  • system: "x86_64-linux"
  • host os: Linux 5.4.104, NixOS, 21.05pre276379.266dc8c3d05 (Okapi)
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.3.10
  • channels(root): "nixos-21.05pre276379.266dc8c3d05"
  • nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
  - security.acme
turion added the "0.kind: bug" label Mar 17, 2021
@turion
Contributor Author

turion commented Mar 17, 2021

P.S.: I discovered this only when I got the mail from Let's Encrypt saying that my certificates were expiring. Had I read my logs more carefully, I would have noticed earlier: the error had occurred daily since the beginning of February, when I upgraded to d96bd33.
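
A failure like this can be caught without reading the full journal by asking systemd which renewal units are in the failed state. The following is a minimal sketch, assuming the units follow the `acme-<domain>.service` naming seen in the log above; `failed_acme_units` and `check_acme_units` are hypothetical helper names.

```shell
#!/bin/sh
# Print the names of failed ACME renewal units, one per line.
# --plain drops the status bullet, so the unit name is the first column.
failed_acme_units() {
    systemctl list-units --state=failed --plain --no-legend 'acme-*.service' \
        | awk '{ print $1 }'
}

# Return non-zero (and report on stderr) if any renewal unit has failed,
# so the function can drive a cron mail or a monitoring check.
check_acme_units() {
    failed="$(failed_acme_units)"
    if [ -n "$failed" ]; then
        echo "Failed ACME units:" >&2
        echo "$failed" >&2
        return 1
    fi
    return 0
}
```

For any unit it reports, `journalctl -u <unit>` shows the lego output that explains the failure.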

@m1cr0man
Contributor

> But I believe it would have been better if the failure of this service would have either:
>
>   1. automatically triggered a complete lego run
>   2. given the user some clearer error message on how to manually trigger a complete run, for example with a separate systemd service executing the run script, that can be called manually.

There's a documented process in the NixOS Manual for triggering a manual renewal of certificates which would also resolve your issue. This avoids the need for a second service for a full renewal.

With regard to performing a full renewal automatically when an incremental renewal fails: this is very difficult given how much complexity there already is in the service's scripting. There are many possible causes of an error, and without parsing the output of lego itself we can't be sure which one we hit.

Failing the entire systemd service is the most helpful thing we can do. You will see a failed service when you do a nixos-rebuild, and it will clearly log to the journal that there was an error. There are a number of reasons I wouldn't want to put any other echo statements or remediation steps in:

  • Recurring failures could trigger rate limits from Let's Encrypt
  • Recommending to run systemctl clean --what=state acme-$domain.service for all renewal failures would make it more difficult to later debug one-off issues (e.g. ACME fails with JWS verification error #101445).
  • Reading the output + maintaining a list of errors -> solutions in code would be very time consuming for @nixos/acme and fixes wouldn't necessarily be reliable.

I appreciate you taking the time to write this ticket but I can't think of a way to action this more effectively. I can update the header in the manual wrt running systemctl clean to recommend it as a more general troubleshooting step.
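
As a rough sketch of that troubleshooting step, the clean command quoted above can be paired with a fresh start of the same unit. `regenerate_cert` is a hypothetical helper name, and the sketch assumes root privileges and the `acme-<domain>.service` unit naming used in this thread. `systemctl clean --what=state` removes whatever the unit declares as its state directory, which on NixOS is expected to include the certificate and lego data for that domain, so it is best kept for cases where a reset is actually wanted.

```shell
#!/bin/sh
# Sketch: reset the state of one ACME unit, then request a fresh
# certificate. regenerate_cert is a hypothetical helper; run as root.
regenerate_cert() {
    domain="$1"
    if [ -z "$domain" ]; then
        echo "usage: regenerate_cert DOMAIN" >&2
        return 2
    fi
    unit="acme-$domain.service"
    # Stop here if the state could not be cleaned, rather than starting
    # the unit against possibly inconsistent leftovers.
    systemctl clean --what=state "$unit" || return 1
    systemctl start "$unit"
}
```

For example, `regenerate_cert nextcloud.manuelbaerenz.de` would reset and re-run the unit from this issue. Because every run requests a new certificate, repeated use can run into the rate limits mentioned above.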

@turion
Contributor Author

turion commented Mar 17, 2021

I see your point. You're probably right that further scripting would complicate matters more than it would help. Thanks for your detailed answer.

turion closed this as completed Mar 17, 2021

2 participants