Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACME fails with JWS verification error #101445

Closed
symphorien opened this issue Oct 23, 2020 · 48 comments · Fixed by #101482
Closed

ACME fails with JWS verification error #101445

symphorien opened this issue Oct 23, 2020 · 48 comments · Fixed by #101482
Labels
0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS

Comments

@symphorien
Copy link
Member

Describe the bug

When upgrading from 20.03 to 20.09, acme started failing for 5 of my domains (but not all of them).

To Reproduce

  1. systemctl start acme-foo.example.org.service
acme-foo.example.com-start[23185]: 2020/10/23 08:32:54 [INFO] [foo.example.com] acme: Obtaining bundled SAN certificate
acme-foo.example.com-start[23185]: 2020/10/23 08:32:54 Could not obtain certificates:
acme-foo.example.com-start[23185]:         acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:
systemd[1]: acme-foo.example.com.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: acme-foo.example.com.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Renew ACME certificate for foo.example.com.
systemd[1]: acme-foo.example.com.service: Consumed 371ms CPU time, received 5.8K IP traffic, sent 2.2K IP traffic.

Expected behavior
No failure

Additional context
Add any other context about the problem here.

Notify maintainers
cc @NixOS/acme

Metadata

  • system: "x86_64-linux"
  • host os: Linux 5.8.14, NixOS, 20.09beta1083.51aaa3fa1b6 (Nightingale)
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.3.7
  • channels(root): "nixos-20.09beta1346.05334ad7852, nixos-unstable-21.03pre246543.24c9b05ac53"
  • channels(symphorien): "home-manager-20.09"
  • nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module: acme
@symphorien symphorien added the 0.kind: bug Something is broken label Oct 23, 2020
@arianvp arianvp added the 0.kind: regression Something that worked before working no longer label Oct 23, 2020
@Ma27
Copy link
Member

Ma27 commented Oct 23, 2020

I hit the same bug as well and fixed it on my machines with the following workaround: #91121 (comment)

@m1cr0man
Copy link
Contributor

Thanks for reposting that @Ma27 . Think I will extend #100356 with some info on how to deal with this error.

@symphorien
Copy link
Member Author

I wiped /var/lib/acme/.lego/accounts and now all certificate services fail:

acme-foo.example.com-start[1329]: 2020/10/23 19:17:28 No key found for account [email protected]. Generating a 2048 key.
acme-foo.example.com-start[1329]: 2020/10/23 19:17:29 Saved key to accounts/acme-v02.api.letsencrypt.org/[email protected]/keys/[email protected]
acme-foo.example.com-start[1329]: 2020/10/23 19:17:32 Account [email protected] is not registered. Use 'run' to register a new account.
systemd[1]: acme-foo.example.com.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: acme-foo.example.com.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Renew ACME certificate for foo.example.com.

@m1cr0man
Copy link
Contributor

Right - the tests for whether to run lego renew or lego run don't check the accounts directory, I will fix that now. That said, I don't know if ACME fairs well if you delete your account details and keep your certificates. I'm pretty sure you need to use the same account to issue a renew request for existing certificates.

@symphorien
Copy link
Member Author

Do you mean that I should nuke all of /var/lib/acme ?

@m1cr0man
Copy link
Contributor

Pretty much, yeah. Personally I would move it somewhere else instead of flat-out delete ;) Then run systemd-tmpfiles --create.

@symphorien
Copy link
Member Author

It works.

@m1cr0man
Copy link
Contributor

Awesome. I have opened a PR there which will resolve this issue. In particular, I made it check if the accounts directory is empty, which hopefully negates the need to wipe out your entire /var/lib/acme directory. I also added some documentation on the process.

@veprbl veprbl added the 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS label Oct 23, 2020
@vincentbernat
Copy link
Member

I am also running into this problem. The workaround works, but the account key was generated just yesterday and nothing looks corrupted (JSON and EC key). Today, I was trying to enroll a new domain using the same account. There may be something more fishy around this. Unsure how to debug this.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/release-notes-20-03-20-09-instructions-may-need-an-update-acme/9787/2

@symphorien
Copy link
Member Author

#102387 prevents renewal services to be run at the same time. I'm unsure as to how to test if it fixes the issue.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/lets-encrypt-on-20-09/9950/2

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/lets-encrypt-on-20-09/9950/3

tokudan added a commit to freifunkhamburg/mail2-nixos-config that referenced this issue Nov 12, 2020
@pjones
Copy link
Contributor

pjones commented Nov 13, 2020

#102387 prevents renewal services to be run at the same time. I'm unsure as to how to test if it fixes the issue.

@symphorien FWIW I don't think that renewing more than one cert at a time is a factor in this bug. I only had a single cert fail when upgrading to NixOS 20.09 and running the single systemd service by itself still failed. It appears this will be fixed by #102862

That said, linking all of the renewal services together would be nice so that if one fails the others won't be tried. If you have 5 or more failures you basically lock yourself out of LE for an hour.

@symphorien
Copy link
Member Author

That said, linking all of the renewal services together would be nice so that if one fails the others won't be tried. If you have 5 or more failures you basically lock yourself out of LE for an hour.

This PR does not do that. If a certificate fails, the next ones will be tried as well.

@m1cr0man
Copy link
Contributor

@symphorien and others, I have done some work in #106857 to prevent multiple simultaneous account creations, and reduced the usage of tmpfiles so that file creation is more dependable/reliable. If you could give it a shot and report back I'd be interested to hear if it works :) You may need to run systemctl clean --what=state acme-DOMAIN.service first to get rid of any potential broken stuff. Bare in mind this will delete your existing certs, keys and account data for that domain.

@symphorien
Copy link
Member Author

The issue still exists on 20.09 (I could not find any backport), and it happened to me this very morning.

@nh2
Copy link
Contributor

nh2 commented Dec 27, 2020

Also happened to me just now on 20.09.

The workaround from #91121 (comment) did not fix it for me.

@m1cr0man
Copy link
Contributor

m1cr0man commented Jan 11, 2021

Ok! I figured out how to reproduce this from a working setup 🔥

  1. Switch to the LE staging server
  2. Create a new certificate
  3. Move /var/lib/acme/.lego/accounts/$hash to $hash_before
  4. Run systemctl clean --what=state acme-$cert.service; systemctl start acme-$cert.service
  5. Copy /var/lib/acme/.lego/accounts/$hash to $hash_after
  6. Copy from the _before prefixed folder either account.json or the private key (one of them, either works) to the unprefixed accounts/$hash/...subfolder...
  7. Delete /var/lib/acme/.lego/$cert
  8. Run systemctl start acme-$cert.service

TL;DR If you end up in a situation where you have the wrong key for a given account, you get the exact error being shared around here.

Jan 11 19:10:44 gelandewagen acme-m1cr0man.com-start[30747]: 2021/01/11 19:10:44 Could not obtain certificates:
Jan 11 19:10:44 gelandewagen acme-m1cr0man.com-start[30747]:         acme: error: 400 :: POST :: https://acme-staging-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:

Why this is happening is what really confuses me. This has been demonstrated to happen naturally even when you only have 1 certificate involved, which rules out a multi-cert race condition. By virtue of the hashed folders, you can never really end up with stale data in your accounts folder.

Anyone that has an accounts directory that has this issue - can you please cat the key in your accounts folder and check it is not malformed? Also, can you compare the modify times of the account.json and the key file (they should be fairly close)? That's my best guess right now as to what the cause is.

What's nicer is that I have found an elegant solution: If your key is not corrupt, then you can re-request the belonging account ID. To quote the Let's Encrypt FAQ:

If your ACME client does not record the account ID, you can retrieve it by submitting a new registration request with the same key.

To do this, simply delete the account.json in your accounts folder, and run a renew.

@pjones
Copy link
Contributor

pjones commented Jan 11, 2021

@m1cr0man

Deleting the account.json file did not fix the issue for me.

Here is the account directory before deleting the file:

$ ls -lR /var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a
/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a:
total 4
drwx------ 3 acme acme 4096 Jan 11 15:46 acme-v02.api.letsencrypt.org

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org:
total 4
drwx------ 3 acme acme 4096 Jan 11 15:46 domains@redacted

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org/domains@redacted:
total 8
-rwx------ 1 acme postfix  215 Jan 11 15:46 account.json
drwx------ 2 acme acme    4096 Jan 11 15:46 keys

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org/domains@redacted/keys:
total 4
-rwx------ 1 acme acme 227 Jan 11 15:46 [email protected]

Before deleting the account.json file, running the service produced:

Jan 11 16:39:50 moriarty systemd[1]: Starting Renew ACME certificate for redacted...
Jan 11 16:39:51 moriarty acme-redacted-start[5092]: 2021/01/11 16:39:51 [INFO] [redacted] acme: Trying renewal with 727 hours rema>
Jan 11 16:39:51 moriarty acme-redacted-start[5092]: 2021/01/11 16:39:51 [INFO] [redacted] acme: Obtaining bundled SAN certificate
Jan 11 16:39:51 moriarty acme-redacted-start[5092]: 2021/01/11 16:39:51 acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org>

After deleting the file and running the service again:

Jan 11 22:36:52 moriarty systemd[1]: Starting Renew ACME certificate for redacted...
Jan 11 22:36:53 moriarty acme-redacted-start[13450]: 2021/01/11 22:36:53 Account domains@redacted is not registered. Use 'run' to regis>
Jan 11 22:36:53 moriarty systemd[1]: acme-redacted.service: Main process exited, code=exited, status=1/FAILURE

@bbqsrc
Copy link

bbqsrc commented Jan 12, 2021

Also did not work for me.

@bbqsrc
Copy link

bbqsrc commented Jan 12, 2021

Just straight up deleting /var/lib/acme helped though. Think about what you're doing before you do this, but for me it didn't matter if things didn't work. 😛

Fortunately though, it did.

@stephank
Copy link
Contributor

stephank commented Jan 12, 2021

That actually did work for me, though hopefully I was in the correct situation:

$ ls -l account.json keys/*.key
-rwx------ 1 acme nginx  215 Jan 12 10:55 account.json
-rwx------ 1 acme nginx  227 Jan 12 10:55 keys/<REDACTED>.key

$ systemctl start acme-<REDACTED>.service
Job for acme-<REDACTED>.service failed because the control process exited with error code.
See "systemctl status acme-<REDACTED>.service" and "journalctl -xe" for details.

$ journalctl -e
Jan 12 15:16:37 <REDACTED> acme-<REDACTED>-start[18364]: 2021/01/12 15:16:37 [INFO] [<REDACTED>] acme: Obtaining bundled SAN certificate
Jan 12 15:16:37 <REDACTED> acme-<REDACTED>-start[18364]: 2021/01/12 15:16:37 Could not obtain certificates:
Jan 12 15:16:37 <REDACTED> acme-<REDACTED>-start[18364]:         acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:

$ mv account.json bak-account.json

$ systemctl start acme-<REDACTED>.service

$ journalctl -e
Jan 12 15:16:50 <REDACTED> acme-<REDACTED>-start[18400]: 2021/01/12 15:16:50 [INFO] acme: Registering account for <REDACTED>
[...]
Jan 12 15:16:58 <REDACTED> acme-<REDACTED>-start[18400]: 2021/01/12 15:16:58 [INFO] [<REDACTED>] Server responded with a certificate.

$ ls -l account.json keys/*.key
-rwx------ 1 acme nginx 215 Jan 12 15:16 account.json
-rwx------ 1 acme nginx 227 Jan 12 10:55 keys/<REDACTED>.key

After this, account.json was recreated with a different account ID, and the key remained the same. For what it's worth, the numeric ID is 3 lower in value than the backup. And as you can see, there was a little over 4 hours inbetween.

For some background, this is on EC2 and we provisioned this server today. We activated a configuration that requested 10 certificates in one go, and all went well there. Later, we had to change one of the nginx virtual hosts, so caused a new certificate request there. That one failed with the JWS verification error.

So ACME configuration was only changed twice: in the initial activation, and the later change. (Other activations did cause the services to rerun, just do nothing, because certs were up-to-date.)

Unfortunately, this is not a 1-certificate-scenario, and so doesn't rule out parallelism.

This machine is currently on nixpkgs aa5b9cd on branch nixos-20.09.

I would expect an actually corrupt key on disk would cause a different error, because Lego would not be able to sign the request? It's my guess that we can rule out a corrupt key file.

@m1cr0man
Copy link
Contributor

After this, account.json was recreated with a different account ID, and the key remained the same. For what it's worth, the numeric ID is 3 lower in value than the backup. And as you can see, there was a little over 4 hours inbetween.

Unfortunately, this is not a 1-certificate-scenario, and so doesn't rule out parallelism.

Thanks for the background info. Fortunately I have definitely fixed any parallelism issues in #106857 :) If that was the cause in your case, it should be resolved soon. The fact that your ID was 3 lower confirms it IMO.

I would expect an actually corrupt key on disk would cause a different error, because Lego would not be able to sign the request? It's my guess that we can rule out a corrupt key file.

I would suspect the same too, but I haven't looked into how Lego uses that key exactly. I would hope it gives a different error.

@pjones I came up with another possible scenario where this solution may not work (at least, deleting the account.json would not work). AFAIK the account keys are generated by lego. If the key is generated, written to disk, then account creation fails, then you end up with a key with no account associated with it. This could be the issue in your case. Try removing the key instead of the account.json and report back what happens :)

@m1cr0man
Copy link
Contributor

Actually come to think of it there's no way for lego to get the key. I suspect it will still work, but your account ID will change. That's what happened when I tried it.

@pjones
Copy link
Contributor

pjones commented Jan 12, 2021

@m1cr0man If that last comment was directed to me...

If my account ID is going to change should I just archive /var/lib/acme and start from scratch?

@m1cr0man
Copy link
Contributor

@pjones I guess so, emphasis on the archive since right now I don't know what else to investigate as a possible cause.

@pjones
Copy link
Contributor

pjones commented Jan 18, 2021

@m1cr0man My server with multiple certs, the one where all the acme services seemed to resolve themselves, started failing again. Nothing on this server changed.

Jan 17 12:53:50 kilgrave systemd[1]: Starting Renew ACME certificate for redacted...
Jan 17 12:53:51 kilgrave acme-redacted-start[6257]: 2021/01/17 12:53:51 [INFO] [redacted] acme: Trying renewal with 587 hours remaining
Jan 17 12:53:51 kilgrave acme-redacted-start[6257]: 2021/01/17 12:53:51 [INFO] [redacted] acme: Obtaining bundled SAN certificate
Jan 17 12:53:51 kilgrave acme-redacted-start[6257]: 2021/01/17 12:53:51 acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:
Jan 17 12:53:51 kilgrave systemd[1]: acme-redacted.service: Main process exited, code=exited, status=1/FAILURE
Jan 17 12:53:51 kilgrave systemd[1]: acme-redacted.service: Failed with result 'exit-code'.

So I decided to archive the /var/lib/acme directory and start from zero. When running one of the renewal services I get the following error:

Jan 18 17:53:18 kilgrave systemd[1]: Starting Renew ACME certificate for redacted...
Jan 18 17:53:18 kilgrave systemd[8245]: acme-redacted.service: Failed to set up mount namespacing: /run/systemd/unit-root/tmp/accounts: No such file or directory
Jan 18 17:53:18 kilgrave systemd[8245]: acme-redacted.service: Failed at step NAMESPACE spawning /nix/store/3z83bfgnaqxvvsw35k7ypn02rr1j875j-unit-script-acme-redacted-start/bin/acme-redacted-start: No such file or directory
Jan 18 17:53:18 kilgrave systemd[1]: acme-redacted.service: Main process exited, code=exited, status=226/NAMESPACE
Jan 18 17:53:18 kilgrave systemd[1]: acme-redacted.service: Failed with result 'exit-code'.
Jan 18 17:53:18 kilgrave systemd[1]: Failed to start Renew ACME certificate for redacted.

I've started receiving Let's Encrypt emails stating that some of my certs are about to expire. At this point all of my servers are failing to renew all certifications because of acme service errors.

At this point I don't know what to do and I'm only a few days away from a production outage due to expired certs.

Please advise.

@m1cr0man
Copy link
Contributor

Can you send the logs from the last good renewal of one of your certs? Also can you indicate:

  • Whether you use one email/account for all certs or multiple accounts
  • Whether /var/lib/acme and /tmp are on different file systems

Also you need to run systemd-tmpfiles --create if you blow away your whole /var/lib/acme

@pjones
Copy link
Contributor

pjones commented Jan 18, 2021

@m1cr0man The services have been failing long enough where I no longer have logs that go back that far. As far as accounts go, I use one email address across all of my certs.

Using systemd-tmpfiles was the missing piece that I needed. I was able to recreate the certs on one of my servers successfully. I'll start working on the other servers next.

Thank you for your quick response.

@m1cr0man
Copy link
Contributor

Glad you got sorted. I am constantly surprised by people's encounters with this bug. I personally have a lot of hosts using the module; one with over 20 certs all using one account, and I have never hit some of these common issues - which I hope is a bit of good news. Once you rebuild your /var/lib/acme it tends to go away for good but I've not found a good reason why it breaks in the first place. Hopefully the work in the other PR will help people.

pjones added a commit to pjones/tilde that referenced this issue Jan 24, 2021
@rkoe
Copy link
Contributor

rkoe commented Jan 31, 2021

Is there any solution or workaround for this "JWS verification error"? Even removing /var/lib/acme + running systemd-tmpfiles --create does not help in any way.
There's always the error:

acme: error: 400 :: POST :: https://acme-staging-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:

Any help would be appreciated.

@zopieux
Copy link
Contributor

zopieux commented Jan 31, 2021

@rkoe FYI I had to patch #106857 in my local nixpkgs, backup/remove /var/lib/acme, run systemd-tmpfiles --create and run the service to finally get it to work.

@rkoe
Copy link
Contributor

rkoe commented Jan 31, 2021

Will there be any fix for this problem in NixOS 20.09.?

@zopieux Thanks. Can you tell me how to patch this?

@m1cr0man
Copy link
Contributor

@rkoe FYI I had to patch #106857 in my local nixpkgs, backup/remove /var/lib/acme, run systemd-tmpfiles --create and run the service to finally get it to work.

This doesnt make sense. I removed all tmpfiles usage in #106857 so that couldn't have an impact on any resolution. However, if that patch does solve the issue then I will look into getting a backport merged

@zopieux
Copy link
Contributor

zopieux commented Feb 1, 2021

Sorry if this is just a coincidence then. FYI I was running nixos-unstable at 257cbbc + the PR mentioned above when I finally got rid of the ACME errors.

@m1cr0man
Copy link
Contributor

m1cr0man commented Feb 6, 2021

I have opened a backport for #106857 in #112145 :)

@arianvp
Copy link
Member

arianvp commented Feb 15, 2021

This is fixed now!

@arianvp arianvp closed this as completed Feb 15, 2021
@tgunnoe
Copy link

tgunnoe commented Feb 15, 2021

To whom it may concern:
I still had to remove/backup /var/lib/acme to get the update to work.

@m1cr0man
Copy link
Contributor

To whom it may concern:
I still had to remove/backup /var/lib/acme to get the update to work.

There's not much that can be done about this, but I am hopeful that the latest set of PRs will stop the issue occurring again. Out of interest though, did you get this error specifically (JWS verification) or something else?

@stigok
Copy link
Contributor

stigok commented Mar 14, 2021

Just had this issue today on unstable. To get it working again I did the following

# tar -cf ~/acme-backup.tar /var/lib/acme
# rm -rf /var/lib/acme
# mkdir /var/lib/acme
# chown acme /var/lib/acme

Lots of good hints in this thread here. Thanks!

@m1cr0man
Copy link
Contributor

To anyone else that comes across this thread: I also wrote docs on the manual which detail only deleting the accounts folder to resolve this issue. From previous testing in this thread, this was usually all that was required.

Failing that, could you please try these commands before the above options in the thread as I am curious if it will work:

tar -cf ~/acme-backup.tar /var/lib/acme
systemctl clean --what=state acme-$brokencert.service

This will delete the certs + accounts data for the failing renewal. It's a little less destructive if you're only having a problem with one cert than recreating the whole of /var/lib/acme

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS
Projects
None yet