Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several issues in ACME certificates #84633

Closed
3 of 4 tasks
immae opened this issue Apr 7, 2020 · 37 comments
Closed
3 of 4 tasks

Several issues in ACME certificates #84633

immae opened this issue Apr 7, 2020 · 37 comments
Assignees
Labels
0.kind: bug Something is broken 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS
Milestone

Comments

@immae
Copy link
Contributor

immae commented Apr 7, 2020

Describe the bug

This issue aggregates one "old" and several new issues with ACME in 20.03 and unstable. Since some of them are dependent on each other to be of concern, I write them all in the same issue.

  • Change of acme certificate from simp_le to lego breaks workflow #84409
  • Due to the "RemainAfterExit" flag in service configuration, the acme-foo service is only ran once after boot or when the service derivation is modified during deployment. This defeats the purpose of the corresponding timer since it will not restart an already-started service. (see man systemd.timer for confirmation about this point, or check systemd logs after one day to confirm that the service was indeed never restarted)
  • The postRun script for acme is alway run when acme is started, i.e. every day according to the timer. Previous version of the postRun script was only launched when there was an effective change in the certificate (which is the sensible action I think).
    Due to the previous issue, the bug is currently not apparent when the timer triggers (every day), only when the service derivation is changed.
  • when we change the domains list, it doesn’t regenerate the certificate (It might be due to the fact that the script runs lego renew || lego run where "run" would be the one to run when the list changes)

Poke: @aanderse @arianvp @Mic92 @m1cr0man @emilazy @flokli since you were concerned with the previous issue about acme.

@immae immae added the 0.kind: bug Something is broken label Apr 7, 2020
@veprbl veprbl added the 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS label Apr 7, 2020
@arianvp
Copy link
Member

arianvp commented Apr 8, 2020

Issue 2 was introduced in #72056 to fix #48845

However it seems like a bit of a workaround to me that seems to be more of a bug in switch-to-configuration.pl which doesn't restart oneshots properly so maybe the cure was worse than the harm and we should try fixing it another way.

@immae
Copy link
Contributor Author

immae commented Apr 8, 2020

Shouldn’t the solution be:

  • ACME runs as root no matter what, and places the challenges in .well-known/challenge (world readable or at least readable by the web server, which is the important one here)
  • At the end of the renewal, it copies the certificate to its final place and gives proper rights

I don’t see how any other configuration could work, if ACME runs as regular user you would need to give world-writable rights to .well-known/challenge (in case two jobs with different users would run simultaneously, which happens for instance when you define two new certificates during deployment).

In any case the configuration in tempfiles cannot work if you have more than one user : it will try to create .well-known/challenge as both user, but only one of them will ever win...

Another possibility would be to use groups and force every "user" using certificates to belong to this group. (EDIT: this would not allow to assign a different group to the resulting certificate, not good)

About #48845, the way I see it is to have a acme-foo-setrights service whose only job would be to set rights on the final destination (and would have Wants/After acme-foo + WantedBy multi-user.target). This one would duplicate the last part of acme-foo’s job, but as it’s idempotent we don’t care much, and it could have the RemainAfterExit which would become harmless too in this situation (it would only be useful during deployment, which would anyway restart it regardless of this flag)

What do you think?

I may do the work to implement that if you agree with the ideas

@emilazy
Copy link
Member

emilazy commented Apr 8, 2020

The lego client should definitely not be run as root, that's a huge attack surface that has no need for elevated privileges beyond reading and writing out certificates.

Currently if you're using the nginx module with HTTP-01 challenges the certificates are owned as nginx:nginx and lego runs as nginx; I personally think it would be best to run as an acme account with corresponding group, have the certificates owned by acme:acme, turn on allowKeysForGroup by default, and put nginx in the acme group, to reduce the privilege it runs with further and take away nginx's write access to the certificates.

@flokli
Copy link
Contributor

flokli commented Apr 9, 2020

Not sure if we can assume ACL support, but if we do, we might also be able to just use setfacl after copying files into the final location to grant read access to the certificates for the users configured for each certificate, without having to require them to be in certain groups.

@flokli
Copy link
Contributor

flokli commented Apr 9, 2020

I think we should probably revert #72056, and see if we can fix it differently.

@immae
Copy link
Contributor Author

immae commented Apr 11, 2020

I just uncovered a new bug in Lego which might be blocking, but it might be a bug for upstream rather: when we change the domains list, it doesn’t regenerate the certificate...

(It might be due to the fact that the script runs lego renew || lego run where "run" would be the one to run at that stage)

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/go-no-go-meeting-nixos-20-03-markhor/6495/25

@emilazy
Copy link
Member

emilazy commented Apr 14, 2020

(moved from #85152)

@m1cr0man I think a rewrite of the module would be great, as long as compatibility with the basic configuration options was kept; there's a fair bit held over from simp_le that no longer makes sense, and a lot of nits with the current implementation. I know @yegortimoshenko was working on a refactor at some point; maybe they'd be able to share WIP code?

For what it's worth, one possible way to fix the various renewal bugs in one shot would be to derive the certificate subdirectory from a hash of its configuration like /var/lib/acme/.lego/$domain-$hash; that way, any changes to the OCSP Must-Staple configuration, alternate domains, etc. would change the underlying internal path to the certificate and force a new one to be issued. We'd still need explicit acme-*-{force,revoke}.service or similar that users could run manually for edge-cases, but that should ensure that the 90% case is handled out of the box without hacks to manually check the configuration differences and remove certificates. (The implementation of force renewal will still probably involve manually removing certificates from underneath lego since I think it doesn't implement this directly; it'd be best to just rename the directory to $domain-$hash-old-$timestamp or similar so that people can still revoke them if necessary.)

Of course, in an ideal world, lego would be able to handle forcing renewals on certificate configuration changes itself, but I'm not sure it tracks enough state currently to be able to do this.

@arianvp
Copy link
Member

arianvp commented Apr 14, 2020

I just opened #85223

@flokli
Copy link
Contributor

flokli commented Apr 14, 2020

I'm not sure if we're doing ourselves a favour by having this meta issue - maybe the individual issues should be put into separate issues…

@arianvp
Copy link
Member

arianvp commented Apr 14, 2020

How about a project board?

@immae
Copy link
Contributor Author

immae commented Apr 14, 2020

@flokli : merged them all because of inte-dependencies between them, feel free to cut it in pieces :)

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/reuse-lets-encrypt-acme-certificate-for-multiple-services-with-lego/6720/4

@arianvp
Copy link
Member

arianvp commented Apr 15, 2020

I agree we should revert #72056 now that I come to think of it doesn't it actually completely break renewal? not only activation?! as the timer will never be able to re-fire the unit

Edit: Reverting PRs:

@arianvp
Copy link
Member

arianvp commented Apr 16, 2020

Could other people jump in to fix the last two issues? @immae perhaps? I don't have enough time to work on all of them and I don't want to be blocking the 20.03 release

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

Sure

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

By writing the PR I found one more issue that could be problematic...

# Test that existing cert is older than new cert
KEY=${spath}/certificates/${keyName}.key
if [ -e $KEY -a $KEY -nt key.pem ]; then
  (...)
fi

In the simp_le age, the key.pem file was reused at each run. It’s not the case in lego, unless we use the --reuse-key.

I’m not concerned with this change but I know this can have a big consequence for people who use key pinning (I’m not sure of the name, I’m thinking of the feature where the web server says to the browser "My public key is Foo, if I give you any certificate with a different public key in the next year be worried").

Maybe we should add something about it in the documentation? (note that since there is no migration from switching from simp_le to lego people who use that will already have issues about this feature)

@arianvp
Copy link
Member

arianvp commented Apr 16, 2020

Web browsers do not support this feature, but it's often used in native apps like banking apps indeed.

I think a note for now should be sufficient, but exposing this --reuse-key feature in the module sounds like a good idea to me

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

Note: I don’t think the remaining issues should be considered blocking for the release of 20.03, provided we have proper documentations that acme certificates is not an ideal world: If I understand correctly, the remaining issues are either just annoying or wontfix (regarding migration scripts from old setup)

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

Should I propose a PR with --reuse-key @arianvp ?

EDIT:

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

I had a look at the last checkbox (changing the domains list, or as @emilazy said it can concern other options too), lego officially doesn’t support it (PR welcome though: go-acme/lego#507 the issue is slightly different but raises the same concern).

I guess a way to solve it simply would be to store the whole command line in /var/lib/acme/my_cert/command_line, compare the current one with the previous one and switch "renew" to "run" in case it changed. However, the flags like reuse-key cannot be used in "run" so we’re back with the issue of key pinning...

@emilazy
Copy link
Member

emilazy commented Apr 16, 2020

See my previous comment for a simple way to address renewing certificates on changed settings without any complicated hacks or manual comparisons.

@immae
Copy link
Contributor Author

immae commented Apr 16, 2020

See my previous comment for a simple way to address renewing certificates on changed settings without any complicated hacks or manual comparisons.

Ah I just reread it and now I understand it better, sorry. Ok I’ll try to implement that.

@immae
Copy link
Contributor Author

immae commented Apr 17, 2020

See my previous comment for a simple way to address renewing certificates on changed settings without any complicated hacks or manual comparisons.

@emilazy : I implemented that, but I’m not satisfied, because it breaks the reusePrivateKey option. I had a look at the code from lego, and the only way I found to honor this option is to have a /var/lib/acme/.lego/${certname} (so no hash), and a conditional run like this (pseudo bash code):

if (no existing dir); then
  lego run ...
elif (important parameter changed); then
  lego renew --days 0 ...
else
  lego renew --days ${cfg.validMinDays}
fi

The reason being that lego run will always recreate a new private key and all else, regardless of the content of the directory.

If I use a hashed key like you suggested, then it will always be empty at the first run. I can trick and copy the files from the old location, but I feel a scenario like above where I compute the hash to determine the (important parameter changed) and store it in the state directory would be more clean.

Does that sound good to you?

@flokli flokli mentioned this issue Apr 18, 2020
10 tasks
@pvgoran
Copy link
Contributor

pvgoran commented Apr 19, 2020

I opened #85534 which is supposedly relevant.

@arianvp
Copy link
Member

arianvp commented Apr 23, 2020

@immae could you update your bullet list with what is fixed please?

@immae
Copy link
Contributor Author

immae commented Apr 23, 2020

Done @arianvp . I don’t know how to handle the last one yet, I made a sketch of a proposition above but got no answer so I don’t know what to do

@immae
Copy link
Contributor Author

immae commented Apr 23, 2020

(and the reuse key option wasn’t merged finally, maybe I should reopen an issue for that one before it gets lost?)

@pstch
Copy link
Contributor

pstch commented May 16, 2020

@immae There's also another solution for the last issue, though maybe a bit hacky, which would be to use 999 (or even a larger value) for --days when we detect that the list of domain names (and other things, such as the contact address) has changed, either by comparing the command line (stripped of the validMinDays flag, of course), or by reading the list of alternate names from the certificate file. This way we can keep using renew and allow --reuse-key but still be able to force a renewal. I don't know if that's better than your proposed solution.

I think that last issue is a serious one, as it breaks user expectations : I don't remember this happening with simp_le. I have a fixed version of the module on my servers, and I can submit a PR with these changes if it's accepted as a good way to do it.

lego is supposed to implement a combined renew/run command, which should make all of this easier, but I don't know when it will be done.

@immae
Copy link
Contributor Author

immae commented May 16, 2020

@pstch : the standard way of lego to say "infinite" is to put 0 for --days, and it was my suggestion indeed :)

Except that in my implementation I store a hash of the relevant arguments in the .lego folder, and whenever the hash changes it means that the certificate needs to be regenerated. I’m happy with that implementation as far as I am concerned, but it’s just implementation details: the "big" part in the issue is deciding when you need to force renew and when you don’t.

@immae
Copy link
Contributor Author

immae commented May 24, 2020

I wrote a (big) change to acme handling. Its current state is here in my repository: https://git.immae.eu/?p=perso/Immae/Config/Nix.git;a=blob;f=modules/private/certificates.nix;hb=HEAD#l76

It should solve most issues reported here and there (note that contrary to my previous comment the --days 0 flag doesn’t work, I didn’t recheck why but either my memory faults me or it’s an unreleased recent change upstream)

advantages:

  • Any change in domain / keytype will be handled properly and regenerate the certificate (currently lego silently ignores a handle a change of domain or a change of keytype). there will be one account created per key-type (could be improved even more to "only one account" but it requires more workarounds)
  • A single user "acme" does all the work. "root" only comes at the very last step (ExecStartPost) to copy certificates and give proper rights.

issues:

  • Possible race condition if several certificates try to create the "first" account simultaneously (since they share the account folder - per keytype - two concurring instances of lego might end up trying to generate a key simultaneously with bad result. This issue is not new, it’s already the case currently)
  • A lot of business rule happen in the bash script to decide what action to take. That will improve with time if lego gets better at handling those rules (change of domain, change of key type, ...)
  • Definitely incompatible with pre-20.03 or 20.03 or current unstable (that in particular could be fixed though)

I may create PR for some of the issues if you’re happy with that change.

@m1cr0man
Copy link
Contributor

For people tracking this, I've been taking a look at what @immae has done, what emilazy has shared, and my own work to rewrite the module. It's going well, just slowed down during the week by work. I will open a PR this weekend. I've resolved most of the issues immae highlighted above except for the backwards compatibility, although adding a migration script would be possible.

There's a number of design decisions I've made that will be open for discussion on the PR, and some other issues I have thought of in the process of rewriting it.

@m1cr0man m1cr0man mentioned this issue Jun 19, 2020
10 tasks
@m1cr0man
Copy link
Contributor

m1cr0man commented Oct 3, 2020

How are people finding the new module? 🙂 I addressed everything that was mentioned in this issue, and since the merge there's only been a small bit of feedback all of which was external to the module itself (either lego or LE related). I'll give it a month and then close this ticket.

@flokli
Copy link
Contributor

flokli commented Oct 11, 2020

I'd say no news / new bug reports are good news :-)

Thanks a lot for all the work that has been done - I think we're in a wayy better situation than before ❤️

@pstch
Copy link
Contributor

pstch commented Nov 8, 2020

Thanks a lot for your work @m1cr0man ! I've been using the new module for three weeks now, mostly without any problem. I think the changes are really good. However, I've find a (minor) problem : switching to wildcard certificates fails because the old domain names are still requested, and the request fails because although these names are already validated, LE refuses certificate certificate requests with both a wildcard and subdomains.

Nov 08 19:54:49 example acme-example.com-start[32187]: 2020/11/08 19:54:49 acme: error: 400 :: POST :: https://acme-staging-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: Error creating new order :: Domain name "sub.example.com" is redundant with a wildcard domain in the same request. Remove one or the other from the certificate request., url:

Using systemctl clean acme-example.com.service fixed the problem.

@m1cr0man
Copy link
Contributor

m1cr0man commented May 4, 2021

Thanks a lot for your work @m1cr0man ! I've been using the new module for three weeks now, mostly without any problem. I think the changes are really good. However, I've find a (minor) problem : switching to wildcard certificates fails because the old domain names are still requested, and the request fails because although these names are already validated, LE refuses certificate certificate requests with both a wildcard and subdomains.

I believe this issue is solved now as the domainHash is used to check whether the certs generated in the last run match the set of configured domain + extraDomains in the configuration. So long as you strip your configuration to just include the wildcard domain it will do a clean cert request instead of a renewal and succeed.

I'm going to close this issue now. It has been open long enough and I feel everything it raised has been addressed at this point. If something is still missing/broken feel free to pull me up on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: bug Something is broken 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS
Projects
None yet
Development

No branches or pull requests

9 participants