-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: recover from previously failed attempt #291
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
1ba286c
to
2fe89e9
Compare
fb32f46
to
28e880e
Compare
Looking at the implementation at: https://github.com/cert-manager/issuer-lib/blob/main/controllers/request_controller.go#L306 I can see that it should be handled differently. I'm replacing the PR with an approach like that: It records the errors as intermediate errors including a maxretryduration. |
28e880e
to
68a6d2f
Compare
PR is now updated to reflect a similar behaviour as in the https://github.com/cert-manager/issuer-lib, by allowing a grace period of retries. |
68a6d2f
to
73c759f
Compare
Detected different CRDs in the |
@woehrl01 cert-manager should retry failed issuance per their documentation (https://cert-manager.io/docs/faq/). Do you know if cert-manager is retrying? If it's not, any idea why it's not working as intended? |
@dcamzn No I don't know. According to the docs it's only retrying if it's not in a terminal state (marked as failed). I also looked through the source code and it looks like that it only retries a terminal state if you delete the certificaterequest (that's also a recommendation according to the faq). In my case we only have a window of 32h until the certificate is invalid and it had never retried/recreated a certificaterequest once during that time, after it failed. Even an active "renew" command via cli isn't creating a new certificaterequest 48h after terminal failure. It could be because we never set the condition |
@woehrl01 thank you for that info. we're going to keep looking into this. We'll take a look at the Issuing and Ready conditions that you called out. Thank you! |
@dcamzn thanks. One tip to reproduce the issue quite easily:
Expected scenario is that it would retry issuing the certificate afterwards automatically/eventually. My PR will basically just allow this for a configurable grace period. We already use that change successful in production. |
thanks @woehrl01 on steps to reproduce the issue. you read our minds! |
If it fails at the point of the certificaterequest, we think the right behavior is for the certificaterequest to be deleted by the cert-manager controller and recreated by the cert-manager controller, so that all issuers get fixed. we'll circle back after reproducing the issue. |
@dcamzn great, thank you! Let me share the following docs about the certificaterequest which states that it's not retrying the issuing after failure: https://cert-manager.io/docs/concepts/certificaterequest/ So I guess the Faq is maybe not correct. |
Did some additional research, looks like that any of those conditions do not match, and the |
I was able to reproduce and cut this SIM to cert-manager: cert-manager/cert-manager#6405. I saw that one of the times I reproduced, if I waited long enough, it did retry exactly once. I was not able to get that behavior consistently. |
I was able to get retries to work, see cert-manager/cert-manager#6405 for details. It seems the retries just take longer between each try than they did before, which was a change in cert-manager 1.8. |
Thank you @divyansh-gupta for investigating, I also have read through the PR, but this is not what I see. I would like to share the exact logs which shows that cert-manager is retrying, but it's not creating a new The "failed to retrieve provisioner" is the error returned by aws-privateca-issuer |
@dcamzn @divyansh-gupta what are your thought on this PR, after syncing with the cert-manager team? |
We are also hitting a very similar issue. If you create the cert and the AWSPCAIssuer at the same time, the Certificate request happens before the AWSPCAIssuer is ready. Although cert-manager seems to retry after 1h, it is never successful in creating the certificate. |
Hi @woehrl01 - let me share with your my results when replicating the exact situation you described:
but see the CertificateRequest shows the cert issued anyway! And I checked, the cert exists and was issued -
To further summarize: When a Certificate is requested before an Issuer is created, the Certificate will fail to issue. When the issuer is created, the Certificate will issue on the next retry (1 hour), but the Certificate object isn't being updated correctly to reflect that even though the CertificateRequest object shows a success. On the next retry (3 hours), the Certificate will also issue again and the Certificate object will also be updated to show that the certificate was issued successfully, so at this point, 2 certificates were issued, and everything is stable. This is clearly the wrong behavior, and the correct behavior is that on the first retry, when the certificate was issued, the Certificate object should also be updated to reflect the successful issuance. That seems to be the root of the problem that needs to be addressed. I am not yet sure on where to address that issue - it seems that the cert-manager ReadinessController is responsible for taking a CertificateRequest that is successful and updating the Certificate object, so I am looking there: https://github.com/cert-manager/cert-manager/blob/master/pkg/controller/certificates/readiness/readiness_controller.go |
@divyansh-gupta thank you for the update. There must be definitely something wrong somewhere. If you look at my event log, you can clearly see that it retried multiple times and never succeed, without deleting the certificaterequest. Despite I think one hour delay because of a "race condition" during setup is quite long. (in hour case it's not me issue of a deleter issuer but, just that the controller is somehow in a not full ready state (it fails that the controller can't receive the provisioner, even though it did exist the whole time. |
There seem to be two distinct problems:
Thanks for finding these issues and more importanly creating a PR to help! We will be accepting this PR, however there are two things that I'd like to consider as changes in this PR:
Let me know what you think of the suggestions above! Thanks for your help again! |
0201a52
to
88049e0
Compare
@bmsiegel I removed the default value from the helm chart, now it should be compatible with the old image as well. |
@bmsiegel looks like the tests are successful now. If you there are no additional changes required from your side I will do the squashing of the changes. Wdyt? |
Only thing I'd ask is that we remove everything except the necessary changes for the PR. You can cut new PRs for the other things you did here or we can take care of it. Would make the history a little easier to follow. |
What kind of unnecessary changes do you refer to, you would like to have moved to a separate PR? |
It looks like the CRD changes are not needed for this, correct me if I'm wrong. |
I see, yes they are not. They have been a result on regenerating the manifests from the definition. I can remove that, but suggest doing the change afterwards to make contributions easier. |
+1, agreed. |
88049e0
to
0a5f398
Compare
0a5f398
to
be6ffd0
Compare
Detected different CRDs in the |
be6ffd0
to
6bcfbb7
Compare
Detected different CRDs in the |
1d3d319
to
8242168
Compare
Signed-off-by: Lukas Wöhrl <[email protected]> sync crd Signed-off-by: Lukas Wöhrl <[email protected]> add stable itter for tests Signed-off-by: Lukas Wöhrl <[email protected]> test: more debug output to find out what's wrong Signed-off-by: Lukas Wöhrl <[email protected]> fix: invalid helm default Signed-off-by: Lukas Wöhrl <[email protected]> fix: remove default helm value for maxRetryDuration Signed-off-by: Lukas Wöhrl <[email protected]>
8242168
to
edb1182
Compare
If a previous certificate request failed, it will be currently never retried again.
This adds a configuration
max-retry-duration
to allow a transition time to recover from transient errors.The PR includes: