Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: recover from previously failed attempt #291

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

woehrl01
Copy link

@woehrl01 woehrl01 commented Oct 3, 2023

If a previous certificate request failed, it will be currently never retried again.

This adds a configuration max-retry-duration to allow a transition time to recover from transient errors.

The PR includes:

  • New Unittests
  • Fixed existing unittest for cluster wide issuer
  • go fmt of everything

@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sgtcodfish for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woehrl01 woehrl01 force-pushed the previous_failed branch 2 times, most recently from fb32f46 to 28e880e Compare October 3, 2023 17:27
@woehrl01 woehrl01 changed the title fix: recover from previously failed attempt feat: recover from previously failed attempt Oct 3, 2023
@woehrl01
Copy link
Author

woehrl01 commented Oct 3, 2023

Looking at the implementation at: https://github.com/cert-manager/issuer-lib/blob/main/controllers/request_controller.go#L306 I can see that it should be handled differently. I'm replacing the PR with an approach like that: It records the errors as intermediate errors including a maxretryduration.

@woehrl01
Copy link
Author

woehrl01 commented Oct 5, 2023

PR is now updated to reflect a similar behaviour as in the https://github.com/cert-manager/issuer-lib, by allowing a grace period of retries.

@github-actions
Copy link

github-actions bot commented Oct 5, 2023

Detected different CRDs in the config/crd/bases directory and charts/aws-pca-issuer/crds directory.
Since both CRDs were modified in this commit(s), they were unable to be automatically synced. Please update the pull request with identical CRDs for this workflow to pass.

@dcamzn
Copy link

dcamzn commented Oct 6, 2023

@woehrl01 cert-manager should retry failed issuance per their documentation (https://cert-manager.io/docs/faq/). Do you know if cert-manager is retrying? If it's not, any idea why it's not working as intended?

@woehrl01
Copy link
Author

woehrl01 commented Oct 6, 2023

@dcamzn No I don't know. According to the docs it's only retrying if it's not in a terminal state (marked as failed). I also looked through the source code and it looks like that it only retries a terminal state if you delete the certificaterequest (that's also a recommendation according to the faq). In my case we only have a window of 32h until the certificate is invalid and it had never retried/recreated a certificaterequest once during that time, after it failed.

Even an active "renew" command via cli isn't creating a new certificaterequest 48h after terminal failure.

It could be because we never set the condition Issuing to false. But Ready to false

@dcamzn
Copy link

dcamzn commented Oct 6, 2023

@woehrl01 thank you for that info. we're going to keep looking into this. We'll take a look at the Issuing and Ready conditions that you called out. Thank you!

@woehrl01
Copy link
Author

woehrl01 commented Oct 6, 2023

@dcamzn thanks. One tip to reproduce the issue quite easily:

  • delete the issuer
  • request a certificate (it will fail)
  • create the issuer

Expected scenario is that it would retry issuing the certificate afterwards automatically/eventually.

My PR will basically just allow this for a configurable grace period. We already use that change successful in production.

@dcamzn
Copy link

dcamzn commented Oct 6, 2023

thanks @woehrl01 on steps to reproduce the issue. you read our minds!

@dcamzn
Copy link

dcamzn commented Oct 6, 2023

If it fails at the point of the certificaterequest, we think the right behavior is for the certificaterequest to be deleted by the cert-manager controller and recreated by the cert-manager controller, so that all issuers get fixed.

we'll circle back after reproducing the issue.

@woehrl01
Copy link
Author

woehrl01 commented Oct 7, 2023

@dcamzn great, thank you! Let me share the following docs about the certificaterequest which states that it's not retrying the issuing after failure: https://cert-manager.io/docs/concepts/certificaterequest/

So I guess the Faq is maybe not correct.

@woehrl01
Copy link
Author

woehrl01 commented Oct 9, 2023

Did some additional research, looks like that any of those conditions do not match, and the CertificateRequest is not getting deleted:

https://github.com/cert-manager/cert-manager/blob/c6ff0136d2973a4aa14c6c08f86c0e254b1f79c6/pkg/controller/certificates/requestmanager/requestmanager_controller.go#L222-L261

@divyansh-gupta
Copy link
Contributor

I was able to reproduce and cut this SIM to cert-manager: cert-manager/cert-manager#6405.

I saw that one of the times I reproduced, if I waited long enough, it did retry exactly once. I was not able to get that behavior consistently.

@divyansh-gupta
Copy link
Contributor

divyansh-gupta commented Oct 12, 2023

I was able to get retries to work, see cert-manager/cert-manager#6405 for details. It seems the retries just take longer between each try than they did before, which was a change in cert-manager 1.8.

@woehrl01
Copy link
Author

Thank you @divyansh-gupta for investigating, I also have read through the PR, but this is not what I see.

I would like to share the exact logs which shows that cert-manager is retrying, but it's not creating a new CertificateRequest resource.

Bildschirmfoto 2023-10-15 um 13 52 53

The "failed to retrieve provisioner" is the error returned by aws-privateca-issuer

@woehrl01
Copy link
Author

@dcamzn @divyansh-gupta what are your thought on this PR, after syncing with the cert-manager team?

@cunningr
Copy link

We are also hitting a very similar issue. If you create the cert and the AWSPCAIssuer at the same time, the Certificate request happens before the AWSPCAIssuer is ready. Although cert-manager seems to retry after 1h, it is never successful in creating the certificate.

@divyansh-gupta
Copy link
Contributor

divyansh-gupta commented Oct 27, 2023

Hi @woehrl01 - let me share with your my results when replicating the exact situation you described:

  1. Create the certificate with NO issuer in place and describe it:
[ec2-user@ip-172-31-45-200 ~]$ kubectl describe Certificate
Name:         pca-issuer-rsa-cert-retry-blah-1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2023-10-26T20:46:37Z
  Generation:          1
  Resource Version:    1364963
  UID:                 65949e92-9dd4-47c5-885d-d654d4ffb963
Spec:
  Dns Names:
    issuer-rsa-example.com
  Issuer Ref:
    Group:  awspca.cert-manager.io
    Kind:   AWSPCAIssuer
    Name:   pca-issuer-rsa-2
  Private Key:
    Algorithm:  RSA
    Size:       2048
  Secret Name:  pca-issuer-rsa-cert-tls-retry-blah-1
  Subject:
    Organizations:
      aws
Status:
  Conditions:
    Last Transition Time:    2023-10-26T20:46:37Z
    Message:                 Issuing certificate as Secret does not exist
    Observed Generation:     1
    Reason:                  DoesNotExist
    Status:                  False
    Type:                    Ready
    Last Transition Time:    2023-10-26T20:46:37Z
    Message:                 The certificate request has failed to complete and will be retried: issuer could not be found
    Observed Generation:     1
    Reason:                  Failed
    Status:                  False
    Type:                    Issuing
  Failed Issuance Attempts:  1
  Last Failure Time:         2023-10-26T20:46:37Z
Events:
  Type     Reason     Age   From                                       Message
  ----     ------     ----  ----                                       -------
  Normal   Issuing    27s   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal   Generated  27s   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "pca-issuer-rsa-cert-retry-blah-1-nfkg7"
  Normal   Requested  27s   cert-manager-certificates-request-manager  Created new CertificateRequest resource "pca-issuer-rsa-cert-retry-blah-1-1"
  Warning  Failed     27s   cert-manager-certificates-issuing          The certificate request has failed to complete and will be retried: issuer could not be found
  1. Check the CertificateRequest immediately and see that it failed because "issuer could not be found":
[ec2-user@ip-172-31-45-200 ~]$ kubectl describe CertificateRequest
Name:         pca-issuer-rsa-cert-retry-blah-1-1
Namespace:    default
Labels:       <none>
Annotations:  cert-manager.io/certificate-name: pca-issuer-rsa-cert-retry-blah-1
              cert-manager.io/certificate-revision: 1
              cert-manager.io/private-key-secret-name: pca-issuer-rsa-cert-retry-blah-1-nfkg7
API Version:  cert-manager.io/v1
Kind:         CertificateRequest
Metadata:
  Creation Timestamp:  2023-10-26T20:46:37Z
  Generation:          1
  Owner References:
    API Version:           cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Certificate
    Name:                  pca-issuer-rsa-cert-retry-blah-1
    UID:                   65949e92-9dd4-47c5-885d-d654d4ffb963
  Resource Version:        1364959
  UID:                     58a242be-e3ae-488e-81bc-1c20289c4838
Spec:
  Extra:
    authentication.kubernetes.io/pod-name:
      cert-manager-75d57c8d4b-j5f6s
    authentication.kubernetes.io/pod-uid:
      fbc686ee-060a-4215-acbc-39fab0993079
  Groups:
    system:serviceaccounts
    system:serviceaccounts:cert-manager
    system:authenticated
  Issuer Ref:
    Group:   awspca.cert-manager.io
    Kind:    AWSPCAIssuer
    Name:    pca-issuer-rsa-2
  Request:   LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ2xEQ0NBWHdDQVFBd0RqRU1NQW9HQTFVRUNoTURZWGR6TUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQwpBUThBTUlJQkNnS0NBUUVBMit5MGcrQzRMdzhwK1VqUlpPSW8zZ0lUQ1BHQ1RHRVBKeUpsL1FQNDdrdHJuZld4ClFyTTMrSUFRdkY4dkNSSTI2dlVBWXo0bnJNeUppbXpsK2FHQ2JPeE8wbmpaRXhjM0RnYUFRYkNHN1ZYQ3NXUGEKUnp6NUpDUCtPSmdFT3MzT1drT0lWdXJ5UE95TEpnVmdoUHR2SURTcGJNY0ozQzNKVmtoSkJQNTVVdkdPeFBrWgpDRUhpaXlsbVdXNEVzMnYrUjRlOGtxUW5kaWJLa1NtNk4wR2g0Ull1N1ppcGpPK1A0LzNYTkNZWFN4Z3JISVRPCjg5TVNYdFpCbENhV1FmR3doZnFCZnJnc1FCdDhwa2w1OVlSTTRadW41aEJUd0IzRFYwUGVGdm80VEZBcWFTbzgKZXU0MEJpMXJMUmszejFVMnBtaWVScWtweVA1aGZzVGVkNEZ2ZlFJREFRQUJvRUV3UHdZSktvWklodmNOQVFrTwpNVEl3TURBaEJnTlZIUkVFR2pBWWdoWnBjM04xWlhJdGNuTmhMV1Y0WVcxd2JHVXVZMjl0TUFzR0ExVWREd1FFCkF3SUZvREFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBbVJwd2Fjbm9naFRVZ2Izc0xJY0FtMmttQXNTRWU0N1AKeE44cmtHRnY4YUdzSmpJaFYyb09QdjkrWkIzNmlEcy9KVmRCaGFwZlpvNTBBNW9DK2hNRzUrS2FlUUlrTW1pVApTUjhGVFVSak1IUXFQSCttenlqNVoxNXErYUhERk1NcXRxeml4WlFDSnJ4V09rbFVIK3RPelR5TytFTmg4bW0vCit0SHFqdnY3clA4eWlKY1JlUVlSMVErQndGTCs0b0NJcjFnYlB5bmNuVjBsUkhCK0NpOFVHWHN4bm1IOG9xa1YKdTkyaU5rUHZ0dTk0NGNqRTVva1hWd1NHc0dDZ1ZBeGNZd0NnbEpNaW9ESmd6NzRBRW5uMlVGcnZXelVpM2lMdApRQUxuOFVWeXYweUF0Tjhsc3V4aGlCQUlJVzUyNENHaCt6MGxsRzVGSWVpZ1E0a1lxc1BueGc9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K
  UID:       f5c14e33-830f-420e-86c1-4d6bce4197ef
  Username:  system:serviceaccount:cert-manager:cert-manager
Status:
  Conditions:
    Last Transition Time:  2023-10-26T20:46:37Z
    Message:               Certificate request has been approved by cert-manager.io
    Reason:                cert-manager.io
    Status:                True
    Type:                  Approved
    Last Transition Time:  2023-10-26T20:46:37Z
    Message:               issuer could not be found
    Reason:                Failed
    Status:                False
    Type:                  Ready
Events:
  Type     Reason           Age   From                                       Message
  ----     ------           ----  ----                                       -------
  Normal   cert-manager.io  87s   cert-manager-certificaterequests-approver  Certificate request has been approved by cert-manager.io
  Warning  Failed           87s   awspcaissuer-controller                    issuer could not be found
  1. Deploy the issuer and check its Ready, which it is:
[ec2-user@ip-172-31-45-200 ~]$ kubectl describe AWSPCAIssuer
Name:         pca-issuer-rsa-2
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  awspca.cert-manager.io/v1beta1
Kind:         AWSPCAIssuer
Metadata:
  Creation Timestamp:  2023-10-26T20:48:42Z
  Generation:          1
  Resource Version:    1365212
  UID:                 cb4cb08a-e0c3-4ae5-b7f3-9925ed49428b
Spec:
  Arn:     arn:aws:acm-pca:us-east-1:444007060692:certificate-authority/da6896c0-cd44-41e9-b436-c291c2018b16
  Region:  us-east-1
Status:
  Conditions:
    Last Transition Time:  2023-10-26T20:48:46Z
    Message:               Issuer verified
    Reason:                Verified
    Status:                True
    Type:                  Ready
Events:
  Type    Reason    Age              From                     Message
  ----    ------    ----             ----                     -------
  Normal  Verified  1s (x2 over 4s)  awspcaissuer-controller  Issuer verified
  1. Wait until the 1 hour mark so retry happens

  2. See the certificate failed the second retry -

[ec2-user@ip-172-31-45-200 ~]$ kubectl describe Certificate
Name:         pca-issuer-rsa-cert-retry-blah-1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2023-10-26T20:46:37Z
  Generation:          1
  Resource Version:    1371932
  UID:                 65949e92-9dd4-47c5-885d-d654d4ffb963
Spec:
  Dns Names:
    issuer-rsa-example.com
  Issuer Ref:
    Group:  awspca.cert-manager.io
    Kind:   AWSPCAIssuer
    Name:   pca-issuer-rsa-2
  Private Key:
    Algorithm:  RSA
    Size:       2048
  Secret Name:  pca-issuer-rsa-cert-tls-retry-blah-1
  Subject:
    Organizations:
      aws
Status:
  Conditions:
    Last Transition Time:    2023-10-26T20:46:37Z
    Message:                 Issuing certificate as Secret does not exist
    Observed Generation:     1
    Reason:                  DoesNotExist
    Status:                  False
    Type:                    Ready
    Last Transition Time:    2023-10-26T21:46:37Z
    Message:                 The certificate request has failed to complete and will be retried: issuer could not be found
    Observed Generation:     1
    Reason:                  Failed
    Status:                  False
    Type:                    Issuing
  Failed Issuance Attempts:  2
  Last Failure Time:         2023-10-26T21:46:37Z
Events:
  Type     Reason     Age                From                                       Message
  ----     ------     ----               ----                                       -------
  Normal   Issuing    22m (x2 over 82m)  cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal   Requested  22m (x2 over 82m)  cert-manager-certificates-request-manager  Created new CertificateRequest resource "pca-issuer-rsa-cert-retry-blah-1-1"
  Warning  Failed     22m (x2 over 82m)  cert-manager-certificates-issuing          The certificate request has failed to complete and will be retried: issuer could not be found
  Normal   Generated  22m                cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "pca-issuer-rsa-cert-retry-blah-1-vfczn"

but see the CertificateRequest shows the cert issued anyway! And I checked, the cert exists and was issued -

[ec2-user@ip-172-31-45-200 ~]$ kubectl describe CertificateRequest
Name:         pca-issuer-rsa-cert-retry-blah-1-1
Namespace:    default
Labels:       <none>
Annotations:  cert-manager.io/certificate-name: pca-issuer-rsa-cert-retry-blah-1
              cert-manager.io/certificate-revision: 1
              cert-manager.io/private-key-secret-name: pca-issuer-rsa-cert-retry-blah-1-vfczn
API Version:  cert-manager.io/v1
Kind:         CertificateRequest
Metadata:
  Creation Timestamp:  2023-10-26T21:46:37Z
  Generation:          1
  Owner References:
    API Version:           cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Certificate
    Name:                  pca-issuer-rsa-cert-retry-blah-1
    UID:                   65949e92-9dd4-47c5-885d-d654d4ffb963
  Resource Version:        1371939
  UID:                     a416b3e9-589a-4f53-9b53-78498fdaf6cb
Spec:
  Extra:
    authentication.kubernetes.io/pod-name:
      cert-manager-75d57c8d4b-j5f6s
    authentication.kubernetes.io/pod-uid:
      fbc686ee-060a-4215-acbc-39fab0993079
  Groups:
    system:serviceaccounts
    system:serviceaccounts:cert-manager
    system:authenticated
  Issuer Ref:
    Group:   awspca.cert-manager.io
    Kind:    AWSPCAIssuer
    Name:    pca-issuer-rsa-2
  Request:   LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ2xEQ0NBWHdDQVFBd0RqRU1NQW9HQTFVRUNoTURZWGR6TUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQwpBUThBTUlJQkNnS0NBUUVBMWVPb1VYbG9ZdVk0SXFKK252UXRiRUpDcFdkWk1GMHY3aTh1UDN3VG9DRlpvODRICmZoWTM4YXFvaHllY2NOZUZxdjRpeEV3Nm92SHY2ZUJYTTNtN2RCT3RsajdtblV4Rk5OSjdEdW05N0R0bVRURkUKSGZTTk9ES2d0L2ZsUktYS1FBUUNqTlhxSTRtZ2pEOHNoVlVEWHNhSjVEQVhqOUw2U3p5QWdzMXpldUJ2WGJuZApmMW1rRXFhMm4vZVA2SU5qRTJsY1V5dmdveGp5K0Myb2cramU3Y0FTZEM0SjYrdDFyUHA3T1FORUNaRG9jaDIyClBrZ3BXbHBNOXZXT0tvZjVVemFBMERuVEs3bnBFdDR5UHptcU5QWmwvK0VMOE45NFZCZEErb2dtdjVjNHVBcHcKRVFYbzFNeUFBTnd6RC85TmxHQTlvNWJnOVJwQzYxU0N5VUI2c3dJREFRQUJvRUV3UHdZSktvWklodmNOQVFrTwpNVEl3TURBaEJnTlZIUkVFR2pBWWdoWnBjM04xWlhJdGNuTmhMV1Y0WVcxd2JHVXVZMjl0TUFzR0ExVWREd1FFCkF3SUZvREFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBeXdza2FIZExRWWI3cFdiMXYvNDREd2JXMFp4Q2cyb1kKbmpSSk5oZ3VEVW43V3F3TlZDOXVnTWlWa0xvTmY3RDVmZ3dONFVNcWliR0NuYTY2WDlHeFJUS1pCNE5YTFE3Sgo4ZnV1cm1IaXF1Vm4vU0ZIbVRqL3NnMDFBVGJRWU9pWmxJUnhaS0oycHdzTjZwT3FNclU3NUplOFhZSEEyaTYvCkw0UGRSZHNVSkU4Z2doL0swb25mcEx6YUhBVmNPL25uS3M0NUFSVkl6MW84UVR3N0EzbnYrZ2RCZDc2YitzQWgKS1JZZUx1a0toMWYxSVBTblZqZmV5dGwwdFVJWktlQkcyWkN0emNuaVdlNXlhZ2xCYlB6aDJ2d3NVcDB1V1dobwo2ODg5d2VGUkE1R0Z6V3ZFWW1aK1VtUE5XZWNQRzNlMlI2blZoOVFTV2VseTNtbDdSZ0xWYXc9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K
  UID:       f5c14e33-830f-420e-86c1-4d6bce4197ef
  Username:  system:serviceaccount:cert-manager:cert-manager
Status:
  Ca:           LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURZekNDQWt1Z0F3SUJBZ0lSQUxKdllWT3V4UFFRTXp3Q3Y5ejNRT2N3RFFZSktvWklodmNOQVFFTEJRQXcKU3pFWE1CVUdBMVVFQ2d3T1FWZFRUMjVCYVhKVWQybDBZMmd4RnpBVkJnTlZCQXNNRGtGWFUwOXVRV2x5VkhkcApkR05vTVJjd0ZRWURWUVFEREE1QlYxTlBia0ZwY2xSM2FYUmphREFlRncweU16QTVNVEl4TnpNMk1qaGFGdzB6Ck16QTVNVEl4T0RNMk1qUmFNRXN4RnpBVkJnTlZCQW9NRGtGWFUwOXVRV2x5VkhkcGRHTm9NUmN3RlFZRFZRUUwKREE1QlYxTlBia0ZwY2xSM2FYUmphREVYTUJVR0ExVUVBd3dPUVZkVFQyNUJhWEpVZDJsMFkyZ3dnZ0VpTUEwRwpDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRQ3g0WmFHSUZscjR3ck5SUzhIRDEwWWRhblM3TDhWCjBEb3dieDQyQVpPUkZmd2hpRGZybml2b3ErZDJaWkZrdGxGSTVqdzhRN2I4dkppWnZQdklYNzRyQjJmanZjUjEKZ2NtL1NFSGMwUWRmWWxnT3FycE0yQjI2ZEk0YXpkRTBLODBXbUxXRTA3MDkrTks5UWE5Tkg1bk82ZU1nT2haRQphN1FldHBJUmUxcmlzc0ZqQktOY2prSkwzOUl5Z1dqVnRjeXV2Y2dZVlNnaWJOMjRrQkl0WXFkWTlCSXRwWUNDCmxSWjlhWWpkOEp4SXFQMzFIUWJmSVNhREdoaUNuM2pyWnVwdGp4M29oeUlDV3BhWmVDUnRqb2dGMkZzRkI2dFIKYWNTZTM4YVlBZDg3S1ZRNUsxcUROTU01NitSUlRZeFJURmIyaTVqWUMzSXkwL1Q5aEl6ZFlMbnhBZ01CQUFHagpRakJBTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFRkV6WXdZSDI2UjR5UVBWL2QrcDcvYWdCCm1ZWnpNQTRHQTFVZER3RUIvd1FFQXdJQmhqQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUFiK2dSSnNaaW5GY1oKaWVQLzVWSXpBYzhVb0xmVmZrMzcyQkI4ZnZpRTluMDlxbEk2MU44ckhiMUJBeUQzVjNYTE4xSGRlVW90clFhZgpTcjNaVnJkZEFPbDlPaHJFQTBEWjlrYUZNL2p2ZXlIelJQTk1DYSttNnZvaGQycFk1RDUwOWFLT09lc3FmeERrClZJR1ZycXV3OWZhOXJTSWhMcDNaRkJaRTBXbDFPV3F3QzNjMGsxTDRtdHpDeGJ1eFVJQUJZZ0RWRU41M0RlNTUKdXpqUmlJWkE4TkFhb0RPajdMdDdKQmVFSCs4N0JZRmlCWjNMTWN3M1VOaG9UclpJUGs3NnlZVE5YeHRkdlgyOAoxTkZKYVdmSWRpRVJidEJlODYvbDhxVld0SU56RVYxdEhHZjVOSXY1cUJSUFlyQlptd09IdVNXUWFsZitDRUhLCnRORUVQSzBQc2c9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  Certificate:  LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURZRENDQWtpZ0F3SUJBZ0lRRW52TmJrZ2dVV1FRU1Uxb2djKzZ2VEFOQmdrcWhraUc5dzBCQVFzRkFEQkwKTVJjd0ZRWURWUVFLREE1QlYxTlBia0ZwY2xSM2FYUmphREVYTUJVR0ExVUVDd3dPUVZkVFQyNUJhWEpVZDJsMApZMmd4RnpBVkJnTlZCQU1NRGtGWFUwOXVRV2x5VkhkcGRHTm9NQjRYRFRJek1UQXlOakl3TkRZek4xb1hEVEl6Ck1URXlOVEl4TkRZek4xb3dEakVNTUFvR0ExVUVDaE1EWVhkek1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0MKQVE4QU1JSUJDZ0tDQVFFQTFlT29VWGxvWXVZNElxSitudlF0YkVKQ3BXZFpNRjB2N2k4dVAzd1RvQ0Zabzg0SApmaFkzOGFxb2h5ZWNjTmVGcXY0aXhFdzZvdkh2NmVCWE0zbTdkQk90bGo3bW5VeEZOTko3RHVtOTdEdG1UVEZFCkhmU05PREtndC9mbFJLWEtRQVFDak5YcUk0bWdqRDhzaFZVRFhzYUo1REFYajlMNlN6eUFnczF6ZXVCdlhibmQKZjFta0VxYTJuL2VQNklOakUybGNVeXZnb3hqeStDMm9nK2plN2NBU2RDNEo2K3QxclBwN09RTkVDWkRvY2gyMgpQa2dwV2xwTTl2V09Lb2Y1VXphQTBEblRLN25wRXQ0eVB6bXFOUFpsLytFTDhOOTRWQmRBK29nbXY1YzR1QXB3CkVRWG8xTXlBQU53ekQvOU5sR0E5bzViZzlScEM2MVNDeVVCNnN3SURBUUFCbzMwd2V6QWhCZ05WSFJFRUdqQVkKZ2hacGMzTjFaWEl0Y25OaExXVjRZVzF3YkdVdVkyOXRNQWtHQTFVZEV3UUNNQUF3SHdZRFZSMGpCQmd3Rm9BVQpUTmpCZ2ZicEhqSkE5WDkzNm52OXFBR1pobk13SFFZRFZSME9CQllFRkJHcHNSTjF4dEdiZFlYTnYvWm15RGNmCnBhTW5NQXNHQTFVZER3UUVBd0lGb0RBTkJna3Foa2lHOXcwQkFRc0ZBQU9DQVFFQUZPWGFORGhyNHZSZHQ4NnQKdlpCaGgvbXpXOVZPekRtTWpPYmdIenlOdGl6eTZIalhqVWprZVZWTWFHeEQxaHZzUmwxbTFocTZXaTI2RHpuSgpLbm9xUnJWUnFsVExMSE5ocWRxWXc4U3lRVExmL1VZVmFjNklqTWIvVVJ4eDYwVjBGbEJmUGpaTVFOOVc3Z3lpCklDeG5aaytiMzNNN1AxWitTWHNvU2hwek5TKzVSdGxubXRYYVZUdjkrM0plYkNCRlVVSzRoMUZnalFpOE5xRkMKVGJ6bkhoM1BDUGR6N3BVZWp6dHo2MnBOc1V3Q1hXOE5IQlRLZW01U1MvWEpLbDluU2x5Wm5HSWdrc2ZpSGxiWgptL2NxUnR2Wlk5amRKeHNYcVhKRHFzUWp5bGtqTlN6TTJSMTF2T3EvWFdwSU1UZFptc2szK2VFZVBJNnlIT3ZiClo1aTF6UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  Conditions:
    Last Transition Time:  2023-10-26T21:46:37Z
    Message:               Certificate request has been approved by cert-manager.io
    Reason:                cert-manager.io
    Status:                True
    Type:                  Approved
    Last Transition Time:  2023-10-26T21:46:40Z
    Message:               certificate issued
    Reason:                Issued
    Status:                True
    Type:                  Ready
Events:
  Type    Reason           Age   From                                       Message
  ----    ------           ----  ----                                       -------
  Normal  cert-manager.io  21m   cert-manager-certificaterequests-approver  Certificate request has been approved by cert-manager.io
  Normal  Issued           21m   awspcaissuer-controller                    certificate issued
  1. Wait for the next retry in three hours and you'll see on the second retry that the Certificate object and the CertificateRequest object both show a successful issuance, grabbed the output here:
[ec2-user@ip-172-31-45-200 ~]$ kubectl describe Certificate
Name:         pca-issuer-rsa-cert-retry-blah-1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2023-10-26T20:46:37Z
  Generation:          1
  Resource Version:    1385854
  UID:                 65949e92-9dd4-47c5-885d-d654d4ffb963
Spec:
  Dns Names:
    issuer-rsa-example.com
  Issuer Ref:
    Group:  awspca.cert-manager.io
    Kind:   AWSPCAIssuer
    Name:   pca-issuer-rsa-2
  Private Key:
    Algorithm:  RSA
    Size:       2048
  Secret Name:  pca-issuer-rsa-cert-tls-retry-blah-1
  Subject:
    Organizations:
      aws
Status:
  Conditions:
    Last Transition Time:  2023-10-26T23:46:40Z
    Message:               Certificate is up to date and has not expired
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2023-11-25T23:46:37Z
  Not Before:              2023-10-26T22:46:37Z
  Renewal Time:            2023-11-15T23:26:37Z
  Revision:                1
Events:                    <none>
[ec2-user@ip-172-31-45-200 ~]$ kubectl describe CertificateRequest
Name:         pca-issuer-rsa-cert-retry-blah-1-1
Namespace:    default
Labels:       <none>
Annotations:  cert-manager.io/certificate-name: pca-issuer-rsa-cert-retry-blah-1
              cert-manager.io/certificate-revision: 1
              cert-manager.io/private-key-secret-name: pca-issuer-rsa-cert-retry-blah-1-p8ssp
API Version:  cert-manager.io/v1
Kind:         CertificateRequest
Metadata:
  Creation Timestamp:  2023-10-26T23:46:37Z
  Generation:          1
  Owner References:
    API Version:           cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Certificate
    Name:                  pca-issuer-rsa-cert-retry-blah-1
    UID:                   65949e92-9dd4-47c5-885d-d654d4ffb963
  Resource Version:        1385848
  UID:                     a72811cd-15e9-4e1f-9302-eeadee8d4bb8
Spec:
  Extra:
    authentication.kubernetes.io/pod-name:
      cert-manager-75d57c8d4b-j5f6s
    authentication.kubernetes.io/pod-uid:
      fbc686ee-060a-4215-acbc-39fab0993079
  Groups:
    system:serviceaccounts
    system:serviceaccounts:cert-manager
    system:authenticated
  Issuer Ref:
    Group:   awspca.cert-manager.io
    Kind:    AWSPCAIssuer
    Name:    pca-issuer-rsa-2
  Request:   LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ2xEQ0NBWHdDQVFBd0RqRU1NQW9HQTFVRUNoTURZWGR6TUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQwpBUThBTUlJQkNnS0NBUUVBM29MaG1BR0hxWlIrUUdZUnNIekpqRTFDRWppR2pqQ3FmcEk0aU9lSTA4T1JDUWJDCklqaCtnM0RtWm5KbUR5MVg5RENzancxWGExa2tVZHVrRzRFYUgzSTdQMFdXOVFMaTFZWHVubFV4cklFTkVaWVUKdGtTT1YweUtvWDMxYW4wcWkyQXRjZUVydFJ2bUF5b0loK2FwKzBRYTRvczlWeitrcGk4Y1ZPSEVMZW5UYXRHaApJUEdwdWpvVFVlZlFKQjRqbG1mUFByekxlNVVDNDlVcUc0dWxERjlEeWp3dUlqdmM0N1U2QzJ5TVJGNjI1eU5EClRsWTl2QlhGV2Z3MkJZZFBhUUdhajluVDBndU5pdmczOEp1WFhQL0x4NGQzeGRDRUo4SVBkOENlcWdkYzNZR2sKRmVYUEVSUmQwSzhGWGVHNnlJd3IwcHdpcWxyL0p4RTlYWVV1MHdJREFRQUJvRUV3UHdZSktvWklodmNOQVFrTwpNVEl3TURBaEJnTlZIUkVFR2pBWWdoWnBjM04xWlhJdGNuTmhMV1Y0WVcxd2JHVXVZMjl0TUFzR0ExVWREd1FFCkF3SUZvREFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBblRGZkNkTzEyQmFGaUkycG84OVhtV0drN3RFSmk2VjIKUVQrQ1k2emVnODVyS2VaZmVDM3BXOXNmNnBRQ3ZOUlZsK096S0xXbU16OW9VbStWYjk2NTVFQlp3UFF4WnlaTQp1Zkd6WnBvQWpYZC95K2tDVVZ1ODNLdm5QVVFYcWUvdW4vRlM3Wm42S2hCS0VwWjl5aU1KRjFIbUdCSHFIV1luCk1KNE9VbGRYelhOa0dLQjZseCsrbzdrN1JXTStsZVJOQzRwbWkycElJck1NMmpqdEdmY1R3ek1PblBKNUxLME0KQjlLSFMwc2hKdlBoTzlHcC8xUldodzFEaTBzT3pOS1Rqc3N4TXQ2Y0o5YUNoVkh5NW1BNnpNaW1JTTlkcnpJdQpIcS9TYnZnK085cm5WUHIrS0k1bG5ZVVBEYndtRVFxN2RvdytTbk9KaUNsNUlvbjdEYzk5dkE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K
  UID:       f5c14e33-830f-420e-86c1-4d6bce4197ef
  Username:  system:serviceaccount:cert-manager:cert-manager
Status:
  Ca:           LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURZekNDQWt1Z0F3SUJBZ0lSQUxKdllWT3V4UFFRTXp3Q3Y5ejNRT2N3RFFZSktvWklodmNOQVFFTEJRQXcKU3pFWE1CVUdBMVVFQ2d3T1FWZFRUMjVCYVhKVWQybDBZMmd4RnpBVkJnTlZCQXNNRGtGWFUwOXVRV2x5VkhkcApkR05vTVJjd0ZRWURWUVFEREE1QlYxTlBia0ZwY2xSM2FYUmphREFlRncweU16QTVNVEl4TnpNMk1qaGFGdzB6Ck16QTVNVEl4T0RNMk1qUmFNRXN4RnpBVkJnTlZCQW9NRGtGWFUwOXVRV2x5VkhkcGRHTm9NUmN3RlFZRFZRUUwKREE1QlYxTlBia0ZwY2xSM2FYUmphREVYTUJVR0ExVUVBd3dPUVZkVFQyNUJhWEpVZDJsMFkyZ3dnZ0VpTUEwRwpDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRQ3g0WmFHSUZscjR3ck5SUzhIRDEwWWRhblM3TDhWCjBEb3dieDQyQVpPUkZmd2hpRGZybml2b3ErZDJaWkZrdGxGSTVqdzhRN2I4dkppWnZQdklYNzRyQjJmanZjUjEKZ2NtL1NFSGMwUWRmWWxnT3FycE0yQjI2ZEk0YXpkRTBLODBXbUxXRTA3MDkrTks5UWE5Tkg1bk82ZU1nT2haRQphN1FldHBJUmUxcmlzc0ZqQktOY2prSkwzOUl5Z1dqVnRjeXV2Y2dZVlNnaWJOMjRrQkl0WXFkWTlCSXRwWUNDCmxSWjlhWWpkOEp4SXFQMzFIUWJmSVNhREdoaUNuM2pyWnVwdGp4M29oeUlDV3BhWmVDUnRqb2dGMkZzRkI2dFIKYWNTZTM4YVlBZDg3S1ZRNUsxcUROTU01NitSUlRZeFJURmIyaTVqWUMzSXkwL1Q5aEl6ZFlMbnhBZ01CQUFHagpRakJBTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFRkV6WXdZSDI2UjR5UVBWL2QrcDcvYWdCCm1ZWnpNQTRHQTFVZER3RUIvd1FFQXdJQmhqQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUFiK2dSSnNaaW5GY1oKaWVQLzVWSXpBYzhVb0xmVmZrMzcyQkI4ZnZpRTluMDlxbEk2MU44ckhiMUJBeUQzVjNYTE4xSGRlVW90clFhZgpTcjNaVnJkZEFPbDlPaHJFQTBEWjlrYUZNL2p2ZXlIelJQTk1DYSttNnZvaGQycFk1RDUwOWFLT09lc3FmeERrClZJR1ZycXV3OWZhOXJTSWhMcDNaRkJaRTBXbDFPV3F3QzNjMGsxTDRtdHpDeGJ1eFVJQUJZZ0RWRU41M0RlNTUKdXpqUmlJWkE4TkFhb0RPajdMdDdKQmVFSCs4N0JZRmlCWjNMTWN3M1VOaG9UclpJUGs3NnlZVE5YeHRkdlgyOAoxTkZKYVdmSWRpRVJidEJlODYvbDhxVld0SU56RVYxdEhHZjVOSXY1cUJSUFlyQlptd09IdVNXUWFsZitDRUhLCnRORUVQSzBQc2c9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  Certificate:  LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURZRENDQWtpZ0F3SUJBZ0lRYlJmUUJFbGhwcTRZUzMwK0RJb2ZLakFOQmdrcWhraUc5dzBCQVFzRkFEQkwKTVJjd0ZRWURWUVFLREE1QlYxTlBia0ZwY2xSM2FYUmphREVYTUJVR0ExVUVDd3dPUVZkVFQyNUJhWEpVZDJsMApZMmd4RnpBVkJnTlZCQU1NRGtGWFUwOXVRV2x5VkhkcGRHTm9NQjRYRFRJek1UQXlOakl5TkRZek4xb1hEVEl6Ck1URXlOVEl6TkRZek4xb3dEakVNTUFvR0ExVUVDaE1EWVhkek1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0MKQVE4QU1JSUJDZ0tDQVFFQTNvTGhtQUdIcVpSK1FHWVJzSHpKakUxQ0VqaUdqakNxZnBJNGlPZUkwOE9SQ1FiQwpJamgrZzNEbVpuSm1EeTFYOURDc2p3MVhhMWtrVWR1a0c0RWFIM0k3UDBXVzlRTGkxWVh1bmxVeHJJRU5FWllVCnRrU09WMHlLb1gzMWFuMHFpMkF0Y2VFcnRSdm1BeW9JaCthcCswUWE0b3M5Vnora3BpOGNWT0hFTGVuVGF0R2gKSVBHcHVqb1RVZWZRSkI0amxtZlBQcnpMZTVVQzQ5VXFHNHVsREY5RHlqd3VJanZjNDdVNkMyeU1SRjYyNXlORApUbFk5dkJYRldmdzJCWWRQYVFHYWo5blQwZ3VOaXZnMzhKdVhYUC9MeDRkM3hkQ0VKOElQZDhDZXFnZGMzWUdrCkZlWFBFUlJkMEs4RlhlRzZ5SXdyMHB3aXFsci9KeEU5WFlVdTB3SURBUUFCbzMwd2V6QWhCZ05WSFJFRUdqQVkKZ2hacGMzTjFaWEl0Y25OaExXVjRZVzF3YkdVdVkyOXRNQWtHQTFVZEV3UUNNQUF3SHdZRFZSMGpCQmd3Rm9BVQpUTmpCZ2ZicEhqSkE5WDkzNm52OXFBR1pobk13SFFZRFZSME9CQllFRlBuS1NUMkFwSnVBb1RZb3lTVzk2MjZkCmpBL1NNQXNHQTFVZER3UUVBd0lGb0RBTkJna3Foa2lHOXcwQkFRc0ZBQU9DQVFFQVNVZEtLcUhvRHM2RnAvR2oKeVcvTGhxcFIzeVh5L1hBc0FKODRGQjZHVkhYcGpKMlBXblhRTHZwVVZrdHp3Yi8wTWtvSGloTWdvQi9vQksxNApHNm5mSkNtaWdoa3lEaTVaZGorN0UzdlVSVW1oYm12bDVXbWN0QitIa1Roa0M2L1NwTldRc0N1aDdGVDVIRXllClFvTlZ2amtyMXk5U0JqbzJTSi9JMXp6aGJvc2lldWhEQnZ3UDUvT3hORWtncFJzTTZ4SVBROTd2ZDZ0RnA5RFAKNUVEZjE5WEJRSjlQSm5Kdk83b2xoYXlGY0t4cEExampoSEFDaE00K2MwUXZ2SXlUVUlVZFJpMmtRZlpBaWZoMwpBWXhPWjNDMEFDMVQ5Qy9FS2RvZEFBTlF6ajU4b1FaUHl1K0ZjRmxKdzIyYVk5bVVZdFA0MWhUbXJ4VjdpTEM5CkE4eU5jdz09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  Conditions:
    Last Transition Time:  2023-10-26T23:46:37Z
    Message:               Certificate request has been approved by cert-manager.io
    Reason:                cert-manager.io
    Status:                True
    Type:                  Approved
    Last Transition Time:  2023-10-26T23:46:40Z
    Message:               certificate issued
    Reason:                Issued
    Status:                True
    Type:                  Ready
Events:                    <none>

To further summarize: When a Certificate is requested before an Issuer is created, the Certificate will fail to issue. When the issuer is created, the Certificate will issue on the next retry (1 hour), but the Certificate object isn't being updated correctly to reflect that even though the CertificateRequest object shows a success. On the next retry (3 hours), the Certificate will also issue again and the Certificate object will also be updated to show that the certificate was issued successfully, so at this point, 2 certificates were issued, and everything is stable.

This is clearly the wrong behavior, and the correct behavior is that on the first retry, when the certificate was issued, the Certificate object should also be updated to reflect the successful issuance. That seems to be the root of the problem that needs to be addressed. I am not yet sure on where to address that issue - it seems that the cert-manager ReadinessController is responsible for taking a CertificateRequest that is successful and updating the Certificate object, so I am looking there: https://github.com/cert-manager/cert-manager/blob/master/pkg/controller/certificates/readiness/readiness_controller.go

@woehrl01
Copy link
Author

@divyansh-gupta thank you for the update. There must be definitely something wrong somewhere. If you look at my event log, you can clearly see that it retried multiple times and never succeed, without deleting the certificaterequest.

Despite I think one hour delay because of a "race condition" during setup is quite long. (in hour case it's not me issue of a deleter issuer but, just that the controller is somehow in a not full ready state (it fails that the controller can't receive the provisioner, even though it did exist the whole time.

@divyansh-gupta
Copy link
Contributor

There seem to be two distinct problems:

  1. Upon a successful retry after an initial failure, it seems that Certificate objects are not always being marked as successfully issued the first time. This is out of the scope of this PR, and may be an issue with cert-manager's ReadinessController. We will leave this issue out of the scope of this discussion for now and will be looked into seperately.

  2. Our plugin is not retrying quickly enough when common "temporary" errors occur. This PR addresses that.

Thanks for finding these issues and more importanly creating a PR to help!

We will be accepting this PR, however there are two things that I'd like to consider as changes in this PR:

  1. We should default MaxRetryDuration to 180. That should allow for 2 retries, considering each retry is set to occur between 1-2 mins.

  2. We should not retry errors returned from the provisioner.sign method. Those should be rare and retry slowly/permanently fail, but I believe this PR marks those errors as temporary.

Let me know what you think of the suggestions above! Thanks for your help again!

@woehrl01
Copy link
Author

@bmsiegel I removed the default value from the helm chart, now it should be compatible with the old image as well.

@woehrl01
Copy link
Author

@bmsiegel looks like the tests are successful now. If you there are no additional changes required from your side I will do the squashing of the changes. Wdyt?

@bmsiegel
Copy link
Contributor

Only thing I'd ask is that we remove everything except the necessary changes for the PR. You can cut new PRs for the other things you did here or we can take care of it. Would make the history a little easier to follow.

@woehrl01
Copy link
Author

What kind of unnecessary changes do you refer to, you would like to have moved to a separate PR?

@bmsiegel
Copy link
Contributor

It looks like the CRD changes are not needed for this, correct me if I'm wrong.

@woehrl01
Copy link
Author

woehrl01 commented Jul 23, 2024

I see, yes they are not. They have been a result on regenerating the manifests from the definition. I can remove that, but suggest doing the change afterwards to make contributions easier.

@bmsiegel
Copy link
Contributor

+1, agreed.

Copy link

Detected different CRDs in the config/crd/bases directory and charts/aws-pca-issuer/crds directory.
Since both CRDs were modified in this commit(s), they were unable to be automatically synced. Please update the pull request with identical CRDs for this workflow to pass.

Copy link

Detected different CRDs in the config/crd/bases directory and charts/aws-pca-issuer/crds directory.
Since both CRDs were modified in this commit(s), they were unable to be automatically synced. Please update the pull request with identical CRDs for this workflow to pass.

@woehrl01 woehrl01 force-pushed the previous_failed branch 3 times, most recently from 1d3d319 to 8242168 Compare October 22, 2024 16:59
Signed-off-by: Lukas Wöhrl <[email protected]>

sync crd

Signed-off-by: Lukas Wöhrl <[email protected]>

add stable itter for tests

Signed-off-by: Lukas Wöhrl <[email protected]>

test: more debug output to find out what's wrong

Signed-off-by: Lukas Wöhrl <[email protected]>

fix: invalid helm default

Signed-off-by: Lukas Wöhrl <[email protected]>

fix: remove default helm value for maxRetryDuration

Signed-off-by: Lukas Wöhrl <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants