Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controller: PendingError isn't actually an error #59

Merged
merged 3 commits into from
Nov 16, 2023

Conversation

maelvls
Copy link
Member

@maelvls maelvls commented Sep 11, 2023

I dug a bit more into why users of the issuer-lib wouldn't see the recovery of errors by looking at the logs with the default logging level. For example, the logs would show this:

{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: WebSDK CertRequest Module Requested Certificate"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: WebSDK CertRequest Module Requested Certificate"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Prepares the Certificate Object for lifecycle processing"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Not yet available"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Post CSR"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Order Pending"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Order Pending"}
{"level":"error","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Order Pending"}

After inspecting the CertificateRequest (CR), the user of an implementation of issuer-lib would realize that the CR has succeeded since then but the logs don't reflect that. They would have expected a positive message saying that the controller has recovered:

{"level":"info","msg":"Retryable CertificateRequest error.","error":"certificate still in a pending state, the request will be retried: Order Pending"}

Currently, users have to use the logr level 1 (referred as "debug" in the README) in order to see the success message, but the above "Retryable CertificateRequest error" message shows at logr level 0.

Note that the README is currently incorrect: it should say "Succeeded signing the CertificateRequest is logged at the level 1 (debug)":

Screenshot 2023-09-12 at 10 44 02

I think the main reason for this mismatch is because we confused ourselves with PendingError. The type is error, it has "error" in its name, but it isn't an error after all. By mistake, we (the issuer-lib developers) ended up logging PendingError as an error i.e. at level 0 with log.V(1).Error (Error ignores V(1) entirely, that's why I say that we log it at level 0). These pending error messages get printed regardless of the logging level.

What we really wanted is to only show these messages in "debug" mode, similarly as to how we currently log the "Succeeded signing certificate" messages at logr level 1. That way, the logging experience is consistent: someone debugging the logs will be able to see, in debug mode, the progression of the signing of the certificate as well as when it succeeded.

I also propose to change the message associated with the PendingError case. It isn't a failure since PendingError isn't an error.

@jetstack-bot jetstack-bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 11, 2023
@maelvls
Copy link
Member Author

maelvls commented Sep 11, 2023

@inteon I'll need a bit of time to refactor the tests. Could you take a look at the general idea of this PR? Thanks.

@inteon
Copy link
Member

inteon commented Sep 11, 2023

I think you are making a valid point.
I'll just try to write down some facts as reference:

There are two possible options: you fail permanently or you can retry reconciliation.

  1. fail permanently:

    • this is always an error and this should be logged
    • can be caused by a returned PermanentError or by a "normal error" that has not been retried enough
  2. retry reconciliation:

    • we can have a "normal error" which probably indicates an error that might be transient
    • we can have a PendingError which probably indicates something is not yet ready (because we will never timeout in case of a PendingError)

The relevant error types are PermanentError, PendingError and "normal error".

So, maybe we should not log PendingErrors as "errors", but do we log normal errors (errors that might be transient) as errors?

@maelvls
Copy link
Member Author

maelvls commented Sep 12, 2023

My concern is that we are conflating two things:

  1. "The signing process is going well and is progressing" (PendingError),
  2. "The signing process failed and will (or will not) be retried" (also PendingError, or PermanentError, or a normal error).

Currently, PendingError is used for both (1) and (2), and that's causing me a lot of confusion.

What I would recommend is to retire PendingError altogether and rename PermanentError to TerminalError to match controller-runtime's naming (that last change isn't required, though).

Since PendingError isn't semantically equivalent to an error, I think we shouldn't use an error to avoid the confusion with one. To avoid any confusion, I would split the returned values in two parts that are exclusive to each other: "Result" and "error". For example:

type Result struct {
    // The Retry attempts are subject to an exponential backoff with an
    // upper limit of 1 hour.
    Retry bool

    // Optional. It may be useful for the user to know why the signing is
    // still in progress.
    RetryReason string

    // The SecCondition is useful to store information on the
    // CertificateRequest (e.g., a polling ID).
    SetCondition Condition
}

// The Sign func may return an error wrapped in TerminalError to signify that the
// signing process has stopped due to a problem that the implementor of Sign thinks
// can't be resolved by retrying. The Sign func doesn't return a Result when an
// error has been returned.
type Sign func(context.Context, CertificateRequestObject, v1alpha1.Issuer) (PEMBundle, Result, error)

On an other topic, I went looking at the documentation around Sign but the comment above the func signature doesn't tell what errors the implementors of Sign can use. One has to dig into the other files around (err_set_condition.go etc) to learn that they can return special errors.

WDYT @inteon?

Makefile Outdated Show resolved Hide resolved
@jetstack-bot jetstack-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2023
@jetstack-bot jetstack-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 9, 2023
@inteon inteon self-assigned this Oct 13, 2023
@inteon inteon force-pushed the detail-logs-usage branch from bb1a7db to c21fd90 Compare October 13, 2023 13:44
@jetstack-bot jetstack-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 13, 2023
@inteon
Copy link
Member

inteon commented Nov 16, 2023

This is a great improvement to issuer-lib!
/lgtm
/approve

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2023
@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: inteon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2023
@jetstack-bot jetstack-bot merged commit 24eaa87 into cert-manager:main Nov 16, 2023
3 checks passed
@maelvls maelvls deleted the detail-logs-usage branch November 17, 2023 11:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants