Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More graceful failure recovery #61

Merged
merged 4 commits into from
Sep 1, 2020

Conversation

seaneagan
Copy link
Contributor

@seaneagan seaneagan commented Aug 26, 2020

  • Ensure upgrade actually occurs if known state was not reached
    for any reason (other than install failure).
  • After transient failures not tied to new state application, ensure
    spurious upgrades do not occur and ready state is again reached,
    by remembering that the known state was already successfully applied.
  • Reset failure counts after success so they're not stale.
  • Only lookup post-deployment release revision on remediation,
    since otherwise we already have it.

@seaneagan seaneagan force-pushed the transient-error-recovery branch 3 times, most recently from aacf564 to 2b5142f Compare August 28, 2020 15:34
@seaneagan seaneagan changed the title Recover status after transient errors More graceful failure recovery Aug 28, 2020
@seaneagan seaneagan marked this pull request as ready for review August 28, 2020 16:07
Copy link
Member

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the spec documentation to incorporate the added status fields?

@hiddeco
Copy link
Member

hiddeco commented Aug 31, 2020

Other thing I just noticed, on https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L344, I think we should push the update so that the progressing state becomes observable.

In addition to this, I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op).

- Ensure upgrade actually occurs if known state was not reached
  for any reason (other than install failure).
- After transient failures not tied to new state application, ensure
  spurious upgrades do not occur and ready state is again reached,
  by remembering that the known state was already successfully applied.
- Reset failure counts after success so they're not stale.
- Only lookup post-deployment release revision on remediation,
  since otherwise we already have it.
- Push status update after finding new state so user can observe.
@seaneagan
Copy link
Contributor Author

Can you please update the spec documentation to incorporate the added status fields?

I added it to the example status output there. I didn't add any prose documentation for it, since we don't have any for other status fields either (only the ready condition). I think the other status conditions should be documented as part of fluxcd/flux2#179. I could follow up with a PR to document all the other fields, if we think the existing reference documentation is not sufficient.

@hiddeco
Copy link
Member

hiddeco commented Aug 31, 2020

I think the other status conditions should be documented as part of fluxcd/flux2#179. I could follow up with a PR to document all the other fields, if we think the existing reference documentation is not sufficient.

✔️


What is not incorporated in the PR yet is my suggestion to

I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op).

In detail this would mean that we always call:

SetHelmReleaseCondition(&hr, ReadyCondition, corev1.ConditionUnknown, ProgressingReason, "reconciliation in progress")

after the suspend check (https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L138), but only reset the status fields on an observed state or generation change.

@seaneagan
Copy link
Contributor Author

Other thing I just noticed, on https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L344, I think we should push the update so that the progressing state becomes observable.

Done.

In addition to this, I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op).

I think this will be confusing to the user and any automation that is doing periodic status checking to have the Ready condition continually oscillate between "True" / "False" and "Unknown". It also results in data loss, for example all the reasons / messages present on the conditions about how they got into that state. I think unless the controller finds a specific change that needs to be reconciled, or experiences some failure in the reconciliation process, then there's no reason to assume that the conditions have changed. It seems like this is the behavior I have observed in other controllers with respect to conditions. One thing that does seem useful is an e.g. lastObservedTime or similar field (Helm Operator called it lastUpdateTime) on each condition, to know as of when the condition has been validated by the controller, however this does not exist in the recommended condition schema.

@hiddeco
Copy link
Member

hiddeco commented Aug 31, 2020

It also results in data loss, for example all the reasons / messages present on the conditions about how they got into that state.

This can be overcome by only resetting / replacing the ReadyCondition (which SetHelmReleaseCondition does), which would allow the other conditions to still exists for as long as there hasn't been a state change detection, while also allowing the user to observe that the resource is continuously being reconciled at the configured interval.

I can think of two scenarios where automation would profit from the condition change:

  • To be able to detect if a ReconcileAtAnnotation actually triggers a reconciliation
  • To guarantee of processing up-to-date DependsOn resources

@seaneagan
Copy link
Contributor Author

This can be overcome by only resetting / replacing the ReadyCondition (which SetHelmReleaseCondition does), which would allow the other conditions to still exists for as long as there hasn't been a state change detection

As of #47 the ReadyCondition contains valuable info too in it's reason / message / lastTransitionTime which I don't think we'd want to lose. I'm hoping to update the e2e tests soon to validate that these fields are set appropriately for each scenario.

, while also allowing the user to observe that the resource is continuously being reconciled at the configured interval.

I think this is where a status.lastObservedTime (or try to get it included in the recommended status condition schema) or perhaps status.lastReconciledAt would be very useful.

I can think of two scenarios where automation would profit from the condition change:

* To be able to detect if a `ReconcileAtAnnotation` actually triggers a reconciliation

I think lastObservedTime would help here as well.

To check my understanding of the intended semantics of this annotation, it is just meant to trigger a normal reconciliation loop (no different than the interval based one), as opposed to triggering an actual unconditional helm upgrade correct? I think a helm upgrade trigger may be useful as well, but perhaps this would be a separate e.g. "releaseAt" annotation, similar to "testAt" (fluxcd/flux2#103).

* To guarantee of processing up-to-date `DependsOn` resources

This is already implemented (though maybe deserves an automated test i.e. the "guarantee part :) ), if a dependency is found not ready the dependent helm release will be marked not ready with an appropriate reason / message.

This is more consistent with the existing terminology used.
This adds a .status.lastObservedTime field which records when the
HelmRelease was last observed by the controller. This allows the user
to observe whether the spec.interval and reconcileAt annotations are
triggering reconciliation attempts as desired.
Also set status.lastObservedTime to the actual time of the status update.
Copy link
Member

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this meets all our requirements for observability, and enough observation knobs to make the Toolkit CLI aware if e.g. a tk reconcile hr results in an actual reconciliation.

+1 for adding the LastObservedTime field to the Status object, given the Condition now has a proper design from the Kubernetes upstream, which we want to follow.

@hiddeco
Copy link
Member

hiddeco commented Sep 1, 2020

@stefanprodan please give this a look and see if it matches your expectations after the discussion we had.

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @seaneagan 🎖️

@hiddeco hiddeco added area/ux In pursuit of a delightful user experience enhancement New feature or request labels Sep 1, 2020
@hiddeco hiddeco merged commit 3413c76 into fluxcd:master Sep 1, 2020
jtyr pushed a commit to jtyr/fluxcd-helm-controller that referenced this pull request Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ux In pursuit of a delightful user experience enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants