More graceful failure recovery #61

seaneagan · 2020-08-26T21:22:56Z

Ensure upgrade actually occurs if known state was not reached
for any reason (other than install failure).
After transient failures not tied to new state application, ensure
spurious upgrades do not occur and ready state is again reached,
by remembering that the known state was already successfully applied.
Reset failure counts after success so they're not stale.
Only lookup post-deployment release revision on remediation,
since otherwise we already have it.

hiddeco

Can you please update the spec documentation to incorporate the added status fields?

hiddeco · 2020-08-31T11:11:34Z

Other thing I just noticed, on https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L344, I think we should push the update so that the progressing state becomes observable.

In addition to this, I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op).

- Ensure upgrade actually occurs if known state was not reached for any reason (other than install failure). - After transient failures not tied to new state application, ensure spurious upgrades do not occur and ready state is again reached, by remembering that the known state was already successfully applied. - Reset failure counts after success so they're not stale. - Only lookup post-deployment release revision on remediation, since otherwise we already have it. - Push status update after finding new state so user can observe.

seaneagan · 2020-08-31T13:58:44Z

Can you please update the spec documentation to incorporate the added status fields?

I added it to the example status output there. I didn't add any prose documentation for it, since we don't have any for other status fields either (only the ready condition). I think the other status conditions should be documented as part of fluxcd/flux2#179. I could follow up with a PR to document all the other fields, if we think the existing reference documentation is not sufficient.

hiddeco · 2020-08-31T14:11:47Z

I think the other status conditions should be documented as part of fluxcd/flux2#179. I could follow up with a PR to document all the other fields, if we think the existing reference documentation is not sufficient.

✔️

What is not incorporated in the PR yet is my suggestion to

I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op).

In detail this would mean that we always call:

SetHelmReleaseCondition(&hr, ReadyCondition, corev1.ConditionUnknown, ProgressingReason, "reconciliation in progress")

after the suspend check (https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L138), but only reset the status fields on an observed state or generation change.

seaneagan · 2020-08-31T14:20:26Z

Other thing I just noticed, on https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L344, I think we should push the update so that the progressing state becomes observable.

Done.

In addition to this, I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op).

I think this will be confusing to the user and any automation that is doing periodic status checking to have the Ready condition continually oscillate between "True" / "False" and "Unknown". It also results in data loss, for example all the reasons / messages present on the conditions about how they got into that state. I think unless the controller finds a specific change that needs to be reconciled, or experiences some failure in the reconciliation process, then there's no reason to assume that the conditions have changed. It seems like this is the behavior I have observed in other controllers with respect to conditions. One thing that does seem useful is an e.g. lastObservedTime or similar field (Helm Operator called it lastUpdateTime) on each condition, to know as of when the condition has been validated by the controller, however this does not exist in the recommended condition schema.

hiddeco · 2020-08-31T15:21:25Z

It also results in data loss, for example all the reasons / messages present on the conditions about how they got into that state.

This can be overcome by only resetting / replacing the ReadyCondition (which SetHelmReleaseCondition does), which would allow the other conditions to still exists for as long as there hasn't been a state change detection, while also allowing the user to observe that the resource is continuously being reconciled at the configured interval.

I can think of two scenarios where automation would profit from the condition change:

To be able to detect if a ReconcileAtAnnotation actually triggers a reconciliation
To guarantee of processing up-to-date DependsOn resources

seaneagan · 2020-08-31T16:46:51Z

This can be overcome by only resetting / replacing the ReadyCondition (which SetHelmReleaseCondition does), which would allow the other conditions to still exists for as long as there hasn't been a state change detection

As of #47 the ReadyCondition contains valuable info too in it's reason / message / lastTransitionTime which I don't think we'd want to lose. I'm hoping to update the e2e tests soon to validate that these fields are set appropriately for each scenario.

, while also allowing the user to observe that the resource is continuously being reconciled at the configured interval.

I think this is where a status.lastObservedTime (or try to get it included in the recommended status condition schema) or perhaps status.lastReconciledAt would be very useful.

I can think of two scenarios where automation would profit from the condition change:
* To be able to detect if a `ReconcileAtAnnotation` actually triggers a reconciliation

I think lastObservedTime would help here as well.

To check my understanding of the intended semantics of this annotation, it is just meant to trigger a normal reconciliation loop (no different than the interval based one), as opposed to triggering an actual unconditional helm upgrade correct? I think a helm upgrade trigger may be useful as well, but perhaps this would be a separate e.g. "releaseAt" annotation, similar to "testAt" (fluxcd/flux2#103).

* To guarantee of processing up-to-date `DependsOn` resources

This is already implemented (though maybe deserves an automated test i.e. the "guarantee part :) ), if a dependency is found not ready the dependent helm release will be marked not ready with an appropriate reason / message.

This is more consistent with the existing terminology used.

This adds a .status.lastObservedTime field which records when the HelmRelease was last observed by the controller. This allows the user to observe whether the spec.interval and reconcileAt annotations are triggering reconciliation attempts as desired.

Also set status.lastObservedTime to the actual time of the status update.

hiddeco

I think this meets all our requirements for observability, and enough observation knobs to make the Toolkit CLI aware if e.g. a tk reconcile hr results in an actual reconciliation.

+1 for adding the LastObservedTime field to the Status object, given the Condition now has a proper design from the Kubernetes upstream, which we want to follow.

hiddeco · 2020-09-01T10:05:14Z

@stefanprodan please give this a look and see if it matches your expectations after the discussion we had.

api/v2alpha1/helmrelease_types.go

stefanprodan

LGTM

Thanks @seaneagan 🎖️

fix uninitialized toleration

seaneagan force-pushed the transient-error-recovery branch 3 times, most recently from aacf564 to 2b5142f Compare August 28, 2020 15:34

seaneagan changed the title ~~Recover status after transient errors~~ More graceful failure recovery Aug 28, 2020

seaneagan marked this pull request as ready for review August 28, 2020 16:07

hiddeco reviewed Aug 31, 2020

View reviewed changes

seaneagan force-pushed the transient-error-recovery branch from 2b5142f to be9e22f Compare August 31, 2020 13:49

seaneagan added 2 commits August 31, 2020 12:12

Rename knownStateApplied to observedStateReconciled

fd7d232

This is more consistent with the existing terminology used.

seaneagan force-pushed the transient-error-recovery branch from df8e05f to 55f6038 Compare August 31, 2020 18:34

Update status iff new state or done reconciling

0d64e8d

Also set status.lastObservedTime to the actual time of the status update.

seaneagan force-pushed the transient-error-recovery branch from 056a2bc to 0d64e8d Compare August 31, 2020 21:00

hiddeco approved these changes Sep 1, 2020

View reviewed changes

hiddeco requested a review from stefanprodan September 1, 2020 10:04

stefanprodan reviewed Sep 1, 2020

View reviewed changes

api/v2alpha1/helmrelease_types.go Show resolved Hide resolved

stefanprodan approved these changes Sep 1, 2020

View reviewed changes

hiddeco added area/ux In pursuit of a delightful user experience enhancement New feature or request labels Sep 1, 2020

hiddeco merged commit 3413c76 into fluxcd:master Sep 1, 2020

jtyr pushed a commit to jtyr/fluxcd-helm-controller that referenced this pull request Dec 14, 2023

Merge pull request fluxcd#61 from dweomer/fix-toleration

c078893

fix uninitialized toleration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More graceful failure recovery #61

More graceful failure recovery #61

seaneagan commented Aug 26, 2020 •

edited

Loading

hiddeco left a comment

hiddeco commented Aug 31, 2020 •

edited

Loading

seaneagan commented Aug 31, 2020

hiddeco commented Aug 31, 2020 •

edited

Loading

seaneagan commented Aug 31, 2020

hiddeco commented Aug 31, 2020 •

edited

Loading

seaneagan commented Aug 31, 2020

hiddeco left a comment

hiddeco commented Sep 1, 2020

stefanprodan left a comment

More graceful failure recovery #61

More graceful failure recovery #61

Conversation

seaneagan commented Aug 26, 2020 • edited Loading

hiddeco left a comment

Choose a reason for hiding this comment

hiddeco commented Aug 31, 2020 • edited Loading

seaneagan commented Aug 31, 2020

hiddeco commented Aug 31, 2020 • edited Loading

seaneagan commented Aug 31, 2020

hiddeco commented Aug 31, 2020 • edited Loading

seaneagan commented Aug 31, 2020

hiddeco left a comment

Choose a reason for hiding this comment

hiddeco commented Sep 1, 2020

stefanprodan left a comment

Choose a reason for hiding this comment

seaneagan commented Aug 26, 2020 •

edited

Loading

hiddeco commented Aug 31, 2020 •

edited

Loading

hiddeco commented Aug 31, 2020 •

edited

Loading

hiddeco commented Aug 31, 2020 •

edited

Loading