-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More graceful failure recovery #61
Conversation
aacf564
to
2b5142f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update the spec documentation to incorporate the added status fields?
Other thing I just noticed, on https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L344, I think we should push the update so that the progressing state becomes observable. In addition to this, I think we should also record the progressing state whenever the reconciler handles the object, as it should signal the resource is being handled (including confirming the state still matches, even if it results in a no-op). |
- Ensure upgrade actually occurs if known state was not reached for any reason (other than install failure). - After transient failures not tied to new state application, ensure spurious upgrades do not occur and ready state is again reached, by remembering that the known state was already successfully applied. - Reset failure counts after success so they're not stale. - Only lookup post-deployment release revision on remediation, since otherwise we already have it. - Push status update after finding new state so user can observe.
2b5142f
to
be9e22f
Compare
I added it to the example status output there. I didn't add any prose documentation for it, since we don't have any for other status fields either (only the ready condition). I think the other status conditions should be documented as part of fluxcd/flux2#179. I could follow up with a PR to document all the other fields, if we think the existing reference documentation is not sufficient. |
✔️ What is not incorporated in the PR yet is my suggestion to
In detail this would mean that we always call: SetHelmReleaseCondition(&hr, ReadyCondition, corev1.ConditionUnknown, ProgressingReason, "reconciliation in progress") after the suspend check (https://github.com/seaneagan/helm-controller/blob/transient-error-recovery/controllers/helmrelease_controller.go#L138), but only reset the status fields on an observed state or generation change. |
Done.
I think this will be confusing to the user and any automation that is doing periodic status checking to have the Ready condition continually oscillate between "True" / "False" and "Unknown". It also results in data loss, for example all the reasons / messages present on the conditions about how they got into that state. I think unless the controller finds a specific change that needs to be reconciled, or experiences some failure in the reconciliation process, then there's no reason to assume that the conditions have changed. It seems like this is the behavior I have observed in other controllers with respect to conditions. One thing that does seem useful is an e.g. |
This can be overcome by only resetting / replacing the I can think of two scenarios where automation would profit from the condition change:
|
As of #47 the ReadyCondition contains valuable info too in it's reason / message / lastTransitionTime which I don't think we'd want to lose. I'm hoping to update the e2e tests soon to validate that these fields are set appropriately for each scenario.
I think this is where a
I think To check my understanding of the intended semantics of this annotation, it is just meant to trigger a normal reconciliation loop (no different than the interval based one), as opposed to triggering an actual unconditional helm upgrade correct? I think a helm upgrade trigger may be useful as well, but perhaps this would be a separate e.g. "releaseAt" annotation, similar to "testAt" (fluxcd/flux2#103).
This is already implemented (though maybe deserves an automated test i.e. the "guarantee part :) ), if a dependency is found not ready the dependent helm release will be marked not ready with an appropriate reason / message. |
This is more consistent with the existing terminology used.
This adds a .status.lastObservedTime field which records when the HelmRelease was last observed by the controller. This allows the user to observe whether the spec.interval and reconcileAt annotations are triggering reconciliation attempts as desired.
df8e05f
to
55f6038
Compare
Also set status.lastObservedTime to the actual time of the status update.
056a2bc
to
0d64e8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this meets all our requirements for observability, and enough observation knobs to make the Toolkit CLI aware if e.g. a tk reconcile hr
results in an actual reconciliation.
+1 for adding the LastObservedTime
field to the Status
object, given the Condition
now has a proper design from the Kubernetes upstream, which we want to follow.
@stefanprodan please give this a look and see if it matches your expectations after the discussion we had. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @seaneagan 🎖️
fix uninitialized toleration
for any reason (other than install failure).
spurious upgrades do not occur and ready state is again reached,
by remembering that the known state was already successfully applied.
since otherwise we already have it.