Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update status readiness #654

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions apis/common/v1/condition.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ const (
ReasonAvailable ConditionReason = "Available"
ReasonUnavailable ConditionReason = "Unavailable"
ReasonCreating ConditionReason = "Creating"
ReasonUpdating ConditionReason = "Updating"
ReasonDeleting ConditionReason = "Deleting"
)

Expand Down Expand Up @@ -194,6 +195,17 @@ func Creating() Condition {
}
}

// Updating returns a condition that indicates the resource is currently
// being updated.
func Updating() Condition {
return Condition{
Type: TypeReady,
Status: corev1.ConditionFalse,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per #583 (comment), I don't think an updating resource is necessarily an unready resource.

LastTransitionTime: metav1.Now(),
Reason: ReasonUpdating,
}
}

// Deleting returns a condition that indicates the resource is currently
// being deleted.
func Deleting() Condition {
Expand Down
78 changes: 38 additions & 40 deletions pkg/reconciler/managed/reconciler.go
Original file line number Diff line number Diff line change
Expand Up @@ -1142,15 +1142,10 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (resu
}
}

if observation.ResourceUpToDate {
// We did not need to create, update, or delete our external resource.
// Per the below issue nothing will notify us if and when the external
// resource we manage changes, so we requeue a speculative reconcile
// after the specified poll interval in order to observe it and react
// accordingly.
// https://github.com/crossplane/crossplane/issues/289
// skip the update if the management policy is set to ignore updates
if !policy.ShouldUpdate() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally lower in the reconciler, but we moved it here to catch cases where the management policy does not allow updates early.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is an improvement on the previous behavior.

At this point in the reconcile loop we know no update is needed, regardless of whether the policy would or wouldn't allow an update. I feel it's more useful to log that no update is needed than to log that no update will be performed due to policy. Logging that no update will be performed due to policy leaves it ambiguous as to whether an update is needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point! But I think I would rather rewrite the if on line 1157:

From this: if !observation.ResourceUpToDate && policy.ShouldUpdate() {
To this: if !observation.ResourceUpToDate {

And then I can immediately on the line below check if !policy.ShouldUpdate() and tell the user that the update will not be performed. This way it will only log this if an update is actually needed.

Would this maybe create spam for the users, though? I am not very familiar with how the ShouldUpdate policy is set, but when it is set to false, would the user not be spammed with Skipping update... on every reconcile then?

reconcileAfter := r.pollIntervalHook(managed, r.pollInterval)
log.Debug("External resource is up to date", "requeue-after", time.Now().Add(reconcileAfter))
log.Debug("Skipping update due to managementPolicies. Reconciliation succeeded", "requeue-after", time.Now().Add(reconcileAfter))
managed.SetConditions(xpv1.ReconcileSuccess())
return reconcile.Result{RequeueAfter: reconcileAfter}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}
Expand All @@ -1159,45 +1154,48 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (resu
log.Debug("External resource differs from desired state", "diff", observation.Diff)
}

// skip the update if the management policy is set to ignore updates
if !policy.ShouldUpdate() {
reconcileAfter := r.pollIntervalHook(managed, r.pollInterval)
log.Debug("Skipping update due to managementPolicies. Reconciliation succeeded", "requeue-after", time.Now().Add(reconcileAfter))
managed.SetConditions(xpv1.ReconcileSuccess())
return reconcile.Result{RequeueAfter: reconcileAfter}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}
if !observation.ResourceUpToDate && policy.ShouldUpdate() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if condition is what allows us to treat the Update case the same as the Create case.

This is as close to the implementation for Create as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure observation.ResourceUpToDate is the correct signal from which to derive an Updating condition.

To me:

  • ReasonCreating means a resource is unavailable because it's still being created by the cloud provider.
  • ReasonUpdating would therefore mean a resource is currently being updated by the cloud provider.

It would potentially make sense to set ReasonUpdating if the cloud provider API let us know that the resource was processing an update. For example I think a GKE cluster would go into state REPAIRING at the API level and be partially degraded while updating. On the other hand, a lot of cloud APIs won't give any indication that an update is being processed.

I believe observation.ResourceUpToDate indicates that there's a diff between the MR's desired state, and the observed state of the external resource. It's not indicating whether the external resource is updating but rather whether the external resource appears to need an update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this reconciler sets the Creating condition, but it never unsets it. It never sets the Available condition.

The reconciler can know that it triggered the creation of of an external resource, but it has no way to know whether creation is done. The ExternalClient is responsible for setting the resource to Available when it notices that creation has completed. For some resources the ExternalClient knows that creation has completed due to some API level state transition. Other resources are ready as soon as they're created.

Will ExternalClient implementations need to factor in the existence of a new Updating condition by handling the case where a resource has transitioned from Updating back to Available?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure observation.ResourceUpToDate is the correct signal from which to derive an Updating condition.

To me:

* `ReasonCreating` means a resource is unavailable because it's still being created by the cloud provider.

* `ReasonUpdating` would therefore mean a resource is currently being updated by the cloud provider.

It would potentially make sense to set ReasonUpdating if the cloud provider API let us know that the resource was processing an update. For example I think a GKE cluster would go into state REPAIRING at the API level and be partially degraded while updating. On the other hand, a lot of cloud APIs won't give any indication that an update is being processed.

I believe observation.ResourceUpToDate indicates that there's a diff between the MR's desired state, and the observed state of the external resource. It's not indicating whether the external resource is updating but rather whether the external resource appears to need an update.

I don't really agree with this.

You seem to focus a lot on what the cloud provider reports for the resource, so if they say it is ready, we set it to ready, but is this really what we want?
When I see that a resource is ready, I would think the cloud provider says it is good to go AND it matches my desired state.
If I update some firewall resource to add a rule to block something happening, is this really ready until my observe function has verified that the rule is actually there?

When I first read this in the reconcile function, it read it as "The API returned a 2xx code indicating that we have successfully created the resource, now let's requeue immediately to make sure it is actually created".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this reconciler sets the Creating condition, but it never unsets it. It never sets the Available condition.

The reconciler can know that it triggered the creation of of an external resource, but it has no way to know whether creation is done. The ExternalClient is responsible for setting the resource to Available when it notices that creation has completed. For some resources the ExternalClient knows that creation has completed due to some API level state transition. Other resources are ready as soon as they're created.

Will ExternalClient implementations need to factor in the existence of a new Updating condition by handling the case where a resource has transitioned from Updating back to Available?

That is a good point.
I am not very familiar with how for example the Upjet providers set the ready field, but I assume they do it at the end of the observe function just before they return the response saying that the resource both exists and is up to date.

My desire was always to just make Update behave the same as Create, to always give the most up to date status on if the resource is as I desire.
So in the implementation I proposed, I set the status to not ready and Updating with an instant requeue to have the observe function have to verify that things are OK. This way it should be covered by the same logic that already sets the resource as Ready by existing providers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't actually able to find these definitions in the docs, only this from the troubleshooting part:

Most Crossplane resources set the Ready condition. Ready represents the availability of the resource - whether it is creating, deleting, available, unavailable, binding, etc.

You acknowledge that there is a gap here, and I feel like the issues you linked confirm the gap and that is a big enough problem that workarounds have been attempted to fix it.

I understand that definitions and such can be scary to change, as they often introduce some level of a breaking change, but I don't really see the negative side of changing this. All providers already implement a way of checking if the resource is actually compliant with the desired state, I only suggest that we actually use it for Update too.

What is the argument for not changing this?
The only negative I see is that the resource will be shown as Updating for a couple of seconds before it shows Ready.

Another issue with the current implementation:
If I were to create my own in-house provider, and I wanted the Ready field to better indicate if the resource is updating, I could manually set the condition in my Update function. The problem is that there is no reconcile requeue after successfully requesting the update, so if I have 10 minute poll interval, it means my resource is Updating for 10 minutes, even though it only took a second.

Copy link
Member

@negz negz Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the argument for not changing this?

I'm not convinced the way you're addressing the problem in this PR is the best way to address the problem. If we're going to change the semantics of Ready, we should take a step back and:

  • Consider the impact.
  • Consider whether changing the semantics of Ready is the right thing to do (vs alternatives like adopting kstatus.

I think a change like this warrants at least a one-pager to get us all aligned on what the problem is, and on which solution is best.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Would you want me to write it, or is this something you would do internally?
I am not very familiar with the one-pager concept, but it seems like most of the existing ones are written by maintainers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely happy for you to write it. 😄 I think it's a great thing to have more community members leading designs. We can pair you up with either myself or a maintainer closer to your timezone (e.g. @phisco or @turkenh) to help bounce ideas from.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can give it a try 😄
I'll work on a first draft and open a PR in the Crossplane repo. I will do some research into Kstatus too.

update, err := external.Update(externalCtx, managed)
if err != nil {
// We'll hit this condition if we can't update our external resource,
// for example if our provider credentials don't have access to update
// it. If this is the first time we encounter this issue we'll be
// requeued implicitly when we update our status with the new error
// condition. If not, we requeue explicitly, which will trigger backoff.
log.Debug("Cannot update external resource")
record.Event(managed, event.Warning(reasonCannotUpdate, err))
managed.SetConditions(xpv1.Updating(), xpv1.ReconcileError(errors.Wrap(err, errReconcileUpdate)))
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
Comment on lines +1167 to +1168
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly the point of the PR. This is what gives us more consistent behaviour for Update.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also similar additions on line 1177, for example.

}

update, err := external.Update(externalCtx, managed)
if err != nil {
// We'll hit this condition if we can't update our external resource,
// for example if our provider credentials don't have access to update
// it. If this is the first time we encounter this issue we'll be
// requeued implicitly when we update our status with the new error
// condition. If not, we requeue explicitly, which will trigger backoff.
log.Debug("Cannot update external resource")
record.Event(managed, event.Warning(reasonCannotUpdate, err))
managed.SetConditions(xpv1.ReconcileError(errors.Wrap(err, errReconcileUpdate)))
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}
if _, err := r.managed.PublishConnection(ctx, managed, update.ConnectionDetails); err != nil {
// If this is the first time we encounter this issue we'll be requeued
// implicitly when we update our status with the new error condition. If
// not, we requeue explicitly, which will trigger backoff.
log.Debug("Cannot publish connection details", "error", err)
record.Event(managed, event.Warning(reasonCannotPublish, err))
managed.SetConditions(xpv1.Updating(), xpv1.ReconcileError(err))
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

if _, err := r.managed.PublishConnection(ctx, managed, update.ConnectionDetails); err != nil {
// If this is the first time we encounter this issue we'll be requeued
// implicitly when we update our status with the new error condition. If
// not, we requeue explicitly, which will trigger backoff.
log.Debug("Cannot publish connection details", "error", err)
record.Event(managed, event.Warning(reasonCannotPublish, err))
managed.SetConditions(xpv1.ReconcileError(err))
// We've successfully updated our external resource. In many cases the
// updating process takes a little time to finish. We requeue explicitly
// order to observe the external resource to determine whether it's
// ready for use.
Comment on lines +1181 to +1184
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the "We requeue explicitly order to observe..." sentence. It seems a bit odd to me.

This comment was copied from the Create case above and changed to fit with the Update case.

log.Debug("Successfully requested update of external resource")
record.Event(managed, event.Normal(reasonUpdated, "Successfully requested update of external resource"))
managed.SetConditions(xpv1.Updating(), xpv1.ReconcileSuccess())
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

// We've successfully updated our external resource. Per the below issue
// nothing will notify us if and when the external resource we manage
// changes, so we requeue a speculative reconcile after the specified poll
// interval in order to observe it and react accordingly.
// We did not need to create, update, or delete our external resource.
// Per the below issue nothing will notify us if and when the external
// resource we manage changes, so we requeue a speculative reconcile
// after the specified poll interval in order to observe it and react
// accordingly.
Comment on lines +1191 to +1195
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the comment that was used for the if observation.ResourceUpToDate case.

We moved it to the bottom as it seems to work well for the last resort case, which is when there is nothing we need to do.

// https://github.com/crossplane/crossplane/issues/289
reconcileAfter := r.pollIntervalHook(managed, r.pollInterval)
log.Debug("Successfully requested update of external resource", "requeue-after", time.Now().Add(reconcileAfter))
record.Event(managed, event.Normal(reasonUpdated, "Successfully requested update of external resource"))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chose to not emit any events here, but we are not sure.

Emitting here would emit every reconcile that just observed and everything is okay, so it doesn't seem right.

log.Debug("External resource is up to date", "requeue-after", time.Now().Add(reconcileAfter))
managed.SetConditions(xpv1.ReconcileSuccess())
return reconcile.Result{RequeueAfter: reconcileAfter}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}
17 changes: 11 additions & 6 deletions pkg/reconciler/managed/reconciler_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1075,6 +1075,7 @@ func TestReconciler(t *testing.T) {
MockStatusUpdate: test.MockSubResourceUpdateFn(func(_ context.Context, obj client.Object, _ ...client.SubResourceUpdateOption) error {
want := &fake.Managed{}
want.SetConditions(xpv1.ReconcileError(errors.Wrap(errBoom, errReconcileUpdate)))
want.SetConditions(xpv1.Updating())
if diff := cmp.Diff(want, obj, test.EquateConditions()); diff != "" {
reason := "Errors while updating an external resource should be reported as a conditioned status."
t.Errorf("\nReason: %s\n-want, +got:\n%s", reason, diff)
Expand Down Expand Up @@ -1114,6 +1115,7 @@ func TestReconciler(t *testing.T) {
MockStatusUpdate: test.MockSubResourceUpdateFn(func(_ context.Context, obj client.Object, _ ...client.SubResourceUpdateOption) error {
want := &fake.Managed{}
want.SetConditions(xpv1.ReconcileError(errBoom))
want.SetConditions(xpv1.Updating())
if diff := cmp.Diff(want, obj, test.EquateConditions()); diff != "" {
reason := "Errors publishing connection details after an update should be reported as a conditioned status."
t.Errorf("\nReason: %s\n-want, +got:\n%s", reason, diff)
Expand Down Expand Up @@ -1156,14 +1158,15 @@ func TestReconciler(t *testing.T) {
want: want{result: reconcile.Result{Requeue: true}},
},
"UpdateSuccessful": {
reason: "A successful managed resource update should trigger a requeue after a long wait.",
reason: "A successful managed resource update should trigger a requeue after a short wait",
args: args{
m: &fake.Manager{
Client: &test.MockClient{
MockGet: test.NewMockGetFn(nil),
MockStatusUpdate: test.MockSubResourceUpdateFn(func(_ context.Context, obj client.Object, _ ...client.SubResourceUpdateOption) error {
want := &fake.Managed{}
want.SetConditions(xpv1.ReconcileSuccess())
want.SetConditions(xpv1.Updating())
if diff := cmp.Diff(want, obj, test.EquateConditions()); diff != "" {
reason := "A successful managed resource update should be reported as a conditioned status."
t.Errorf("\nReason: %s\n-want, +got:\n%s", reason, diff)
Expand Down Expand Up @@ -1192,7 +1195,7 @@ func TestReconciler(t *testing.T) {
WithFinalizer(resource.FinalizerFns{AddFinalizerFn: func(_ context.Context, _ resource.Object) error { return nil }}),
},
},
want: want{result: reconcile.Result{RequeueAfter: defaultPollInterval}},
want: want{result: reconcile.Result{Requeue: true}},
},
"ReconciliationPausedSuccessful": {
reason: `If a managed resource has the pause annotation with value "true", there should be no further requeue requests.`,
Expand Down Expand Up @@ -1662,7 +1665,7 @@ func TestReconciler(t *testing.T) {
want: want{result: reconcile.Result{RequeueAfter: defaultPollInterval}},
},
"ManagementPolicyAllUpdateSuccessful": {
reason: "A successful managed resource update using management policies should trigger a requeue after a long wait.",
reason: "A successful managed resource update using management policies should trigger a requeue after a short wait.",
args: args{
m: &fake.Manager{
Client: &test.MockClient{
Expand All @@ -1675,6 +1678,7 @@ func TestReconciler(t *testing.T) {
want := &fake.Managed{}
want.SetManagementPolicies(xpv1.ManagementPolicies{xpv1.ManagementActionAll})
want.SetConditions(xpv1.ReconcileSuccess())
want.SetConditions(xpv1.Updating())
if diff := cmp.Diff(want, obj, test.EquateConditions()); diff != "" {
reason := "A successful managed resource update should be reported as a conditioned status."
t.Errorf("\nReason: %s\n-want, +got:\n%s", reason, diff)
Expand Down Expand Up @@ -1704,10 +1708,10 @@ func TestReconciler(t *testing.T) {
WithFinalizer(resource.FinalizerFns{AddFinalizerFn: func(_ context.Context, _ resource.Object) error { return nil }}),
},
},
want: want{result: reconcile.Result{RequeueAfter: defaultPollInterval}},
want: want{result: reconcile.Result{Requeue: true}},
},
"ManagementPolicyUpdateUpdateSuccessful": {
reason: "A successful managed resource update using management policies should trigger a requeue after a long wait.",
reason: "A successful managed resource update using management policies should trigger a requeue after a short wait.",
args: args{
m: &fake.Manager{
Client: &test.MockClient{
Expand All @@ -1720,6 +1724,7 @@ func TestReconciler(t *testing.T) {
want := &fake.Managed{}
want.SetManagementPolicies(xpv1.ManagementPolicies{xpv1.ManagementActionAll})
want.SetConditions(xpv1.ReconcileSuccess())
want.SetConditions(xpv1.Updating())
if diff := cmp.Diff(want, obj, test.EquateConditions()); diff != "" {
reason := "A successful managed resource update should be reported as a conditioned status."
t.Errorf("\nReason: %s\n-want, +got:\n%s", reason, diff)
Expand Down Expand Up @@ -1749,7 +1754,7 @@ func TestReconciler(t *testing.T) {
WithFinalizer(resource.FinalizerFns{AddFinalizerFn: func(_ context.Context, _ resource.Object) error { return nil }}),
},
},
want: want{result: reconcile.Result{RequeueAfter: defaultPollInterval}},
want: want{result: reconcile.Result{Requeue: true}},
},
"ManagementPolicySkipLateInitialize": {
reason: "Should skip updating a managed resource to persist late initialized fields and should trigger a requeue after a long wait.",
Expand Down
Loading