-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reconciler/managed: add crossplane_resource_drift_seconds metric #489
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
/* | ||
Copyright 2023 The Crossplane Authors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package managed | ||
|
||
import ( | ||
"context" | ||
"sync" | ||
"time" | ||
|
||
"github.com/prometheus/client_golang/prometheus" | ||
"k8s.io/apimachinery/pkg/runtime/schema" | ||
"k8s.io/client-go/tools/cache" | ||
kmetrics "k8s.io/component-base/metrics" | ||
"sigs.k8s.io/controller-runtime/pkg/cluster" | ||
"sigs.k8s.io/controller-runtime/pkg/manager" | ||
"sigs.k8s.io/controller-runtime/pkg/metrics" | ||
|
||
"github.com/crossplane/crossplane-runtime/pkg/errors" | ||
"github.com/crossplane/crossplane-runtime/pkg/resource" | ||
) | ||
|
||
func init() { | ||
metrics.Registry.MustRegister(drift) | ||
} | ||
|
||
var subSystem = "crossplane" | ||
|
||
var ( | ||
drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{ | ||
Subsystem: subSystem, | ||
Name: "resource_drift_seconds", | ||
Help: "ALPHA: How long since the previous successful reconcile when a resource was found to be out of sync; excludes restart of the provider", | ||
Buckets: kmetrics.ExponentialBuckets(10e-9, 10, 10), | ||
}, []string{"group", "kind"}) | ||
) | ||
|
||
// driftRecorder records the time since the last observation of a resource | ||
// and records the time since on update as a metric. This represents an upper | ||
// bound for the duration the drift existed. | ||
type driftRecorder struct { | ||
lastObservation sync.Map | ||
gvk schema.GroupVersionKind | ||
|
||
cluster cluster.Cluster | ||
} | ||
|
||
var _ manager.Runnable = &driftRecorder{} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: I prefer to add these "does it satisfy the interface" checks to test files, since they are a kind of test. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would argue the inverse: they are for the reader aka documentation. Testing is a side-effect. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They're both, and besides documentation (e.g. Godoc examples) can go in test files too. Not a hill I'll die on, but precedent elsewhere in Crossplane is to put these in test files. |
||
|
||
func (r *driftRecorder) Start(ctx context.Context) error { | ||
inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk) | ||
if err != nil { | ||
return errors.Wrapf(err, "cannot get informer for drift recorder for resource %s", r.gvk) | ||
} | ||
|
||
registered, err := inf.AddEventHandler(cache.ResourceEventHandlerFuncs{ | ||
DeleteFunc: func(obj interface{}) { | ||
if final, ok := obj.(cache.DeletedFinalStateUnknown); ok { | ||
obj = final.Obj | ||
} | ||
managed := obj.(resource.Managed) | ||
r.lastObservation.Delete(managed.GetName()) | ||
}, | ||
}) | ||
if err != nil { | ||
return errors.Wrap(err, "cannot add delete event handler to informer for drift recorder") | ||
} | ||
defer inf.RemoveEventHandler(registered) //nolint:errcheck // this happens on destruction. We cannot do anything anyway. | ||
|
||
<-ctx.Done() | ||
|
||
return nil | ||
} | ||
negz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
func (r *driftRecorder) recordUnchanged(name string) { | ||
r.lastObservation.Store(name, time.Now()) | ||
} | ||
|
||
func (r *driftRecorder) recordUpdate(name string) { | ||
last, ok := r.lastObservation.Load(name) | ||
if !ok { | ||
return | ||
} | ||
drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last.(time.Time)).Seconds()) | ||
|
||
r.lastObservation.Store(name, time.Now()) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,6 +18,7 @@ package managed | |
|
||
import ( | ||
"context" | ||
"fmt" | ||
"strings" | ||
"time" | ||
|
||
|
@@ -88,6 +89,30 @@ const ( | |
reasonReconciliationPaused event.Reason = "ReconciliationPaused" | ||
) | ||
|
||
// RetryAfterError is returned by a reconciler when it is unable to complete | ||
// the actions, and the reconciler should retry after the specified duration. | ||
type RetryAfterError struct { | ||
retryAfter time.Duration | ||
} | ||
|
||
// NewRetryAfterError creates a new ErrRetryAfter with the given retry-after duration. | ||
func NewRetryAfterError(retryAfter time.Duration) RetryAfterError { | ||
return RetryAfterError{retryAfter: retryAfter} | ||
} | ||
|
||
// DefaultRetryAfter is the default retry-after duration for ErrRetryAfter. | ||
const DefaultRetryAfter = 30 * time.Second | ||
|
||
// Error implements the error interface. | ||
func (e RetryAfterError) Error() string { | ||
return fmt.Sprintf("retry after %v", e.retryAfter) | ||
} | ||
|
||
// RetryAfter returns the duration after which the reconciler should retry. | ||
func (e RetryAfterError) RetryAfter() time.Duration { | ||
return e.retryAfter | ||
} | ||
|
||
// ControllerName returns the recommended name for controllers that use this | ||
// package to reconcile a particular kind of managed resource. | ||
func ControllerName(kind string) string { | ||
|
@@ -318,15 +343,27 @@ type ExternalClient interface { | |
// external resource does not exist. Create implementations may update | ||
// managed resource annotations, and those updates will be persisted. | ||
// All other updates will be discarded. | ||
// | ||
// This can return RetryAfterError to indicate that the resource cannot | ||
// be created at this time, and that the controller should retry after | ||
// the given duration. | ||
Create(ctx context.Context, mg resource.Managed) (ExternalCreation, error) | ||
|
||
// Update the external resource represented by the supplied Managed | ||
// resource, if necessary. Called unless Observe reports that the | ||
// associated external resource is up to date. | ||
// | ||
// This can return RetryAfterError to indicate that the resource cannot | ||
// be updated at this time, and that the controller should retry after | ||
// the given duration. | ||
Update(ctx context.Context, mg resource.Managed) (ExternalUpdate, error) | ||
|
||
// Delete the external resource upon deletion of its associated Managed | ||
// resource. Called when the managed resource has been deleted. | ||
// | ||
// This can return RetryAfterError to indicate that the resource cannot | ||
// be deleted at this time, and that the controller should retry after | ||
// the given duration. | ||
Delete(ctx context.Context, mg resource.Managed) error | ||
} | ||
|
||
|
@@ -482,6 +519,8 @@ type Reconciler struct { | |
|
||
features feature.Flags | ||
|
||
driftRecorder driftRecorder | ||
|
||
// The below structs embed the set of interfaces used to implement the | ||
// managed resource reconciler. We do this primarily for readability, so | ||
// that the reconciler logic reads r.external.Connect(), | ||
|
@@ -671,6 +710,7 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp | |
creationGracePeriod: defaultGracePeriod, | ||
timeout: reconcileTimeout, | ||
managed: defaultMRManaged(m), | ||
driftRecorder: driftRecorder{cluster: m}, | ||
external: defaultMRExternal(), | ||
supportedManagementPolicies: defaultSupportedManagementPolicies(), | ||
log: logging.NewNopLogger(), | ||
|
@@ -681,6 +721,11 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp | |
ro(r) | ||
} | ||
|
||
if err := m.Add(&r.driftRecorder); err != nil { | ||
r.log.Info("unable to register drift recorder with controller manager", "error", err) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be a debug log? What would a typical user do with/about this log when they saw it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would even promote it to an error if I could with our logging interface. This signals that something is broken. It's not debug output. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alternative: we add an error as return value as a breaking change. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some rationale as to why we don't have error log level: https://dave.cheney.net/2015/11/05/lets-talk-about-logging There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the second part of my original comment still stands though - how would a user fix this? What would cause this? In the spirit of the better errors guide - what does "unable to register drift recorder" mean? Keeping at info level is fine with me if it's meaningful and actionable to the user. Channeling the good errors guide, I don't personally think I'd know what "unable to register drift recorder" meant if I hadn't read this PR, or what I should do about it. |
||
// no way to recover from this | ||
} | ||
|
||
return r | ||
} | ||
|
||
|
@@ -892,7 +937,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco | |
log = log.WithValues("deletion-timestamp", managed.GetDeletionTimestamp()) | ||
|
||
if observation.ResourceExists && policy.ShouldDelete() { | ||
if err := external.Delete(externalCtx, managed); err != nil { | ||
err := external.Delete(externalCtx, managed) | ||
var retryAfterErr RetryAfterError | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: This long variable name is redundant given its limited scope and the fact that its type There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
if errors.As(err, &retryAfterErr) { | ||
log.Debug("External resource cannot be deleted now", "requeue-after", time.Now().Add(retryAfterErr.RetryAfter())) | ||
return reconcile.Result{RequeueAfter: retryAfterErr.RetryAfter()}, nil | ||
} | ||
if err != nil { | ||
// We'll hit this condition if we can't delete our external | ||
// resource, for example if our provider credentials don't have | ||
// access to delete it. If this is the first time we encounter | ||
|
@@ -981,6 +1032,11 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco | |
} | ||
|
||
creation, err := external.Create(externalCtx, managed) | ||
var retryAfterErr RetryAfterError | ||
if errors.As(err, &retryAfterErr) { | ||
log.Debug("External resource cannot be created now", "requeue-after", time.Now().Add(retryAfterErr.RetryAfter())) | ||
return reconcile.Result{RequeueAfter: retryAfterErr.RetryAfter()}, nil | ||
} | ||
if err != nil { | ||
// We'll hit this condition if we can't create our external | ||
// resource, for example if our provider credentials don't have | ||
|
@@ -1079,6 +1135,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco | |
// https://github.com/crossplane/crossplane/issues/289 | ||
log.Debug("External resource is up to date", "requeue-after", time.Now().Add(r.pollInterval)) | ||
managed.SetConditions(xpv1.ReconcileSuccess()) | ||
|
||
// record that we intentionally did not update the managed resource | ||
// because no drift was detected. We call this so late in the reconcile | ||
// because all the cases above could contribute (for different reasons) | ||
// that the external object would not have been updated. | ||
r.driftRecorder.recordUnchanged(managed.GetName()) | ||
|
||
return reconcile.Result{RequeueAfter: r.pollInterval}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus) | ||
} | ||
|
||
|
@@ -1094,6 +1157,11 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco | |
} | ||
|
||
update, err := external.Update(externalCtx, managed) | ||
var retryAfterErr RetryAfterError | ||
if errors.As(err, &retryAfterErr) { | ||
log.Debug("External resource cannot be updated now", "requeue-after", time.Now().Add(retryAfterErr.RetryAfter())) | ||
return reconcile.Result{RequeueAfter: retryAfterErr.RetryAfter()}, nil | ||
} | ||
if err != nil { | ||
// We'll hit this condition if we can't update our external resource, | ||
// for example if our provider credentials don't have access to update | ||
|
@@ -1106,6 +1174,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco | |
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus) | ||
} | ||
|
||
// record the drift after the successful update. | ||
r.driftRecorder.recordUpdate(managed.GetName()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we also reset the timer here (i.e. call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. there would be an unchanged case afterwards. But I agree, we can directly reset. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Upjet runtime signals a resource-up-to-date in certain circumstances without actually making sure that the external resource is actually up-to-date:
Especially the first case is common because we asynchronously run the Terraform operations. The recorded metrics will be reflecting a lower bound on the intended metric in this case. The proposed changes are good from my perspective. The cases mentioned above can be considered as some violations of the contract between a provider and the managed reconciler. Just dropping this note here to increase awareness. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm, maybe we should add an error type to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. background: this metric is really only useful if it is an upper bound. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ulucinar see second commit. Is that reasonable? |
||
|
||
if _, err := r.managed.PublishConnection(ctx, managed, update.ConnectionDetails); err != nil { | ||
// If this is the first time we encounter this issue we'll be requeued | ||
// implicitly when we update our status with the new error condition. If | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a precedent of using the prometheus client elsewhere? It might be nice to use Opencensus (e.g. to have a single metrics library if we wanted to add tracing in future). I'm guessing we may not get a choice in the matter, since we presumably need to extend the metrics controller-runtime is exposing (and I think it uses the Prometheus client per kubernetes-sigs/controller-runtime#305).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally did what controller-runtime does. Not sure what you are suggesting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really suggesting anything. 🙂 I'm essentially asking whether using Opencensus instead would be possible and advisable, but I think I know the answer is "no, not possible".