reconciler/managed: add crossplane_resource_drift_seconds metric #489
Conversation
Force-pushed from 29e3853 to cc7e838
Force-pushed from 22de3f8 to 40ac673
Thank you @sttts, this is a great step toward improving observability 💪
Left a couple of comments, but none of them is blocking.
Could you please also populate the "How has this code been tested" section in the PR description 🙏
pkg/reconciler/managed/metrics.go (Outdated)
	},
})
if err != nil {
	return err
nit: should we wrap these (this and the above) errors with more details?
fixed
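For reference, the kind of wrapping being asked for could look like the following sketch, assuming crossplane-runtime's errors.Wrap helper; the message string is illustrative, not the one used in the actual fix:

if err != nil {
	return errors.Wrap(err, "cannot register resource drift metrics")
}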
pkg/reconciler/managed/metrics.go
Outdated
drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{ | ||
Subsystem: subSystem, | ||
Name: "resource_drift_seconds", | ||
Help: "ALPHA: How long since the previous reconcile when a resource was found to be out of sync; excludes restart of the provider", |
"since the previous reconcile"
Should this be "since the previous successful reconcile" or even "since the previous successful reconcile where the resource was observed as synced"?
fixed
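After the wording fix, the histogram definition would presumably read roughly as below. This is only a sketch; the bucket boundaries and label names are assumptions, not the merged code:

drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: subSystem,
	Name:      "resource_drift_seconds",
	Help:      "ALPHA: How long since the previous successful reconcile when a resource was found to be out of sync; excludes restart of the provider",
	Buckets:   prometheus.ExponentialBuckets(10, 2, 14), // assumed: ~10s up to roughly a day
}, []string{"group", "kind"})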
@@ -1106,6 +1121,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
 		return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
 	}
 
+	// record the drift after the successful update.
+	r.driftRecorder.recordUpdate(managed.GetName())
Should we also reset the timer here (i.e. call recordUnchanged)? I believe we rely on an upcoming reconciliation which we expect to happen after a successful update, right?
there would be an unchanged case afterwards. But I agree, we can directly reset.
done
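For context, a minimal sketch of what such a recorder could look like, following the names quoted in this review; the map-based bookkeeping, the mutex, and the group/kind labels are assumptions, and the usual imports (sync, time, apimachinery's schema, controller-runtime's cluster) are omitted:

// driftRecorder remembers when each managed resource was last seen in sync
// and emits the elapsed time once drift is detected.
type driftRecorder struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // keyed by resource name

	gvk     schema.GroupVersionKind
	cluster cluster.Cluster
}

// recordUnchanged resets the timer for a resource observed as in sync.
func (r *driftRecorder) recordUnchanged(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[name] = time.Now()
}

// recordUpdate observes how long the resource had been drifting before the
// successful update, then resets its timer as agreed above.
func (r *driftRecorder) recordUpdate(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastSeen[name]; ok {
		drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last).Seconds())
	}
	r.lastSeen[name] = time.Now()
}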
Force-pushed from f4d41ab to 6123931
Thanks @sttts. I will bump the crossplane-runtime version across the official providers once this PR is merged.
pkg/reconciler/managed/metrics.go (Outdated)
func (r *driftRecorder) Start(ctx context.Context) error {
	inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk)
	if err != nil {
		return err
Should we wrap the error here for more context?
done
@@ -1106,6 +1121,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
 		return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
 	}
 
+	// record the drift after the successful update.
+	r.driftRecorder.recordUpdate(managed.GetName())
Upjet runtime signals resource-up-to-date in certain circumstances without making sure that the external resource is actually up-to-date:
- If an async operation is still ongoing at the time the resource is being reconciled
- If the native provider's TTL has expired and we have not yet drained the runner
The first case in particular is common because we run the Terraform operations asynchronously. In that case the recorded metric will reflect a lower bound on the intended metric.
The proposed changes are good from my perspective. The cases mentioned above can be considered violations of the contract between a provider and the managed reconciler. Just dropping this note here to increase awareness.
hmm, maybe we should add an error type to external.Update to "come back later"?
background: this metric is really only useful if it is an upper bound.
@ulucinar see second commit. Is that reasonable?
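The "come back later" idea could take the shape of a dedicated error type that an ExternalClient returns and the reconciler recognizes. A rough sketch of an assumed shape, not necessarily what the second commit implements:

// RetryAfterError tells the managed reconciler that the external system
// cannot give an authoritative answer yet and that it should requeue
// instead of recording the observation.
type RetryAfterError struct {
	after time.Duration
}

// NewRetryAfterError returns an error asking to come back after d.
func NewRetryAfterError(d time.Duration) RetryAfterError {
	return RetryAfterError{after: d}
}

// RetryAfter returns the requested requeue delay.
func (e RetryAfterError) RetryAfter() time.Duration { return e.after }

func (e RetryAfterError) Error() string {
	return fmt.Sprintf("come back again after %s", e.after)
}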
Force-pushed from 6123931 to 938e13e
var subSystem = "crossplane"

var (
	drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Do we have a precedent of using the prometheus client elsewhere? It might be nice to use Opencensus (e.g. to have a single metrics library if we wanted to add tracing in future). I'm guessing we may not get a choice in the matter, since we presumably need to extend the metrics controller-runtime is exposing (and I think it uses the Prometheus client per kubernetes-sigs/controller-runtime#305).
I intentionally did what controller-runtime does. Not sure what you are suggesting here?
I'm not really suggesting anything. 🙂 I'm essentially asking whether using Opencensus instead would be possible and advisable, but I think I know the answer is "no, not possible".
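For context on why the Prometheus client is the path of least resistance here: controller-runtime serves everything registered in its own Prometheus registry on the manager's /metrics endpoint, so a custom histogram only needs to be added there. A sketch, with the init-time registration point being an assumption:

import "sigs.k8s.io/controller-runtime/pkg/metrics"

func init() {
	// metrics.Registry is the registry controller-runtime exposes on /metrics;
	// registering the histogram here makes it appear next to the built-in
	// controller metrics.
	metrics.Registry.MustRegister(drift)
}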
@@ -681,6 +721,11 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp
 		ro(r)
 	}
 
+	if err := m.Add(&r.driftRecorder); err != nil {
+		r.log.Info("unable to register drift recorder with controller manager", "error", err)
Should this be a debug log? What would a typical user do with/about this log when they saw it?
Would even promote it to an error if I could with our logging interface. This signals that something is broken. It's not debug output.
Alternative: we add an error as return value as a breaking change.
Some rationale as to why we don't have error log level: https://dave.cheney.net/2015/11/05/lets-talk-about-logging
I think the second part of my original comment still stands though - how would a user fix this? What would cause this? In the spirit of the better errors guide - what does "unable to register drift recorder" mean?
Keeping at info level is fine with me if it's meaningful and actionable to the user. Channeling the good errors guide, I don't personally think I'd know what "unable to register drift recorder" meant if I hadn't read this PR, or what I should do about it.
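Purely as an illustration of a more self-explanatory, actionable message in the spirit of that guide (the wording is invented here, not taken from the PR):

if err := m.Add(&r.driftRecorder); err != nil {
	r.log.Info("cannot set up the resource drift recorder; the crossplane_resource_drift_seconds metric will not be emitted", "error", err)
}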
@@ -892,7 +937,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
 	log = log.WithValues("deletion-timestamp", managed.GetDeletionTimestamp())
 
 	if observation.ResourceExists && policy.ShouldDelete() {
-		if err := external.Delete(externalCtx, managed); err != nil {
+		err := external.Delete(externalCtx, managed)
+		var retryAfterErr RetryAfterError
Nit: This long variable name is redundant given its limited scope and the fact that its type, RetryAfterError, is written immediately after its name.
fixed
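With the shorter name, the consuming side presumably reads roughly like the sketch below; the errors.As check and the requeue handling are assumptions based on the quoted diff, not the exact merged code:

err := external.Delete(externalCtx, managed)
var retryAfter RetryAfterError
if errors.As(err, &retryAfter) {
	// The external system cannot act yet; come back after the requested delay
	// without treating this attempt as a failed delete.
	return reconcile.Result{RequeueAfter: retryAfter.RetryAfter()}, nil
}
if err != nil {
	// ... existing delete error handling ...
}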
	cluster cluster.Cluster
}

var _ manager.Runnable = &driftRecorder{}
Nit: I prefer to add these "does it satisfy the interface" checks to test files, since they are a kind of test.
Would argue the inverse: they are for the reader aka documentation. Testing is a side-effect.
They're both, and besides documentation (e.g. Godoc examples) can go in test files too. Not a hill I'll die on, but precedent elsewhere in Crossplane is to put these in test files.
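For readers unfamiliar with the idiom, the line in question is the standard compile-time assertion; it has the same effect whether it lives in the implementation file or a test file:

// Fails to compile if *driftRecorder ever stops implementing
// manager.Runnable, i.e. Start(context.Context) error.
var _ manager.Runnable = &driftRecorder{}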
This was merged via #683. |
Description of your changes
Adds a metric crossplane_resource_drift_seconds that records the time since the previous reconcile when a resource was found to be out of sync. As we don't record the last reconcile on the object (for reasons: we cannot update an object on every observe), we count from the last reconcile in the same process. This obviously means that we don't emit any metric when the provider restarts.
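As a usage note, the metric is a plain Prometheus histogram, so it can be queried with the usual patterns; for example, an illustrative query for the 95th-percentile drift over the last hour would be histogram_quantile(0.95, sum(rate(crossplane_resource_drift_seconds_bucket[1h])) by (le)).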
I have:
- Run make reviewable test to ensure this PR is ready for review.

How has this code been tested