Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconciler/managed: add crossplane_resource_drift_seconds metric #489

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ require (
github.com/go-logr/logr v1.2.4
github.com/google/go-cmp v0.5.9
github.com/hashicorp/vault/api v1.9.2
github.com/prometheus/client_golang v1.15.1
github.com/spf13/afero v1.9.5
golang.org/x/time v0.3.0
google.golang.org/grpc v1.57.0
Expand All @@ -17,6 +18,7 @@ require (
k8s.io/apiextensions-apiserver v0.27.4
k8s.io/apimachinery v0.27.4
k8s.io/client-go v0.27.4
k8s.io/component-base v0.27.4
sigs.k8s.io/controller-runtime v0.15.0
sigs.k8s.io/controller-tools v0.12.1
sigs.k8s.io/yaml v1.3.0
Expand All @@ -26,6 +28,7 @@ require (
github.com/Azure/go-ansiterm v0.0.0-20230124172434-306776ec8161 // indirect
github.com/Microsoft/go-winio v0.6.1 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/blang/semver/v4 v4.0.0 // indirect
github.com/bufbuild/connect-go v1.9.0 // indirect
github.com/bufbuild/connect-opentelemetry-go v0.4.0 // indirect
github.com/bufbuild/protocompile v0.6.0 // indirect
Expand Down Expand Up @@ -93,7 +96,6 @@ require (
github.com/pkg/browser v0.0.0-20210911075715-681adbf594b8 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/pkg/profile v1.7.0 // indirect
github.com/prometheus/client_golang v1.15.1 // indirect
github.com/prometheus/client_model v0.4.0 // indirect
github.com/prometheus/common v0.44.0 // indirect
github.com/prometheus/procfs v0.10.0 // indirect
Expand Down Expand Up @@ -127,7 +129,6 @@ require (
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/component-base v0.27.4 // indirect
k8s.io/klog/v2 v2.100.1 // indirect
k8s.io/kube-openapi v0.0.0-20230525220651-2546d827e515 // indirect
k8s.io/utils v0.0.0-20230505201702-9f6742963106 // indirect
Expand Down
2 changes: 2 additions & 0 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,8 @@ github.com/benbjohnson/clock v1.3.5 h1:VvXlSJBzZpA/zum6Sj74hxwYI2DIxRWuNIoXAzHZz
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
github.com/bgentry/speakeasy v0.1.0/go.mod h1:+zsyZBPWlz7T6j88CTgSN5bM796AkVf0kBD4zp0CCIs=
github.com/blang/semver/v4 v4.0.0 h1:1PFHFE6yCCTv8C1TeyNNarDzntLi7wMI5i/pzqYIsAM=
github.com/blang/semver/v4 v4.0.0/go.mod h1:IbckMUScFkM3pff0VJDNKRiT6TG/YpiHIM2yvyW5YoQ=
github.com/bufbuild/buf v1.25.1 h1:8ed5AjZ+zPIJf72rxtfsDit/MtaBimaSRn9Y+5G++y0=
github.com/bufbuild/buf v1.25.1/go.mod h1:UMPncXMWgrmIM+0QpwTEwjNr2SA0z2YIVZZsmNflvB4=
github.com/bufbuild/connect-go v1.9.0 h1:JIgAeNuFpo+SUPfU19Yt5TcWlznsN5Bv10/gI/6Pjoc=
Expand Down
100 changes: 100 additions & 0 deletions pkg/reconciler/managed/metrics.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
/*
Copyright 2023 The Crossplane Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package managed

import (
"context"
"sync"
"time"

"github.com/prometheus/client_golang/prometheus"
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/client-go/tools/cache"
kmetrics "k8s.io/component-base/metrics"
"sigs.k8s.io/controller-runtime/pkg/cluster"
"sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/metrics"

"github.com/crossplane/crossplane-runtime/pkg/errors"
"github.com/crossplane/crossplane-runtime/pkg/resource"
)

func init() {
metrics.Registry.MustRegister(drift)
}

var subSystem = "crossplane"

var (
drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a precedent of using the prometheus client elsewhere? It might be nice to use Opencensus (e.g. to have a single metrics library if we wanted to add tracing in future). I'm guessing we may not get a choice in the matter, since we presumably need to extend the metrics controller-runtime is exposing (and I think it uses the Prometheus client per kubernetes-sigs/controller-runtime#305).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally did what controller-runtime does. Not sure what you are suggesting here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really suggesting anything. 🙂 I'm essentially asking whether using Opencensus instead would be possible and advisable, but I think I know the answer is "no, not possible".

Subsystem: subSystem,
Name: "resource_drift_seconds",
Help: "ALPHA: How long since the previous successful reconcile when a resource was found to be out of sync; excludes restart of the provider",
Buckets: kmetrics.ExponentialBuckets(10e-9, 10, 10),
}, []string{"group", "kind"})
)

// driftRecorder records the time since the last observation of a resource
// and records the time since on update as a metric. This represents an upper
// bound for the duration the drift existed.
type driftRecorder struct {
lastObservation sync.Map
gvk schema.GroupVersionKind

cluster cluster.Cluster
}

var _ manager.Runnable = &driftRecorder{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I prefer to add these "does it satisfy the interface" checks to test files, since they are a kind of test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would argue the inverse: they are for the reader aka documentation. Testing is a side-effect.

Copy link
Member

@negz negz Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're both, and besides documentation (e.g. Godoc examples) can go in test files too. Not a hill I'll die on, but precedent elsewhere in Crossplane is to put these in test files.


func (r *driftRecorder) Start(ctx context.Context) error {
inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk)
if err != nil {
return errors.Wrapf(err, "cannot get informer for drift recorder for resource %s", r.gvk)
}

registered, err := inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: func(obj interface{}) {
if final, ok := obj.(cache.DeletedFinalStateUnknown); ok {
obj = final.Obj
}
managed := obj.(resource.Managed)
r.lastObservation.Delete(managed.GetName())
},
})
if err != nil {
return errors.Wrap(err, "cannot add delete event handler to informer for drift recorder")
}
defer inf.RemoveEventHandler(registered) //nolint:errcheck // this happens on destruction. We cannot do anything anyway.

<-ctx.Done()

return nil
}
negz marked this conversation as resolved.
Show resolved Hide resolved

func (r *driftRecorder) recordUnchanged(name string) {
r.lastObservation.Store(name, time.Now())
}

func (r *driftRecorder) recordUpdate(name string) {
last, ok := r.lastObservation.Load(name)
if !ok {
return
}
drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last.(time.Time)).Seconds())

r.lastObservation.Store(name, time.Now())
}
73 changes: 72 additions & 1 deletion pkg/reconciler/managed/reconciler.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ package managed

import (
"context"
"fmt"
"strings"
"time"

Expand Down Expand Up @@ -88,6 +89,30 @@ const (
reasonReconciliationPaused event.Reason = "ReconciliationPaused"
)

// RetryAfterError is returned by a reconciler when it is unable to complete
// the actions, and the reconciler should retry after the specified duration.
type RetryAfterError struct {
retryAfter time.Duration
}

// NewRetryAfterError creates a new ErrRetryAfter with the given retry-after duration.
func NewRetryAfterError(retryAfter time.Duration) RetryAfterError {
return RetryAfterError{retryAfter: retryAfter}
}

// DefaultRetryAfter is the default retry-after duration for ErrRetryAfter.
const DefaultRetryAfter = 30 * time.Second

// Error implements the error interface.
func (e RetryAfterError) Error() string {
return fmt.Sprintf("retry after %v", e.retryAfter)
}

// RetryAfter returns the duration after which the reconciler should retry.
func (e RetryAfterError) RetryAfter() time.Duration {
return e.retryAfter
}

// ControllerName returns the recommended name for controllers that use this
// package to reconcile a particular kind of managed resource.
func ControllerName(kind string) string {
Expand Down Expand Up @@ -318,15 +343,27 @@ type ExternalClient interface {
// external resource does not exist. Create implementations may update
// managed resource annotations, and those updates will be persisted.
// All other updates will be discarded.
//
// This can return RetryAfterError to indicate that the resource cannot
// be created at this time, and that the controller should retry after
// the given duration.
Create(ctx context.Context, mg resource.Managed) (ExternalCreation, error)

// Update the external resource represented by the supplied Managed
// resource, if necessary. Called unless Observe reports that the
// associated external resource is up to date.
//
// This can return RetryAfterError to indicate that the resource cannot
// be updated at this time, and that the controller should retry after
// the given duration.
Update(ctx context.Context, mg resource.Managed) (ExternalUpdate, error)

// Delete the external resource upon deletion of its associated Managed
// resource. Called when the managed resource has been deleted.
//
// This can return RetryAfterError to indicate that the resource cannot
// be deleted at this time, and that the controller should retry after
// the given duration.
Delete(ctx context.Context, mg resource.Managed) error
}

Expand Down Expand Up @@ -482,6 +519,8 @@ type Reconciler struct {

features feature.Flags

driftRecorder driftRecorder

// The below structs embed the set of interfaces used to implement the
// managed resource reconciler. We do this primarily for readability, so
// that the reconciler logic reads r.external.Connect(),
Expand Down Expand Up @@ -671,6 +710,7 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp
creationGracePeriod: defaultGracePeriod,
timeout: reconcileTimeout,
managed: defaultMRManaged(m),
driftRecorder: driftRecorder{cluster: m},
external: defaultMRExternal(),
supportedManagementPolicies: defaultSupportedManagementPolicies(),
log: logging.NewNopLogger(),
Expand All @@ -681,6 +721,11 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp
ro(r)
}

if err := m.Add(&r.driftRecorder); err != nil {
r.log.Info("unable to register drift recorder with controller manager", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a debug log? What would a typical user do with/about this log when they saw it?

Copy link
Contributor Author

@sttts sttts Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would even promote it to an error if I could with our logging interface. This signals that something is broken. It's not debug output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative: we add an error as return value as a breaking change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some rationale as to why we don't have error log level: https://dave.cheney.net/2015/11/05/lets-talk-about-logging

Copy link
Member

@negz negz Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second part of my original comment still stands though - how would a user fix this? What would cause this? In the spirit of the better errors guide - what does "unable to register drift recorder" mean?

Keeping at info level is fine with me if it's meaningful and actionable to the user. Channeling the good errors guide, I don't personally think I'd know what "unable to register drift recorder" meant if I hadn't read this PR, or what I should do about it.

// no way to recover from this
}

return r
}

Expand Down Expand Up @@ -892,7 +937,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
log = log.WithValues("deletion-timestamp", managed.GetDeletionTimestamp())

if observation.ResourceExists && policy.ShouldDelete() {
if err := external.Delete(externalCtx, managed); err != nil {
err := external.Delete(externalCtx, managed)
var retryAfterErr RetryAfterError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This long variable name is redundant given its limited scope and the fact that its type RetryAfterError is written immediately after its name.

https://github.com/crossplane/crossplane/tree/master/contributing#use-descriptive-variable-names-sparingly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if errors.As(err, &retryAfterErr) {
log.Debug("External resource cannot be deleted now", "requeue-after", time.Now().Add(retryAfterErr.RetryAfter()))
return reconcile.Result{RequeueAfter: retryAfterErr.RetryAfter()}, nil
}
if err != nil {
// We'll hit this condition if we can't delete our external
// resource, for example if our provider credentials don't have
// access to delete it. If this is the first time we encounter
Expand Down Expand Up @@ -981,6 +1032,11 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
}

creation, err := external.Create(externalCtx, managed)
var retryAfterErr RetryAfterError
if errors.As(err, &retryAfterErr) {
log.Debug("External resource cannot be created now", "requeue-after", time.Now().Add(retryAfterErr.RetryAfter()))
return reconcile.Result{RequeueAfter: retryAfterErr.RetryAfter()}, nil
}
if err != nil {
// We'll hit this condition if we can't create our external
// resource, for example if our provider credentials don't have
Expand Down Expand Up @@ -1079,6 +1135,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
// https://github.com/crossplane/crossplane/issues/289
log.Debug("External resource is up to date", "requeue-after", time.Now().Add(r.pollInterval))
managed.SetConditions(xpv1.ReconcileSuccess())

// record that we intentionally did not update the managed resource
// because no drift was detected. We call this so late in the reconcile
// because all the cases above could contribute (for different reasons)
// that the external object would not have been updated.
r.driftRecorder.recordUnchanged(managed.GetName())

return reconcile.Result{RequeueAfter: r.pollInterval}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

Expand All @@ -1094,6 +1157,11 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
}

update, err := external.Update(externalCtx, managed)
var retryAfterErr RetryAfterError
if errors.As(err, &retryAfterErr) {
log.Debug("External resource cannot be updated now", "requeue-after", time.Now().Add(retryAfterErr.RetryAfter()))
return reconcile.Result{RequeueAfter: retryAfterErr.RetryAfter()}, nil
}
if err != nil {
// We'll hit this condition if we can't update our external resource,
// for example if our provider credentials don't have access to update
Expand All @@ -1106,6 +1174,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

// record the drift after the successful update.
r.driftRecorder.recordUpdate(managed.GetName())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also reset the timer here (i.e. call recordUnchanged)? I believe we rely on an upcoming reconciliation which we expect to happen after a successful update, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there would be an unchanged case afterwards. But I agree, we can directly reset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upjet runtime signals a resource-up-to-date in certain circumstances without actually making sure that the external resource is actually up-to-date:

  • If an async operation is still ongoing at the time the resource is being reconciled
  • If the native provider's ttl has expired and we have not yet drained the runner

Especially the first case is common because we asynchronously run the Terraform operations. The recorded metrics will be reflecting a lower bound on the intended metric in this case.

The proposed changes are good from my perspective. The cases mentioned above can be considered as some violations of the contract between a provider and the managed reconciler. Just dropping this note here to increase awareness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe we should add an error type to external.Update to "come back later" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

background: this metric is really only useful if it is an upper bound.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulucinar see second commit. Is that reasonable?


if _, err := r.managed.PublishConnection(ctx, managed, update.ConnectionDetails); err != nil {
// If this is the first time we encounter this issue we'll be requeued
// implicitly when we update our status with the new error condition. If
Expand Down
5 changes: 5 additions & 0 deletions pkg/resource/fake/mocks.go
Original file line number Diff line number Diff line change
Expand Up @@ -480,6 +480,11 @@ func (m *Manager) GetRESTMapper() meta.RESTMapper { return m.RESTMapper }
// GetLogger returns the logger.
func (m *Manager) GetLogger() logr.Logger { return m.Logger }

// Add adds a runnable to the manager.
func (m *Manager) Add(_ manager.Runnable) error {
return nil // do nothing
}

// GV returns a mock schema.GroupVersion.
var GV = schema.GroupVersion{Group: "g", Version: "v"}

Expand Down