Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make use of rate limiting in managed reconciler #243

Merged
merged 3 commits into from
Feb 19, 2021

Conversation

hasheddan
Copy link
Member

@hasheddan hasheddan commented Feb 11, 2021

Description of your changes

Updates managed reconciler to return Requeue: true instead of
RequeueAfter when encountering non-status update errors. This allows the
reconciler to make use of the controller rate limiter rather use a
constant value for requeuing after errors.

This also renames longWait to pollInterval to more accurately reflect
its behavior, and adds default rate limiters to be used at both the provider and
controller level. The provider rate limiter is a configurable token
bucket and the controller limiter is a max of limiter that uses the
provider limiter and a per-item exponential backoff.

Signed-off-by: hasheddan [email protected]

xref #40

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Unit tests.

@hasheddan hasheddan requested a review from negz February 11, 2021 19:49
func (r *Reconciler) maybeBackoff(ctx context.Context, m resource.Managed, wait time.Duration, err error) (reconcile.Result, error) {
updateErr := errors.Wrap(r.client.Status().Update(ctx, m), errUpdateManagedStatus)
if r.useBackoff && err != nil {
return reconcile.Result{}, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could concatenate the update error in the case where we both pass an error, backoff is enabled, and the status update fails.

}
return reconcile.Result{RequeueAfter: wait}, updateErr
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little averse to wrappers that produce return values like this, but I can't think of a better approach if we want to allow folks to pick between backoff or fixed requeuafter values.

Do you think we could just straight-up switch to using backoff? i.e. If/when folks bump to this version of runtime they'd be using backoff? Providers could cut over at their own leisure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would feel pretty good about just switching over if we include a default implementation (and potentially some variations) that are recommended because we are essentially just using an extremely naive rate limiting function right now as it is. Would just need to explicitly call out in release notes though or else consumers would be defaulting to backing off to ~16 minutes with the controller-runtime implementation. The long wait would remain 1 minute thought so likely not too big of a problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If/when folks bump to this version of runtime they'd be using backoff? Providers could cut over at their own leisure.

That sounds good to me.

I would feel pretty good about just switching over if we include a default implementation (and potentially some variations) that are recommended because we are essentially just using an extremely naive rate limiting function right now as it is.

I agree that a simple default rate limiter would go a long way here. I believe the big three providers would just opt for using the default, maybe except provider-azure since its requests take longer than others.

@@ -42,8 +42,7 @@ const (
reconcileGracePeriod = 30 * time.Second
reconcileTimeout = 1 * time.Minute

defaultManagedShortWait = 30 * time.Second
defaultManagedLongWait = 1 * time.Minute
defaultManagedLongWait = 1 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe long doesn't make sense in context anymore, since we don't have a short wait? I tend to think of this as the "speculative" or "happy path" poll interval these days.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@negz I was considering that as well, but didn't know if we want to the WithLongWait() on update. I think most providers don't actually use that option (i.e. they default to the 1 minute), and it might also be good for the break to be exposed to give a stronger indication that something has changed behind the scenes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd be making a fairly equivalent breaking change in removing WithShortWait(), right? We could also have WithLongWait become an alias for WithPollInterval or whatever we called this now.

@@ -545,10 +545,10 @@ func (r *Reconciler) Reconcile(_ context.Context, req reconcile.Request) (reconc
// If this is the first time we encounter this issue we'll be requeued
// implicitly when we update our status with the new error condition.
// If not, we want to try again after a short wait.
log.Debug("Cannot initialize managed resource", "error", err, "requeue-after", time.Now().Add(r.shortWait))
log.Debug("Cannot initialize managed resource", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to bring this into scope for this PR, but it would be great if we could eventually make our backoff state inspectable such that we could keep something like the requeue-after context. I'd feel more comfortable with longer backoffs (e.g. multiple minutes) if we could provide users with context as to how long our backoff was and why.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unfortunately we don't have access to the rate limiting interface within the reconciler (it is set at the controller level) and the controller-runtime default actually doesn't allow any sort of "peek" method that doesn't also increment the failure count used for backoff. This was part of my motivation for having a "layer 2" rate limiter because we could:

  1. Set a default without folks having to explicitly pass it in to the controller.
  2. Decouple the external API backoff from the k8s API server one.
  3. Do things like log what the backoff value is currently at.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to tackle this (someday) at the controller-runtime level if possible. It would be nice if, for example, the controller.Manager had a method we could call to ask what the soonest time we'd be requeued at was.

record.Event(managed, event.Warning(reasonCannotInitialize, err))
managed.SetConditions(xpv1.ReconcileError(err))
return reconcile.Result{RequeueAfter: r.shortWait}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we be able to just return reconcile.Result, err in these situations if we deprecated the Synced status per #198? Just curious - we don't need to bring that into scope with this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we still need to call the status update in some scenarios (such as #242). One thing we could do it concatenate the err from the method that was called with the status update. I played around with this a bit initially though and it felt pretty sloppy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's true. We also need to assume that at the Observe method (and possibly also Create and Update) could be mutating the managed resource's status.

That said, perhaps it's acceptable for us to only persist these status updates in the happy path? i.e. In the return statements that follow any place we'd currently log "Successfully Xed the external resource". When we hit an error case we'd return the error without persisting the managed resource's status. When we didn't hit an error we'd call r.client.Status().Update() as we do today, and return that error if there was one.

Without having thought through it too much I feel like this is probably acceptable? Managed resource viewers should be able to tell that we're trying to create or delete a managed resource because we'll be emitting warning events like CannotCreateExternalResource. Any status fields that would be set by the ExternalClient methods would presumably just get set once the error cleared and we were able to proceed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it's acceptable for us to only persist these status updates in the happy path? i.e. In the return statements that follow any place we'd currently log "Successfully Xed the external resource". When we hit an error case we'd return the error without persisting the managed resource's status.

Would that result in errors not appearing in conditions, hence shown only as event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@muvaf yep, I don't think we can do this currently

@hasheddan hasheddan marked this pull request as ready for review February 15, 2021 20:48
@hasheddan hasheddan changed the title Allow managed reconciler to make use of rate limiting if enabled Make use of rate limiting in managed reconciler Feb 15, 2021
@hasheddan
Copy link
Member Author

Example usage: crossplane-contrib/provider-aws#544

Copy link
Member

@muvaf muvaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for tackling this long-running issue @hasheddan ! Solid solution.

// after a specified duration when it is not actively waiting for an external
// operation, but wishes to check whether an existing external resource needs to
// be synced to its Crossplane Managed resource.
func WithpollInterval(after time.Duration) ReconcilerOption {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func WithpollInterval(after time.Duration) ReconcilerOption {
func WithPollInterval(after time.Duration) ReconcilerOption {

record.Event(managed, event.Warning(reasonCannotInitialize, err))
managed.SetConditions(xpv1.ReconcileError(err))
return reconcile.Result{RequeueAfter: r.shortWait}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it's acceptable for us to only persist these status updates in the happy path? i.e. In the return statements that follow any place we'd currently log "Successfully Xed the external resource". When we hit an error case we'd return the error without persisting the managed resource's status.

Would that result in errors not appearing in conditions, hence shown only as event?

Updates managed reconciler to return Requeue: true instead of
RequeueAfter when encountering non-status update errors. This allows the
reconciler to make use of the controller rate limiter rather use a
constant value for requeuing after errors.

This also renames longWait to pollInterval to more accurately reflect
its behavior.

Signed-off-by: hasheddan <[email protected]>
Adds default rate limiters to be used at both the provider and
controller level. The provider rate limiter is a configurable token
bucket and the controller limiter is a max of limiter that uses the
provider limiter and a per-item exponential backoff.

Signed-off-by: hasheddan <[email protected]>
Copy link
Member Author

@hasheddan hasheddan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@negz I adjusted the managed reconciler comments as you pointed out. Would love for you to get a final look at this before merge.

// registered with a controller manager. The bucket size is a linear function of
// the requeues per second.
func NewDefaultProviderRateLimiter(rps int) *workqueue.BucketRateLimiter {
return &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(rps), rps*10)}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 10 as a linear constant here to determine bucket size is likely not suitable for every provider. However, providers may use their own BucketRateLimiter with a different value. Using the number of controllers under management is not an awful proxy, but it really depends on actual runtime number of objects / user preference. I feel pretty good about having this as the default and letting providers adjust / expose configuration as needed.

Note: if a user wanted a strict cap of requeues per second they would likely need to set bucket size = 1 here.

pkg/ratelimiter/default.go Outdated Show resolved Hide resolved
pkg/reconciler/managed/reconciler.go Outdated Show resolved Hide resolved
Updates comments in managed reconciler to indicate requeues are not tied
to short wait but instead are explicit and trigger the configured
backoff strategy.

Signed-off-by: hasheddan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants