Resiliency for Components #1863

halspang · 2022-07-07T22:24:25Z

Describe the proposal

With Resiliency being introduced into the Dapr runtime in 1.7 (as a preview feature), we should now consider how Resiliency interacts with components and if there is a way to get them to utilize the runtime's resiliency.

Tenets

Components must be able to interact with Dapr's resiliency
Runtime resiliency must not weaken/remove any component resiliency without a direct replacement
Components must be able to onboard over time
Components should be able to advertise, and be recognized for, having implemented resiliency

Background

Components across the various building blocks offer varying levels of retry support. Some components, such as the Redis state store, offer both retries and timeouts. Others offer only a retry toggle, like the Kafka pubsub, which can enable consumption retries. Further still, some components don't offer traditional retries at all; instead, the service they interact with has its own semantic version of retries. A prime example of this is the redelivery count for Azure Service Bus. Exhausting the redelivery count can even cause messages to enter a Dead Letter Queue, which is another avenue of resiliency.

The moral of the story here is that component level retries are vast and varied. What works for one component may not fully cover another, or worse, may break some semantic functionality.

An Important Distinction

Pubsub redelivery (and by extension message binding redelivery) was briefly touched on above. It is worth calling out directly, though, because it behaves differently from standard retries. Specifically, a redelivered message can land on a different host. This is a benefit of the pubsub architecture: it makes you resilient to host failures. Relying on runtime retries alone would pin the message to a single host which, if that host were dead, would reduce the delivery rate of your messages. As such, this behavior should be strongly considered and/or maintained when interacting with runtime resiliency.

Impact on Stable Certification

Handling resiliency correctly should become a future bar for becoming a stable component. Any component that is already stable will have it added over time, similar to how we handled certification tests.

Solution

The crux of this design will be a new interface that components will be expected to implement in order to be considered "resilient". The interface would define a few methods for resilient operation:

type Resilient interface {
    // Determine if resiliency should be used in place of component retries and disable/enable as needed.
    modifyComponentBehaviorForResiliency(metadata map[string]string) error
}

By making this into an interface, it forces the implementer to consider the most important point of a resilient component, which is how component retries and Dapr retries interact.

It was considered to add a method to the interface that would translate the component errors into a permanent/retryable error which the runtime understands. This was decided against because, upon initial inspection, there are fairly few errors that are actually permanent. Generally, these are bad requests, unauthenticated (handled on init) or perhaps some very specific errors (e.g. etags). With these cases, it's easier to just wrap the error in the code with backoff.Permanent(error).

This also allows for Dapr to determine if a component is resilient or not via type inspection.
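As a rough illustration of that type inspection, the sketch below uses a plain type assertion rather than reflection. The component types `fakeStore` and `legacyStore` and the helper `isResilient` are hypothetical names, not part of Dapr. The method is also exported here, which is an assumption that departs from the proposal: in Go, an interface with an unexported method can only be satisfied by types declared in the same package as the interface, so components living in other packages could not implement it as written.

```go
package main

import "fmt"

// Resilient mirrors the interface proposed above. The method name is exported
// in this sketch (an assumption) so that types in other packages could satisfy it.
type Resilient interface {
	ModifyComponentBehaviorForResiliency(metadata map[string]string) error
}

// fakeStore is a hypothetical component that has onboarded to runtime resiliency.
type fakeStore struct{}

func (fakeStore) ModifyComponentBehaviorForResiliency(metadata map[string]string) error {
	// Turn off the component's own retries so the runtime policy takes over.
	metadata["maxRetries"] = "0"
	return nil
}

// legacyStore is a hypothetical component that has not onboarded yet.
type legacyStore struct{}

// isResilient reports whether a component implements Resilient.
func isResilient(component interface{}) bool {
	_, ok := component.(Resilient)
	return ok
}

func main() {
	fmt.Println(isResilient(fakeStore{}))   // true
	fmt.Println(isResilient(legacyStore{})) // false
}
```

Because the check is purely structural, components can onboard one at a time: a component that does not implement the interface simply keeps its existing behavior.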

Example Component

Using Redis as an example, you would see changes like this:

// Removing the component retry/timeout logic (triggered by property field "daprResiliencyPolicyDefined").
func (r *StateStore) modifyComponentBehaviorForResiliency(metadata map[string]string) error {
	if _, ok := metadata[maxRetries]; ok {
		metadata[maxRetries] = "0"
	}

	if _, ok := metadata[maxRetryBackoff]; ok {
		metadata[maxRetryBackoff] = "-1"
	}

	if _, ok := metadata[readTimeout]; ok {
		metadata[readTimeout] = "-1"
	}

	if _, ok := metadata[writeTimeout]; ok {
		metadata[writeTimeout] = "-1"
	}

	// This is an alias for maxRetries, but it has to be checked as well.
	if _, ok := metadata[redisMaxRetries]; ok {
		metadata[redisMaxRetries] = "0"
	}

	return nil
}

// Example of returning a permanent (non-retryable) error.
func (r *StateStore) setValue(req *state.SetRequest) error {
	err := state.CheckRequestOptions(req.Options)
	if err != nil {
		return err
	}
	ver, err := r.parseETag(req)
	if err != nil {
		return err
	}
	ttl, err := r.parseTTL(req)
	if err != nil {
		// We will always fail to parse a malformed request.
		return backoff.Permanent(fmt.Errorf("failed to parse ttl from metadata: %s", err))
	}
	// Remainder of function excluded.
}

Example Runtime Init

props := a.convertMetadataItemsToProperties(s.Spec.Metadata)

// Only tell the component to turn off its built-in retries if a resiliency policy is defined for this component.
compType := reflect.TypeOf(store)
if a.resiliency.PolicyDefined(s.Name, resiliency.Component) && compType.Implements(reflect.TypeOf((*Resilient)(nil)).Elem()) {
	props["daprResiliencyPolicyDefined"] = "true"
}

err = store.Init(state.Metadata{
	Properties: props,
})
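The proposal does not show the component side of this handshake, so here is a minimal sketch of how a component's Init might react to the "daprResiliencyPolicyDefined" property. The `store` type, its single `maxRetries` field, and the simplified `Init` signature taking a plain map are all assumptions made for illustration; a real component would receive its own metadata type.

```go
package main

import "fmt"

const (
	policyDefinedKey = "daprResiliencyPolicyDefined" // property name from the proposal
	maxRetriesKey    = "maxRetries"                  // assumed metadata key
)

// store is a hypothetical component with a single retry setting.
type store struct {
	maxRetries string
}

// modifyComponentBehaviorForResiliency disables the component's own retries.
func (s *store) modifyComponentBehaviorForResiliency(metadata map[string]string) error {
	if _, ok := metadata[maxRetriesKey]; ok {
		metadata[maxRetriesKey] = "0"
	}
	return nil
}

// Init adjusts the metadata only when the runtime signals that a policy exists.
func (s *store) Init(props map[string]string) error {
	if props[policyDefinedKey] == "true" {
		if err := s.modifyComponentBehaviorForResiliency(props); err != nil {
			return fmt.Errorf("failed to adjust retries for resiliency: %w", err)
		}
	}
	s.maxRetries = props[maxRetriesKey]
	return nil
}

func main() {
	withPolicy := &store{}
	_ = withPolicy.Init(map[string]string{maxRetriesKey: "3", policyDefinedKey: "true"})

	withoutPolicy := &store{}
	_ = withoutPolicy.Init(map[string]string{maxRetriesKey: "3"})

	fmt.Println(withPolicy.maxRetries, withoutPolicy.maxRetries) // prints "0 3"
}
```

Without the flag the component keeps its configured retries, which matches the tenet that runtime resiliency must not remove component resiliency without a direct replacement.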


halspang · 2022-08-02T16:37:53Z

@berndverst - Any thoughts on this?

berndverst · 2022-08-11T23:55:02Z

LGTM

Agree that in the future implementation of Resiliency should be a requirement for stable component certification.

So every error not explicitly returned as backoff.Permanent is considered retriable? Sounds good, we will need to evaluate what errors are really permanent and which ones aren't. Could be handy for Azure components for example to create a utility which parses the internal status code and returns backoff.Permanent if the internal code was certain 4XX status codes.

Also as discussed initResiliency(metadata map[string]string) error is a terrible name. It makes more sense to call this modifyComponentBehaviorForResiliency or something like that. The purpose is to disable / tweak component specific retries that are controlled via component metadata so that the resiliency policy can work its magic.

halspang · 2022-08-12T16:47:36Z

> So every error not explicitly returned as backoff.Permanent is considered retriable? Sounds good, we will need to evaluate what errors are really permanent and which ones aren't. Could be handy for Azure components for example to create a utility which parses the internal status code and returns backoff.Permanent if the internal code was certain 4XX status codes.

A lot of this proposal is about forcing people to think about the individual component and what it means to be retryable. I think components like Azure (and probably GCP/AWS), whose SDKs share a lot in common, will take well to a common utility file.

> Also as discussed initResiliency(metadata map[string]string) error is a terrible name. It makes more sense to call this modifyComponentBehaviorForResiliency or something like that. The purpose is to disable / tweak component specific retries that are controlled via component metadata so that the resiliency policy can work its magic.

Yeah, I'm not sure what my initial thought was on the naming there. Updated it in the proposal.

dapr-bot · 2022-09-11T17:00:33Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot · 2022-09-18T17:01:03Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

berndverst · 2023-10-31T16:19:00Z

This will need design work for runtime, and then refactorings of every single component which is a lot of work - similar to the effort to do stable certification of components.

halspang mentioned this issue Aug 2, 2022

Tracking issue: Disable built-in retries in components #1797

Open

4 tasks

halspang mentioned this issue Aug 11, 2022

Proposal: Clarification on criteria for stable component status and deprecation of "stable candidate" term. dapr/dapr#4970

Closed

dapr-bot added the stale label Sep 11, 2022

dapr-bot closed this as completed Sep 18, 2022

mukundansundar removed the stale label Sep 18, 2022

mukundansundar reopened this Sep 18, 2022

artursouza added P1 pinned Issue does not get stale labels Sep 19, 2022

berndverst self-assigned this Nov 17, 2022

berndverst mentioned this issue Jan 4, 2023

Inconsistent retry behavior between pubsub providers #2204

Closed

mukundansundar mentioned this issue Jun 27, 2023

v1.12 Release Planning dapr/dapr#6508

Closed

mukundansundar added this to v1.12 Release Tracking Board Jul 5, 2023

mukundansundar unassigned berndverst Jul 18, 2023

berndverst mentioned this issue Aug 14, 2023

[Feature] Add Timing of Re-Deliveries When Messages Are Negatively Responded On JetStream #3079

Closed

berndverst added this to the v1.13 milestone Oct 31, 2023

berndverst added the Epic label Oct 31, 2023

berndverst mentioned this issue Nov 20, 2023

Inline resiliency policies for PubSub Subscription dapr/proposals#46

Open

ItalyPaleAle modified the milestones: v1.13, v1.14 Feb 26, 2024

mikeee mentioned this issue Mar 19, 2024

v1.14 Release Planning dapr/dapr#7605

Closed

43 tasks

berndverst removed this from the v1.14 milestone Jul 2, 2024

berndverst added this to the v1.15 milestone Jul 2, 2024
