-
Notifications
You must be signed in to change notification settings - Fork 487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resiliency for Components #1863
Comments
@berndverst - Any thoughts on this? |
LGTM Agree that in the future implementation of Resiliency should be a requirement for stable component certification. So every error not explicitly returned as Also as discussed |
A lot of this proposal is about forcing people to think about the individual component and what it means to be retryable. I think for components like Azure (and probably GCP/AWS) where they have a commonality in the SDKs will take to a common utility file well.
Yeah, I'm not sure what my initial thought was on the naming there. Updated it in the proposal. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions. |
This will need design work for runtime, and then refactorings of every single component which is a lot of work - similar to the effort to do stable certification of components. |
Describe the proposal
With Resiliency being introduced into the Dapr runtime in 1.7 (as a preview feature), we should now consider how Resiliency interacts with components and if there is a way to get them to utilize the runtime's resiliency.
Tenets
Background
Components across the various building blocks have varying levels of retries. Some components, such as the Redis statestore offer both retries and timeouts. Other components offer just retry toggling like the Kafka pubsub which can enable consumption retries. Further still, some components don't offer traditional retries but rather the service they interact with has their own semantic version of retries. A prime example of this is redelivery count for Azure Service Bus. Exhausting the redelivery count can even cause messages to enter a Dead Letter Queue, which is another avenue of resiliency.
The moral of the story here is that component level retries are vast and varied. What works for one component may not fully cover another, or worse, may break some semantic functionality.
An Important Distinction
Pubsub redelivery (and by extension message binding redelivery) was briefly touched on above. It is worth calling out directly though because it has a distinct difference in functionality as compared to standard retries. Specifically, a redelivery of a message can result in a different host. This is a benefit of the pubsub architecture in that you can be resilient to host failures. Utilizing just the runtime retries here would mean you stuck to a host which, if it was dead, would reduce the delivery rate of your messages. As such, this is something that should be strongly considered and/or maintained while interacting with runtime resiliency.
Impact on Stable Certification
Handling resiliency correctly should become a future bar for becoming a stable component. Any component that is already stable will have it added over time, similar to how we handled certification tests.
Solution
The crux of this design will be a new interface that components will be expected to implement in order to be considered "resilient". The interface would define a few methods for resilient operation:
By making this into an interface, it forces the implementer to consider the most important point of a resilient component which is how component retries and dapr retries.
It was considered to add a method to the interface that would translate the component errors into a permanent/retryable error which the runtime understands. This was decided against because, upon initial inspection, there are fairly few errors that are actually permanent. Generally, these are bad requests, unauthenticated (handled on init) or perhaps some very specific errors (e.g. etags). With these cases, it's easier to just wrap the error in the code with
backoff.Permanent(error)
.This also allows for Dapr to determine if a component is resilient or not via type inspection.
Example Component
Using Redis as an example, you would see changes like this:
Example Runtime Init
The text was updated successfully, but these errors were encountered: