Use Polly to perform retries on failed event processing #123
Conversation
case HttpRequestException _:
case EventHubsException _:
case RequestFailedException _:
These should retry on all HTTP, Event Hubs, Azure Identity, and Blob Storage related exceptions.
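As a rough sketch, the filter might map those categories onto exception types like this (the exact set, and the use of AuthenticationFailedException and RequestFailedException to cover Azure Identity and Blob Storage, is an assumption based on this thread rather than the final code):

using System;
using System.Net.Http;
using Azure;                       // RequestFailedException (Blob Storage and other Azure SDK failures)
using Azure.Identity;              // AuthenticationFailedException
using Azure.Messaging.EventHubs;   // EventHubsException (new SDK)

static bool IsRetryableSketch(Exception ex)
{
    switch (ex)
    {
        case HttpRequestException _:
        case EventHubsException _:
        case AuthenticationFailedException _:
        case RequestFailedException _:
            return true;
        default:
            return false;
    }
}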
Please make this an option to avoid impacting Normalization to FHIR, and add telemetry.
The changes in the normalize processor look good. The ask is to keep the existing retry logic in the EventConsumerService but make it optional and off for normalization and on for FHIR conversion (or you could look at moving it to the processor for the FHIR converter).
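One illustrative way to do that is to gate the existing policy behind a constructor flag (the enableRetry flag, ProcessAsync helper, and ITelemetryLogger parameter below are placeholders, not the actual change):

// Sketch: keep the Polly retry in EventConsumerService but make it optional,
// so normalization can run without it while FHIR conversion keeps it on.
public class EventConsumerService
{
    private readonly bool _enableRetry;
    private readonly AsyncRetryPolicy _retryPolicy;

    public EventConsumerService(bool enableRetry, ITelemetryLogger logger)
    {
        _enableRetry = enableRetry;
        _retryPolicy = CreateRetryPolicy(logger);
    }

    private async Task ConsumeEventAsync(IEventMessage evt)
    {
        if (_enableRetry)
        {
            await _retryPolicy.ExecuteAsync(() => ProcessAsync(evt));
        }
        else
        {
            await ProcessAsync(evt);
        }
    }
}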
case HttpRequestException _:
case EventHubsException _:
case AuthenticationFailedException _:
case RequestFailedException _:
Since we are calling the Azure.Messaging.EventHubs.Producer.EventHubProducerClient downstream, I believe the Producer will throw an Azure.Messaging.EventHubs.EventHubsException instead of a Microsoft.Azure.EventHubs.EventHubsException (legacy).
I think this should be updated to reference the new SDK exception (or both). Feel free to double check. I tested by changing my Normalization Event Hub to something bogus in order to log an error.
Thanks for pointing that out. Does this entire class need to be updated to use the new API?
Actually, I see that other interfaces we created use the old API. It's a bit confusing... are there plans to update the project to use a single API?
It would be good to transition to the new one, but that should probably be its own work item. It may be harder than just doing a find and replace.
Question - when an EventHubsException occurs, would we expect ExceptionRetryableFilter to return true and retry?
For some reason, when I intentionally log an EventHubsException, the code still seems to hit the default block in that switch statement and returns false. I am wondering if it is because it is a System.Exception {Azure.Messaging.EventHubs.EventHubsException}, or if something else is going on with the switch statement, or if I am misunderstanding how ExceptionRetryableFilter works. I was thinking ExceptionRetryableFilter would hit the block of code at the end that returns true so that it retries.
Never mind - it works fine after pulling in the latest changes.
{
    bool ExceptionRetryableFilter(Exception ee)
    {
        logger.LogError(new Exception("Encountered retryable exception", ee));
logger.LogError(new Exception("Encountered retryable exception", ee));
Would suggest just logging the original exception. We only get the first inner exception in the logs, and if the original exception is already wrapped we will miss the details. I like the message though; can we switch the message to info? i.e. "Encountered retryable exception {0}", ee.GetType()
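Roughly what that suggestion could look like (the LogTrace/LogError signatures below are assumptions about the logger's API, not the actual change):

bool ExceptionRetryableFilter(Exception ee)
{
    // Report the retryable type at an informational level, as suggested above...
    logger.LogTrace($"Encountered retryable exception {ee.GetType()}");

    // ...and log the original exception unwrapped so inner exception details are preserved.
    logger.LogError(ee);

    return true; // the real filter would still decide retryability per exception type
}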
Will do
return Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
The time between retries will be the lower of 30 seconds and the result of the Math.Pow function. Since the Pow function is based on the retry count, the initial retry should be 2 ^ 1 = 2 seconds.
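For reference, a quick illustrative snippet (not part of the change) showing the delays that expression produces:

// Prints the wait before each retry: 2, 4, 8, 16, 30, 30, 30 seconds (capped at 30).
using System;

for (int retryCount = 1; retryCount <= 7; retryCount++)
{
    var delay = TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount)));
    Console.WriteLine($"retry {retryCount}: {delay.TotalSeconds}s");
}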
Or do you mean that we should lower the maximum time between all retries to 10 seconds from 30?
I misread the code line. Thanks for the clarification. I think we are good here.
return Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
Just to confirm, you're saying that we should no longer retry forever, as we were doing before, and instead limit the overall retry operation to 10 minutes?
If that is the case, should we also limit the overall retry operation in the Normalization Processor?
No, sorry - the intent is to cap the timespan for the exponential backoff. I misread the code; I think we are covered. Based on your comment above, the longest we will wait for a retry is 30 seconds.
Looks good. Added some suggestions.
@@ -34,15 +38,51 @@ public class Processor : IEventConsumer
    _templateManager = templateManager;
    _measurementImportService = measurementImportService;
    _logger = logger;
    _retryPolicy = CreateRetryPolicy(logger);
}

public async Task ConsumeAsync(IEnumerable<IEventMessage> events)
{
    EnsureArg.IsNotNull(events);
    EnsureArg.IsNotNull(_templateDefinition);
If the _templateDefinition is null, I think we will end up with the 100% CPU issue here. I think if you do something similar to what you did in the Normalize Processor, that would address this issue.
The EnsureArg is called outside of the retryPolicy.ExecuteAsync, so throwing an exception here shouldn't cause us to retry forever.
I do, however, see the EnsureArg check being made inside of the retryPolicy.ExecuteAsync in the Normalization Processor. I'll adjust that.
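Roughly the pattern being described, with the argument checks moved ahead of the retry (ProcessEventsAsync is a placeholder for whatever the processor actually does per batch):

public async Task ConsumeAsync(IEnumerable<IEventMessage> events)
{
    // Fail fast on bad arguments instead of retrying them forever.
    EnsureArg.IsNotNull(events, nameof(events));
    EnsureArg.IsNotNull(_templateDefinition, nameof(_templateDefinition));

    // Only the transient, retryable work runs under the retry policy.
    await _retryPolicy.ExecuteAsync(() => ProcessEventsAsync(events));
}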
@wi-y Any particular reason why we check the validity of the _templateDefinition outside of the constructor? From the looks of it, if it's null/empty there is never an opportunity to update it.
Currently, when an event incurs an error, the entire batch is retried forever. This PR utilizes Polly to perform the retry logic. Only a specific set of exceptions will be allowed to retry.
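Put together, the flow is roughly the following (a condensed sketch assembled from the snippets in the review; the class shape and the abbreviated exception list are illustrative):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Polly;
using Polly.Retry;

public class RetryingConsumerSketch
{
    private readonly AsyncRetryPolicy _retryPolicy;

    public RetryingConsumerSketch()
    {
        // Exponential backoff (2, 4, 8, 16 seconds...) capped at 30 seconds, retried indefinitely.
        _retryPolicy = Policy
            .Handle<Exception>(IsRetryable)
            .WaitAndRetryForeverAsync(retryCount =>
                TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
    }

    // Only specific exception types are retried (see the filter discussion above);
    // anything else surfaces immediately instead of blocking the batch forever.
    private static bool IsRetryable(Exception ex) =>
        ex is HttpRequestException || ex is EventHubsException;

    public Task ConsumeAsync(Func<Task> processEventBatch) =>
        _retryPolicy.ExecuteAsync(processEventBatch);
}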