
Use Polly to perform retries on failed event processing #123

Merged: 13 commits merged into master from personal/rogordon/fixRetryForeverBug on Jul 13, 2021

Conversation

@rogordon01 (Contributor):

Currently, when an event encounters an error, the entire batch is retried forever. This PR uses Polly to perform the retry logic; only a specific set of exceptions will trigger a retry.
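
The core of the change, quoted in the review threads below, pairs an exception filter with Polly's WaitAndRetryForeverAsync and an exponential back-off capped at 30 seconds. A minimal sketch of that pattern (not the exact PR code; ProcessEventsAsync is a hypothetical placeholder for the batch-processing delegate):

using System;
using Polly;
using Polly.Retry;

bool ExceptionRetryableFilter(Exception ee)
{
    // The real filter switches over specific exception types; see the review threads below.
    return true;
}

// Retry only exceptions accepted by the filter, waiting 2^retryCount seconds
// between attempts, capped at 30 seconds.
AsyncRetryPolicy retryPolicy = Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));

// Hypothetical usage: ProcessEventsAsync stands in for the actual batch processing.
await retryPolicy.ExecuteAsync(() => ProcessEventsAsync());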

@rogordon01 requested a review from a team as a code owner on July 9, 2021 22:24
Comment on lines 78 to 80
case HttpRequestException _:
case EventHubsException _:
case RequestFailedException _:
@rogordon01 (Contributor, Author) commented on Jul 9, 2021:

These should retry on all Http, EventHub, Azure Identity and BlobStorage related exceptions.
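
Pieced together from the snippets quoted in this PR (with the logging simplified, and assuming a logger in scope), the filter looks roughly like this; anything not listed falls to the default branch and is not retried:

bool ExceptionRetryableFilter(Exception ee)
{
    switch (ee)
    {
        case HttpRequestException _:          // System.Net.Http
        case EventHubsException _:            // Event Hubs SDK (see the namespace discussion below)
        case AuthenticationFailedException _: // Azure.Identity
        case RequestFailedException _:        // Azure SDK clients, including Blob Storage
            break;
        default:
            return false;                     // unlisted exception types are not retried
    }

    logger.LogError(ee);                      // simplified; see the logging thread below
    return true;                              // signal Polly to retry
}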

@dustinburson (Member) left a comment:

Please make this an option, to avoid impacting Normalization to FHIR, and add telemetry.

@rogordon01 force-pushed the personal/rogordon/fixRetryForeverBug branch from 173e4d4 to df6f053 on July 12, 2021 21:29
@dustinburson (Member) left a comment:

The changes in the normalize processor look good. The ask is to keep the existing retry logic in the EventConsumerService but make it optional: off for normalization and on for FHIR conversion (or you could look at moving it to the processor for the FHIR converter).
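
One hypothetical way to express that ask (the flag name and wiring are illustrative, not the PR's actual implementation) is an opt-in switch on EventConsumerService so the retry stays on for FHIR conversion and off for normalization:

// Hypothetical sketch only; the PR may wire this differently.
public class EventConsumerService
{
    private readonly bool _retryOnFailure;

    public EventConsumerService(bool retryOnFailure)
    {
        // Pass true for the FHIR conversion pipeline, false for normalization.
        _retryOnFailure = retryOnFailure;
    }

    // In the consume path, wrap the existing processing in the retry loop only when
    // _retryOnFailure is true; otherwise process once and surface failures to the caller.
}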

case HttpRequestException _:
case EventHubsException _:
case AuthenticationFailedException _:
case RequestFailedException _:
Reviewer (Contributor) commented:

Since we are calling the Azure.Messaging.EventHubs.Producer.EventHubProducerClient downstream, I believe the Producer will throw an Azure.Messaging.EventHubs.EventHubsException instead of a Microsoft.Azure.EventHubs.EventHubsException (legacy).

I think this should be updated to reference the new SDK exception (or both). Feel free to double-check. I tested by changing my Normalization Event Hub to something bogus in order to log an error.
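
For reference, the two SDKs define exceptions with the same name in different namespaces, so the filter needs to reference the one the new producer client actually throws. A fragment sketch, assuming the project references both packages while the migration is pending:

// New SDK type, thrown by Azure.Messaging.EventHubs.Producer.EventHubProducerClient:
using EventHubsException = Azure.Messaging.EventHubs.EventHubsException;
// Legacy SDK type, still referenced elsewhere in the project:
using LegacyEventHubsException = Microsoft.Azure.EventHubs.EventHubsException;

// With the aliases above, the switch-based filter can match the new type (or both):
//   case EventHubsException _:
//   case LegacyEventHubsException _:
//       break;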

@rogordon01 (Contributor, Author) commented on Jul 13, 2021:

Thanks for pointing that out. Does this entire class need to be updated to use the new API?

@rogordon01 (Contributor, Author) commented:

Actually, I see that other interfaces we created use the old API. It's a bit confusing... are there plans to update the project to use a single API?

Reviewer (Contributor) commented:

It would be good to transition to the new one, but that should probably be its own work item. It may be harder than just doing a find and replace.

Reviewer (Contributor) commented:

Question: when an EventHubsException occurs, would we expect ExceptionRetryableFilter to return true and retry?

For some reason, when I intentionally log an EventHubsException the code still seems to hit the default block in that switch statement and return false. I am wondering if it is because it is a System.Exception {Azure.Messaging.EventHubs.EventHubsException}, or something else going on with the switch statement, or if I am misunderstanding how ExceptionRetryableFilter works. I was expecting ExceptionRetryableFilter to hit the block of code at the end that returns true, so that it retries.

Reviewer (Contributor) commented:

Never mind; it works fine after pulling in the latest changes.

@rogordon01 (Contributor, Author) commented on Jul 13, 2021 via email.

{
    bool ExceptionRetryableFilter(Exception ee)
    {
        logger.LogError(new Exception("Encountered retryable exception", ee));
Reviewer (Member) commented:

logger.LogError(new Exception("Encountered retryable exception", ee));

Would suggest just logging the original exception. We only get the first inner exception in the logs, and if the original exception is already wrapped we will miss the details. I like the message though; can we switch the message to info? i.e. "Encountered retryable exception {0}", ee.GetType()
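
In code, the suggestion reads roughly as below; the info-level method name is an assumption here, so use whatever the project's telemetry logger actually exposes:

// Sketch of the suggested change: log the original exception so details of an
// already-wrapped exception are not lost, and report the type at info level.
logger.LogError(ee);                                                         // original exception, not re-wrapped
logger.LogInformation("Encountered retryable exception {0}", ee.GetType()); // info-level method name is assumed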

@rogordon01 (Contributor, Author) commented:

Will do


return Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
Reviewer (Member) commented:

30

30 seconds may be too long for an initial retry. Would suggest 5 to 10.

@rogordon01 (Contributor, Author) commented on Jul 13, 2021:

The time between retries will be the lower of 30 seconds and the result of the Math.Pow function. Since the Pow function is based on the retry count, the initial retry should be 2 ^ 1 = 2 seconds.
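
A quick check of the schedule that formula produces:

// Delays produced by TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))):
for (int retryCount = 1; retryCount <= 6; retryCount++)
{
    double seconds = Math.Min(30, Math.Pow(2, retryCount));
    Console.WriteLine($"retry {retryCount}: wait {seconds} s");
}
// Prints 2, 4, 8, 16, 30, 30: every retry after the fourth stays capped at 30 seconds.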

@rogordon01 (Contributor, Author) commented:

Or do you mean that we should lower the maximum time between all retries to 10 seconds from 30?

Reviewer (Member) commented:

I misread the code line. Thanks for the clarification. I think we are good here.


return Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
Reviewer (Member) commented:

Math.Pow(2, retryCount)

We will probably want to cap the max retry time. Say 10 minutes?

@rogordon01 (Contributor, Author) commented on Jul 13, 2021:

Just to confirm, you're saying that we should no longer retry forever, as we were doing before, and instead limit the overall retry operation to 10 minutes?
If that is the case, should we also limit the overall retry operation in the Normalization Processor?

Reviewer (Member) commented:

No, sorry, the intent is to cap the timespan for the exponential back-off. I misread the code; I think we are covered. Based on your comment above, the longest we will wait between retries is 30 seconds.

@dustinburson (Member) left a comment:

Looks good. Added some suggestions.

@@ -34,15 +38,51 @@ public class Processor : IEventConsumer
            _templateManager = templateManager;
            _measurementImportService = measurementImportService;
            _logger = logger;
            _retryPolicy = CreateRetryPolicy(logger);
        }

        public async Task ConsumeAsync(IEnumerable<IEventMessage> events)
        {
            EnsureArg.IsNotNull(events);
            EnsureArg.IsNotNull(_templateDefinition);
Reviewer (Member) commented:

If the _templateDefinition is null, I think we will end up with the 100% CPU issue here. I think if you do something similar to what you did in the Normalize Processor, that would address this issue.

@rogordon01 (Contributor, Author) commented:

The EnsureArg check is called outside of the retryPolicy.ExecuteAsync, so throwing an exception here shouldn't cause us to retry forever.
I do, however, see the EnsureArg check being made inside of the retryPolicy.ExecuteAsync in the Normalization Processor. I'll adjust that.
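
A sketch of that adjustment, following the names in the quoted diff (ProcessEventsAsync is a hypothetical placeholder for the actual conversion work):

public async Task ConsumeAsync(IEnumerable<IEventMessage> events)
{
    // Validate before entering the retry policy so a bad argument fails fast
    // instead of being retried forever.
    EnsureArg.IsNotNull(events);
    EnsureArg.IsNotNull(_templateDefinition);

    await _retryPolicy.ExecuteAsync(async () =>
    {
        await ProcessEventsAsync(events); // hypothetical placeholder
    });
}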

@rogordon01 (Contributor, Author) commented:

@wi-y Any particular reason why we check the validity of the _templateDefinition outside of the constructor? From the looks of it, if it's null/empty there is never an opportunity to update it.

@rogordon01 merged commit e863902 into master on Jul 13, 2021
@rogordon01 deleted the personal/rogordon/fixRetryForeverBug branch on July 13, 2021 22:46