Use Polly to perform retries on failed event processing #123
Conversation
case HttpRequestException _:
case EventHubsException _:
case RequestFailedException _:
These should retry on all HTTP, Event Hubs, Azure Identity, and Blob Storage related exceptions.
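As a rough sketch, the filter might map those categories onto exception types like this (the exact set, and the use of AuthenticationFailedException and RequestFailedException to cover Azure Identity and Blob Storage, is an assumption based on this thread rather than the final code):

using System;
using System.Net.Http;
using Azure;                       // RequestFailedException (Blob Storage and other Azure SDK failures)
using Azure.Identity;              // AuthenticationFailedException
using Azure.Messaging.EventHubs;   // EventHubsException (new SDK)

static bool IsRetryableSketch(Exception ex)
{
    switch (ex)
    {
        case HttpRequestException _:
        case EventHubsException _:
        case AuthenticationFailedException _:
        case RequestFailedException _:
            return true;
        default:
            return false;
    }
}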
Please make this an option to avoid impacting Normalization to FHIR, and add telemetry.
The changes in the normalize processor look good. The ask is to keep the existing retry logic in the EventConsumerService but make it optional and off for normalization and on for FHIR conversion (or you could look at moving it to the processor for the FHIR converter).
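One illustrative way to do that is to gate the existing policy behind a constructor flag (the enableRetry flag, ProcessAsync helper, and ITelemetryLogger parameter below are placeholders, not the actual change):

// Sketch: keep the Polly retry in EventConsumerService but make it optional,
// so normalization can run without it while FHIR conversion keeps it on.
public class EventConsumerService
{
    private readonly bool _enableRetry;
    private readonly AsyncRetryPolicy _retryPolicy;

    public EventConsumerService(bool enableRetry, ITelemetryLogger logger)
    {
        _enableRetry = enableRetry;
        _retryPolicy = CreateRetryPolicy(logger);
    }

    private async Task ConsumeEventAsync(IEventMessage evt)
    {
        if (_enableRetry)
        {
            await _retryPolicy.ExecuteAsync(() => ProcessAsync(evt));
        }
        else
        {
            await ProcessAsync(evt);
        }
    }
}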
case HttpRequestException _:
case EventHubsException _:
case AuthenticationFailedException _:
case RequestFailedException _:
Since we are calling the Azure.Messaging.EventHubs.Producer.EventHubProducerClient downstream, I believe the Producer will throw an Azure.Messaging.EventHubs.EventHubsException instead of a Microsoft.Azure.EventHubs.EventHubsException (legacy).
I think this should be updated to reference the new SDK exception (or both). Feel free to double check. I tested by changing my Normalization Event Hub to something bogus in order to log an error.
Thanks for pointing that out. Does this entire class need to be updated to use the new API?
Actually, I see that other interfaces we created use the old API. It's a bit confusing... are there plans to update the project to use a single API?
It would be good to transition to the new one, but that should probably be its own work item. It may be harder than just doing a find and replace.
Question - when an EventHubsException occurs, would we expect ExceptionRetryableFilter to return true and retry?
For some reason, when I intentionally log an EventHubsException, the code still seems to hit the default block in that switch statement and returns false. I am wondering if it is because it is a System.Exception {Azure.Messaging.EventHubs.EventHubsException}, or if something else is going on with the switch statement, or if I am misunderstanding how ExceptionRetryableFilter works. I was thinking ExceptionRetryableFilter would hit the block of code at the end that returns true so that it retries.
Never mind - it works fine after pulling in the latest changes.
{
    bool ExceptionRetryableFilter(Exception ee)
    {
        logger.LogError(new Exception("Encountered retryable exception", ee));
logger.LogError(new Exception("Encountered retryable exception", ee));
Would suggest just logging the original exception. We only get the first inner exception in the logs, and if the original exception is already wrapped we will miss the details. I like the message though; can we switch the message to info? i.e. "Encountered retryable exception {0}", ee.GetType()
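Roughly what that suggestion could look like (the LogTrace/LogError signatures below are assumptions about the logger's API, not the actual change):

bool ExceptionRetryableFilter(Exception ee)
{
    // Report the retryable type at an informational level, as suggested above...
    logger.LogTrace($"Encountered retryable exception {ee.GetType()}");

    // ...and log the original exception unwrapped so inner exception details are preserved.
    logger.LogError(ee);

    return true; // the real filter would still decide retryability per exception type
}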
Will do
return Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
The time between retries will be the lower of 30 seconds and the result of the Math.Pow function. Since the Pow function is based on the retry count, the initial retry should be 2 ^ 1 = 2 seconds.
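For reference, a quick illustrative snippet (not part of the change) showing the delays that expression produces:

// Prints the wait before each retry: 2, 4, 8, 16, 30, 30, 30 seconds (capped at 30).
using System;

for (int retryCount = 1; retryCount <= 7; retryCount++)
{
    var delay = TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount)));
    Console.WriteLine($"retry {retryCount}: {delay.TotalSeconds}s");
}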
Or do you mean that we should lower the maximum time between all retries to 10 seconds from 30?
I misread the code line. Thanks for the clarification. I think we are good here.
return Policy
    .Handle<Exception>(ExceptionRetryableFilter)
    .WaitAndRetryForeverAsync(retryCount => TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
Just to confirm, you're saying that we should no longer retry forever, as we were doing before, and instead limit the overall retry operation to 10 minutes?
If that is the case, should we also limit the overall retry operation in the Normalization Processor?
No, sorry - the intent is to cap the timespan for the exponential backoff. I misread the code; I think we are covered. Based on your comment above, the longest we will wait for a retry is 30 seconds.
Looks good. Added some suggestions.
@@ -34,15 +38,51 @@ public class Processor : IEventConsumer
    _templateManager = templateManager;
    _measurementImportService = measurementImportService;
    _logger = logger;
    _retryPolicy = CreateRetryPolicy(logger);
}

public async Task ConsumeAsync(IEnumerable<IEventMessage> events)
{
    EnsureArg.IsNotNull(events);
    EnsureArg.IsNotNull(_templateDefinition);
If the _templateDefinition is null, I think we will end up with the 100% CPU issue here. I think if you do something similar to what you did in the Normalize Processor, that would address this issue.
The EnsureArg is called outside of the retryPolicy.ExecuteAsync, so throwing an exception here shouldn't cause us to retry forever.
I do, however, see the EnsureArg check being made inside of the retryPolicy.ExecuteAsync in the Normalization Processor. I'll adjust that.
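Roughly the pattern being described, with the argument checks moved ahead of the retry (ProcessEventsAsync is a placeholder for whatever the processor actually does per batch):

public async Task ConsumeAsync(IEnumerable<IEventMessage> events)
{
    // Fail fast on bad arguments instead of retrying them forever.
    EnsureArg.IsNotNull(events, nameof(events));
    EnsureArg.IsNotNull(_templateDefinition, nameof(_templateDefinition));

    // Only the transient, retryable work runs under the retry policy.
    await _retryPolicy.ExecuteAsync(() => ProcessEventsAsync(events));
}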
@wi-y Any particular reason why we check the validity of the _templateDefinition outside of the constructor? From the looks of it, if it's null/empty there is never an opportunity to update it.
Currently, when an event incurs an error, the entire batch is retried forever. This PR utilizes Polly to perform the retry logic. Only a specific set of exceptions will be allowed to retry.
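Put together, the flow is roughly the following (a condensed sketch assembled from the snippets in the review; the class shape and the abbreviated exception list are illustrative):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Polly;
using Polly.Retry;

public class RetryingConsumerSketch
{
    private readonly AsyncRetryPolicy _retryPolicy;

    public RetryingConsumerSketch()
    {
        // Exponential backoff (2, 4, 8, 16 seconds...) capped at 30 seconds, retried indefinitely.
        _retryPolicy = Policy
            .Handle<Exception>(IsRetryable)
            .WaitAndRetryForeverAsync(retryCount =>
                TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, retryCount))));
    }

    // Only specific exception types are retried (see the filter discussion above);
    // anything else surfaces immediately instead of blocking the batch forever.
    private static bool IsRetryable(Exception ex) =>
        ex is HttpRequestException || ex is EventHubsException;

    public Task ConsumeAsync(Func<Task> processEventBatch) =>
        _retryPolicy.ExecuteAsync(processEventBatch);
}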