Personal/duburson/large msg perf improvements v2 #209
Conversation
…rocesses events immediately while emitting progress metrics and checkpointing when thresholds are hit.
* Update EventMessageJTokenCoverter to add properties directly to the JObject rather than new object declarations
* Add missing metrics to NormalizationStreamingEventConsumerService
Add DeviceEvents to new Normalization
* Send projected events as a batch rather than one at a time.
…EventNormalizationService + Normalize.Processor
* Move ITemplateManager and TemplateManager into the Ingest library so they can be referenced by the new class.
await _retryPolicy.ExecuteAsync(async () => await ConsumeAsyncImpl(events));
}

private Task<IContentTemplate> GetNormalizationTemplate()
One concern I have is that with a lower batch size (10), we are refreshing the template much more frequently. I'm going to look at ways to reduce the impact.
…n template if content on blob is modified.
await _retryPolicy.ExecuteAsync(async () => await ConsumeAsyncImpl(events));
}

private async Task<IContentTemplate> GetNormalizationTemplate()
Still not 100% happy with this. The main issue is that the code is executed per partition on different threads, hence the need for a semaphore to control concurrency when there isn't any horizontal scaling. A better approach long term would be to move this to a separate, injected class that monitors the file and updates the template when there are changes.
Note: a SemaphoreSlim is used over a ReaderWriterLockSlim because ReaderWriterLockSlim doesn't work with the async calls used in the function. Its state is stored in a thread-local variable, so when the awaited call returns, unless we get the same thread back, the lock is lost.
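For illustration, here is a minimal sketch of that pattern (the type and method names are hypothetical, not the actual PR code): a SemaphoreSlim guarding an async load, which holds across await points where a ReaderWriterLockSlim would not.

using System;
using System.Threading;
using System.Threading.Tasks;

// T would be IContentTemplate in this codebase.
public class AsyncCachedValue<T>
    where T : class
{
    // SemaphoreSlim exposes WaitAsync, so the lock can be held across awaits.
    // ReaderWriterLockSlim tracks ownership in thread-local state, which is
    // lost when an awaited call resumes on a different thread.
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);
    private T _value;

    public async Task<T> GetAsync(Func<Task<T>> loadAsync)
    {
        await _semaphore.WaitAsync();
        try
        {
            if (_value == null)
            {
                // Execution may resume on a different thread here; the
                // semaphore doesn't care which thread releases it.
                _value = await loadAsync();
            }

            return _value;
        }
        finally
        {
            _semaphore.Release();
        }
    }
}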
I had taken a stab at optimizing this in the past. My approach entailed wrapping the current TemplateManager in a CachingTemplateManager and caching the results for X minutes. The reason I liked this approach is that it eliminates the majority of network requests to the storage account.
A cache miss could result in multiple threads (one per partition) attempting to fill the cache. But I felt that is ok as the multiple attempts would fill the cache with the same data. Locking could also be implemented inside of the cache population delegate if we wanted to eliminate that.
The big downside here is we go from instantly picking up customer changes to forcing a new read at regular intervals. So that change would need to be articulated to customers. But even setting the interval to 60 seconds could save hundreds of network calls to the underlying storage account, which would be very useful.
Code is below
using System;
using Microsoft.Extensions.Caching.Memory;

namespace Microsoft.Health.Fhir.Ingest.Console.Template
{
    public class CachingTemplateManager : ITemplateManager
    {
        private readonly ITemplateManager _wrappedTemplateManager;
        private readonly IMemoryCache _templateCache;

        public CachingTemplateManager(
            ITemplateManager wrappedTemplateManager,
            IMemoryCache cache)
        {
            _wrappedTemplateManager = wrappedTemplateManager;
            _templateCache = cache;
        }

        public byte[] GetTemplate(string templateName)
        {
            // Distinct key so the byte[] and string entries don't collide.
            var key = $"{templateName}Bytes";

            return _templateCache.GetOrCreate(key, e =>
            {
                // Entries expire after one minute, bounding how stale a template can be.
                e.SetAbsoluteExpiration(TimeSpan.FromMinutes(1));
                return _wrappedTemplateManager.GetTemplate(templateName);
            });
        }

        public string GetTemplateAsString(string templateName)
        {
            return _templateCache.GetOrCreate(templateName, e =>
            {
                e.SetAbsoluteExpiration(TimeSpan.FromMinutes(1));
                return _wrappedTemplateManager.GetTemplateAsString(templateName);
            });
        }
    }
}
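For completeness, the wiring could look something like the following: a hedged sketch assuming an IServiceCollection named services, not the actual container setup in this repo.

// Hypothetical DI registration: AddMemoryCache supplies the IMemoryCache,
// and the caching manager decorates the blob-backed TemplateManager.
services.AddMemoryCache();
services.AddSingleton<TemplateManager>();
services.AddSingleton<ITemplateManager>(sp =>
    new CachingTemplateManager(
        sp.GetRequiredService<TemplateManager>(),
        sp.GetRequiredService<IMemoryCache>()));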
Thanks Rob, great suggestion, and something similar to this is what I was thinking we would transition to. I hadn't considered using a cache, but that makes sense. I was thinking we would have a task that fires every so often (5 minutes?), checks for changes, and updates the templates if needed.
A piece that I want to include is generating the full template, i.e. I think the TemplateManager should ultimately build and return the completed IContentTemplate. Otherwise each thread processing a partition is still building new IContentTemplates in memory, increasing our memory pressure and garbage collection.
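A rough sketch of that timer-based idea (the class name, interval, and build delegate are hypothetical placeholders):

using System;
using System.Threading;

// IContentTemplate comes from the ingest library; everything else here is
// a hypothetical sketch of the timer-based refresh idea.
public class PollingTemplateProvider : IDisposable
{
    private readonly Timer _timer;
    private volatile IContentTemplate _current;

    public PollingTemplateProvider(Func<IContentTemplate> buildTemplate)
    {
        // Build once up front, then rebuild every 5 minutes. Partition
        // threads read the prebuilt template instead of each constructing
        // their own copy, reducing memory pressure and garbage collection.
        _timer = new Timer(
            _ => _current = buildTemplate(),
            null,
            TimeSpan.Zero,
            TimeSpan.FromMinutes(5));
    }

    public IContentTemplate GetTemplate() => _current;

    public void Dispose() => _timer.Dispose();
}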
Agreed, it would make sense for the TemplateManager to build the completed template. We can always do a follow-up PR to use caching or your suggested timer-based approach. Both options reduce the number of network calls to the storage account as well as the cost of repeatedly building the completed IContentTemplate.
I think ultimately, we would want to use something like Blob Storage Events if possible. That way we can eliminate all unnecessary storage calls.
Actually, since the only way to update mappings is through a provisioning operation, is there some way we can set an environment variable (maybe the template CONTENT-MD5 hash) to signal when we need to fetch an updated mapping?
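A sketch of what that check might look like (the environment variable name and the surrounding flow are hypothetical, purely to illustrate the suggestion):

using System;

public class EnvHashTemplateRefresher
{
    private string _lastSeenHash;

    // Provisioning would set e.g. TEMPLATE_CONTENT_MD5 when mappings change;
    // we would only hit storage when the hash differs from what we last loaded.
    public bool TemplateChanged()
    {
        var currentHash = Environment.GetEnvironmentVariable("TEMPLATE_CONTENT_MD5");
        if (string.Equals(currentHash, _lastSeenHash, StringComparison.Ordinal))
        {
            return false;
        }

        _lastSeenHash = currentHash;
        return true;
    }
}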
Just to make sure I understand the suggestion: it is to not refresh the template during execution at all, and instead rely on the provisioning flow to trigger the update?
That is an approach we can take, though it doesn't address how new templates are picked up in OSS deployments. It is a good idea and a good way for us to improve this, but we will need different configurable strategies we can inject for OSS vs the managed service. We are looking at some improvements for how users can manage their templates; perhaps we can do this as part of that work stream. A sketch of what that seam might look like is below.
For now, I felt it was a pretty substantial departure from how we refresh templates today, so I wanted to be cautious and preserve the existing behavior.
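The configurable strategies could be as simple as an interface like this (hypothetical, not part of the PR):

using System.Threading.Tasks;

// Hypothetical seam: an OSS deployment could inject a polling strategy,
// while the managed service injects one driven by the provisioning signal.
public interface ITemplateRefreshStrategy
{
    Task<bool> ShouldRefreshAsync();
}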
Looks good. Just had a small suggestion on how we retrieve the template.
{
    semaphore.Release();
}
}
nit: Can this method (GetNormalizationTemplate()) and the semaphore be moved to the TemplateManager? Just curious if we can encapsulate all the logic that determines which template to use in a single class that we can update later if we want to.
The template manager is also used for the FHIR mapping templates, and I didn't want to impact that code significantly with this PR. But you are correct, this can and should be handled by its own class/responsibility. I am planning on addressing this in a follow-up PR (but it won't be part of next week's release).
{
    var template = await GetNormalizationTemplate();

    var normalizationBatch = new List<(string sourcePartition, IMeasurement measurement)>(50);
What is the mechanism that limits the number of events to 50?
This doesn't limit the list to 50; it is just the initial capacity of the list (the size of the internal array that backs it). If not set, the array size is 0 and then needs to be increased as elements are added. If there are more than 50 elements, the capacity will be expanded.
The intent is to set a reasonable starting capacity to avoid excessive new array allocations and array copies as the list grows. I believe the implementation of List doubles the capacity when it exceeds the limit.
See https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.list-1.-ctor?view=net-6.0#system-collections-generic-list-1-ctor(system-int32) for more details.
I was able to find the source code here: https://source.dot.net/#System.Private.CoreLib/List.cs,421. When growing, it does double the underlying array length. If no capacity is set, the default array size when it first grows is 4.
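A quick way to observe that growth pattern:

using System;
using System.Collections.Generic;

var list = new List<int>();
var lastCapacity = -1;

for (var i = 0; i < 60; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        // Prints capacities 4, 8, 16, 32, 64: the first allocation is 4,
        // then the backing array doubles each time it fills.
        Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");
        lastCapacity = list.Capacity;
    }
}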
Updated approach to the large message perf improvements: instead of streaming, keep the existing batch service but update the settings to lower thresholds (a buffer size of 10 instead of 100 and a wait time of 10 seconds instead of 30). This reduces overhead while still batching on egress for the perf benefits.
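In configuration terms, the change amounts to something like the following (the class and property names are illustrative, not the actual option keys in this repo):

using System;

// Hypothetical options object reflecting the new thresholds described above.
public class EventBatchingOptions
{
    // Was 100; a smaller buffer flushes sooner and reduces per-event latency.
    public int MaxEvents { get; set; } = 10;

    // Was 30 seconds; a shorter window bounds how long a partial batch waits.
    public TimeSpan FlushTimespan { get; set; } = TimeSpan.FromSeconds(10);
}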