Patch correlation compression #649
Conversation
Co-authored-by: Chris Gillum <[email protected]>
Thanks for finding the fix for this! I'm fine with going light on testing for this. I actually created a separate PR earlier today that includes a lighter-weight distributed tracing implementation that should never run into this kind of problem, which I think will be the longer-term solution once it's ready.
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
LGTM!
/azp run
Supersedes: #639
**Bug description**: After enabling Distributed Tracing, Azure Table Storage would sometimes raise an exception claiming that some field exceeded the size limitations for an entry. After investigation, we determined that this field was the `Correlation` property of the History table, which Distributed Tracing populates and relies upon.

**Root cause**: The cause of this issue was in the `UpdateStateAsync` method of `AzureTableTrackingStore`. Normally, our property-compression step (`CompressLargeMessageAsync`) should have avoided any size-limitation exceptions, but Distributed Tracing updates the `Correlation` field of the History table with raw correlation data *after* the compression step.

**How to fix (this PR)**: This exception was due to an ordering issue: we need to make sure Correlation Tracing propagates/updates all its correlation data prior to the compression step. To do this, we simply move the `CorrelationTraceClient.Propagate(...)` call in `UpdateStateAsync` to be immediately before the call to `CompressLargeMessageAsync`.
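To make the reordering concrete, here is a minimal self-contained sketch. Apart from the `Propagate` and `CompressLargeMessageAsync` names, every type, helper, and the entity representation below is an illustrative stand-in, not the actual `AzureTableTrackingStore` code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for AzureTableTrackingStore, reduced to the ordering that
// matters for this bug.
class TrackingStoreSketch
{
    readonly Dictionary<string, string> entity = new Dictionary<string, string>();

    public async Task UpdateStateAsync(string correlationData)
    {
        // This PR: propagate correlation data FIRST, so the compression
        // step below sees (and can shrink) a large Correlation property.
        Propagate(() => this.entity["Correlation"] = correlationData);

        await CompressLargeMessageAsync(this.entity);

        // Previously, Propagate(...) effectively ran here, AFTER the
        // compression step, so raw correlation data reached the table
        // write uncompressed and could exceed the per-property size limit.
        await WriteEntityToTableAsync(this.entity);
    }

    static void Propagate(Action setCorrelation) => setCorrelation();

    static Task CompressLargeMessageAsync(Dictionary<string, string> entity)
    {
        // Stand-in for the real compress/offload step: any oversized
        // property is replaced by a small reference.
        foreach (string key in new List<string>(entity.Keys))
        {
            if (entity[key].Length > 32 * 1024) // ~64 KB UTF-16 string limit
            {
                entity[key] = $"[offloaded:{key}]";
            }
        }

        return Task.CompletedTask;
    }

    static Task WriteEntityToTableAsync(Dictionary<string, string> entity) =>
        Task.CompletedTask;
}
```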
**On tests**: I would like to test this, but after spending a few hours fiddling with the Correlation Tracing infrastructure, I'm starting to suspect that we don't currently have a means of testing this without making very intrusive changes to the Correlation Tracing design. Let me explain below.
In spirit, the test is simple: we need to create a large enough correlation data string to trigger our compression step, and then validate that Azure Table Storage does not error out, meaning that no raw Correlation data escaped the compression (see the sketch below).
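If we did have a way to inject the payload, the in-spirit test might look roughly like this MSTest sketch. The fake store and every name in it are hypothetical stand-ins, since (as explained below) we currently cannot reach the real orchestration trace:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CorrelationCompressionSketchTests
{
    // Hypothetical stand-in for the real tracking store; only the
    // propagate -> compress -> write ordering is modeled.
    class FakeTrackingStore
    {
        public string InjectedCorrelation { get; set; } // the seam we lack today
        readonly Dictionary<string, string> entity = new Dictionary<string, string>();

        public Task UpdateStateAsync()
        {
            this.entity["Correlation"] = this.InjectedCorrelation; // Propagate(...)
            Compress(this.entity);                                 // CompressLargeMessageAsync
            Write(this.entity);                                    // table write
            return Task.CompletedTask;
        }

        static void Compress(Dictionary<string, string> entity)
        {
            foreach (string key in new List<string>(entity.Keys))
            {
                if (entity[key].Length > 32 * 1024)
                {
                    entity[key] = $"[offloaded:{key}]";
                }
            }
        }

        static void Write(Dictionary<string, string> entity)
        {
            foreach (string value in entity.Values)
            {
                if (value.Length > 32 * 1024)
                {
                    throw new InvalidOperationException("Property exceeds the size limit.");
                }
            }
        }
    }

    [TestMethod]
    public async Task LargeCorrelationPayloadDoesNotEscapeCompression()
    {
        var store = new FakeTrackingStore
        {
            // Large enough to trip the ~64 KB per-property string limit.
            InjectedCorrelation = new string('x', 128 * 1024),
        };

        // Must not throw: because Propagate runs before compression, the
        // oversized Correlation property is offloaded before the write.
        await store.UpdateStateAsync();
    }
}
```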
There are two challenges with that test in practice:

(1) We don't know exactly what causes Correlation data to grow very large, so I don't have a "natural" way of generating a large Correlation payload. To test this, I'm currently opting to simply set the Correlation data string to a very large value.

(2) However, during unit testing, it seems to me that we don't have granular enough control to set this large Correlation payload past the initial (client) trace.
In Distributed Tracing, we associate several "traces" with one another. For example, we associate a durable client's "trace" with an orchestration "trace" later in the execution. The issue here is that, while we do have access to the client's trace when setting up a test (which would allow us to make its correlation-string arbitrarily long), we do not have direct access to the orchestrator's trace, and that is the relevant trace to modify to trigger this behavior.
In other words, we need a means of setting the correlation data string at the time the orchestration trace is generated, and currently I don't know of any good way to do that without refactoring `CorrelationTraceClient` into a more unit-testing-friendly design (a possible shape is sketched below), which would make this tiny PR much larger in scope.

My bias here would be to create a new Distributed Tracing work item to consider a refactor of Distributed Tracing that facilitates this scenario, and then to merge this PR after all tests pass. I feel comfortable suggesting this since the change is quite small, the fix was validated by the affected user, and it applies to a still-in-preview feature.
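For what such a refactor might look like, one hypothetical shape is a small injectable seam for the correlation payload. Nothing below exists in `CorrelationTraceClient` today; this is purely a sketch of one possible design:

```csharp
using System;

// Hypothetical test seam: none of these members exist in the real
// CorrelationTraceClient; this only sketches one possible refactor.
public interface ICorrelationDataSource
{
    // Invoked when an orchestration trace is generated; a unit test could
    // return an arbitrarily large string to drive the compression path.
    string GetCorrelationData();
}

public static class CorrelationTraceClientSketch
{
    // Production code keeps the default; tests swap in a stub.
    public static ICorrelationDataSource DataSource { get; set; } =
        new DefaultCorrelationDataSource();

    class DefaultCorrelationDataSource : ICorrelationDataSource
    {
        public string GetCorrelationData() => string.Empty; // real logic here
    }
}
```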
All that said, I'm always open to discussion, and I'm open to pair-programming this in case we can think of a clear way to test it :). Perhaps my unfamiliarity with this side of the codebase prevents me from seeing an easier way. Thanks!