Patch correlation compression #649
Conversation
Co-authored-by: Chris Gillum <[email protected]>
Thanks for finding the fix for this! I'm fine with going light on testing for this. I actually created a separate PR earlier today that includes a lighter-weight distributed tracing implementation that should never run into this kind of problem, which I think will be the longer-term solution once it's ready.
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
LGTM!
/azp run
Supersedes: #639
**Bug description**: After enabling Distributed Tracing, Azure Table Storage would sometimes raise an exception claiming that some field exceeded the size limitations for an entry. After investigation, we determined that this field was the `Correlation` property of the History table, which Distributed Tracing populates and relies upon.

**Root cause**: The cause of this issue was in the `UpdateStateAsync` method of `AzureTableTrackingStore`. Normally, our property-compression step (`CompressLargeMessageAsync`) should have avoided any size-limitation exceptions, but Distributed Tracing updates the `Correlation` field of the History table with raw correlation data *after* the compression step.

**How to fix (this PR)**: This exception was due to an ordering issue: we need to make sure Correlation Tracing propagates/updates all its correlation data prior to the compression step. To do this, we simply move the `CorrelationTraceClient.Propagate(...)` call in `UpdateStateAsync` to be immediately before the call to `CompressLargeMessageAsync`.
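To make the reordering concrete, here is a minimal self-contained sketch. Apart from the `Propagate` and `CompressLargeMessageAsync` names, every type, helper, and the entity representation below is an illustrative stand-in, not the actual `AzureTableTrackingStore` code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for AzureTableTrackingStore, reduced to the ordering that
// matters for this bug.
class TrackingStoreSketch
{
    readonly Dictionary<string, string> entity = new Dictionary<string, string>();

    public async Task UpdateStateAsync(string correlationData)
    {
        // This PR: propagate correlation data FIRST, so the compression
        // step below sees (and can shrink) a large Correlation property.
        Propagate(() => this.entity["Correlation"] = correlationData);

        await CompressLargeMessageAsync(this.entity);

        // Previously, Propagate(...) effectively ran here, AFTER the
        // compression step, so raw correlation data reached the table
        // write uncompressed and could exceed the per-property size limit.
        await WriteEntityToTableAsync(this.entity);
    }

    static void Propagate(Action setCorrelation) => setCorrelation();

    static Task CompressLargeMessageAsync(Dictionary<string, string> entity)
    {
        // Stand-in for the real compress/offload step: any oversized
        // property is replaced by a small reference.
        foreach (string key in new List<string>(entity.Keys))
        {
            if (entity[key].Length > 32 * 1024) // ~64 KB UTF-16 string limit
            {
                entity[key] = $"[offloaded:{key}]";
            }
        }

        return Task.CompletedTask;
    }

    static Task WriteEntityToTableAsync(Dictionary<string, string> entity) =>
        Task.CompletedTask;
}
```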
**On tests**: I would like to test this, but after spending a few hours fiddling with the Correlation Tracing infrastructure, I'm starting to suspect that we don't currently have a means of testing this without making very intrusive changes to the Correlation Tracing design. Let me explain below.
In spirit, the test is simple: we need to create a large enough correlation data string to trigger our compression step, and then validate that Azure Table Storage does not error out, meaning that no raw Correlation data escaped the compression (see the sketch below).
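If we did have a way to inject the payload, the in-spirit test might look roughly like this MSTest sketch. The fake store and every name in it are hypothetical stand-ins, since (as explained below) we currently cannot reach the real orchestration trace:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CorrelationCompressionSketchTests
{
    // Hypothetical stand-in for the real tracking store; only the
    // propagate -> compress -> write ordering is modeled.
    class FakeTrackingStore
    {
        public string InjectedCorrelation { get; set; } // the seam we lack today
        readonly Dictionary<string, string> entity = new Dictionary<string, string>();

        public Task UpdateStateAsync()
        {
            this.entity["Correlation"] = this.InjectedCorrelation; // Propagate(...)
            Compress(this.entity);                                 // CompressLargeMessageAsync
            Write(this.entity);                                    // table write
            return Task.CompletedTask;
        }

        static void Compress(Dictionary<string, string> entity)
        {
            foreach (string key in new List<string>(entity.Keys))
            {
                if (entity[key].Length > 32 * 1024)
                {
                    entity[key] = $"[offloaded:{key}]";
                }
            }
        }

        static void Write(Dictionary<string, string> entity)
        {
            foreach (string value in entity.Values)
            {
                if (value.Length > 32 * 1024)
                {
                    throw new InvalidOperationException("Property exceeds the size limit.");
                }
            }
        }
    }

    [TestMethod]
    public async Task LargeCorrelationPayloadDoesNotEscapeCompression()
    {
        var store = new FakeTrackingStore
        {
            // Large enough to trip the ~64 KB per-property string limit.
            InjectedCorrelation = new string('x', 128 * 1024),
        };

        // Must not throw: because Propagate runs before compression, the
        // oversized Correlation property is offloaded before the write.
        await store.UpdateStateAsync();
    }
}
```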
There are two challenges with that test in practice:

(1) We don't know exactly what causes Correlation data to grow very large, so I don't have a "natural" way of generating a large Correlation payload. To test this, I'm currently opting to simply set the Correlation data string to a very large value.

(2) However, during unit testing, it seems to me that we don't have granular enough control to set this large Correlation payload past the initial (client) trace.
In Distributed Tracing, we associate several "traces" with one another. For example, we associate a durable client's "trace" with an orchestration "trace" later in the execution. The issue here is that, while we do have access to the client's trace when setting up a test (which would allow us to make its correlation-string arbitrarily long), we do not have direct access to the orchestrator's trace, and that is the relevant trace to modify to trigger this behavior.
In other words, we need a means of setting the correlation data string at the time the orchestration trace is generated, and currently I don't know of any good way to do that without refactoring `CorrelationTraceClient` into a more unit-testing-friendly design (a possible shape is sketched below), which would make this tiny PR much larger in scope.

My bias here would be to create a new Distributed Tracing work item to consider a refactor of Distributed Tracing that facilitates this scenario, and then to merge this PR after all tests pass. I feel comfortable suggesting this since the change is quite small, the fix was validated by the affected user, and it applies to a still-in-preview feature.
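For what such a refactor might look like, one hypothetical shape is a small injectable seam for the correlation payload. Nothing below exists in `CorrelationTraceClient` today; this is purely a sketch of one possible design:

```csharp
using System;

// Hypothetical test seam: none of these members exist in the real
// CorrelationTraceClient; this only sketches one possible refactor.
public interface ICorrelationDataSource
{
    // Invoked when an orchestration trace is generated; a unit test could
    // return an arbitrarily large string to drive the compression path.
    string GetCorrelationData();
}

public static class CorrelationTraceClientSketch
{
    // Production code keeps the default; tests swap in a stub.
    public static ICorrelationDataSource DataSource { get; set; } =
        new DefaultCorrelationDataSource();

    class DefaultCorrelationDataSource : ICorrelationDataSource
    {
        public string GetCorrelationData() => string.Empty; // real logic here
    }
}
```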
All that said, I'm always open to discussion, and I'm open to pair-programming this in case we can think of a clear way to test it :). Perhaps my unfamiliarity with this side of the codebase prevents me from seeing an easier way. Thanks!