Distributed Tracing #261

Merged: 21 commits, Jan 30, 2020

TsuyoshiUshio
Collaborator

Hi @cgillum

I'm creating this pull request to hand this branch off to you.
I'm opening it for visibility.

What I've done:

  • Correlation design for Orchestration and implementation
  • Working Sample

TODO

  • Error tracking for the case where an exception is thrown
  • ContinueAsNew design
  • Durable Functions Extension contribution

One of the test cases currently fails because correlation is not yet implemented for the multiple-orchestration retry case. I'm opening the PR anyway so the work can be shared.

@cgillum changed the title from "Distributed Tracing Hands Off" to "Distributed Tracing" on Mar 7, 2019
@cgillum added the "durable functions" (This issue is an ask from the Durable Functions team.) and "enhancement" labels on Mar 7, 2019
@TsuyoshiUshio
Collaborator Author

correlated

Done

  • Correlation with Orchestration
  • Correlation with Multiple Orchestration
  • Correlation with Exception
  • Correlation with Retry (Activity)
  • Correlation with Retry (Orchestrator)
  • Multi Layered Orchestration
  • Multi Layered Orchestration with Retry
  • Fan-Out Fan-In

TODO

  • Continue As New Testing
  • Automatically tracked dependencies with correlation
  • Correlation with the HTTP Correlation Protocol (currently only W3C TraceContext is supported)
  • Activity restoration (after .NET implements it)

@@ -132,7 +132,7 @@ public override string GetStatus()
/// <param name="context">The orchestration context</param>
/// <param name="input">The typed input</param>
/// <returns>The typed output</returns>
-public abstract Task<TResult> RunTask(OrchestrationContext context, TInput input);
+public abstract Task<TResult> RunTaskAsync(OrchestrationContext context, TInput input);
Member

This is a breaking change, which cannot be allowed.

Collaborator Author

Hi @cgillum

Sorry, I didn't notice that. I must have renamed it by mistake while refactoring. I'll revert it.

@TsuyoshiUshio TsuyoshiUshio force-pushed the correlation/refactor branch 2 times, most recently from 6b2c65b to f3d15b3 Compare March 9, 2019 23:35
@TsuyoshiUshio
Collaborator Author

Hi @cgillum, the CI is failing. It looks like the script can't resolve $ServiceBusConnectionString. I have no idea why. Any thoughts? If not, I'll ask the Azure DevOps guys.

@cgillum
Member

cgillum commented Mar 10, 2019

@TsuyoshiUshio I'm not sure. The CI passed earlier today for a different PR. I have not made any changes to the pipeline.

@TsuyoshiUshio
Collaborator Author

@cgillum I'm asking the Azure DevOps guys.

@@ -0,0 +1,38 @@
using Microsoft.VisualStudio.TestTools.UnitTesting;
Contributor

Add the file header and move the using directives into the namespace.

Collaborator Author

Done.

Member

Looks like this still needs to be done?

@@ -11,6 +11,8 @@
// limitations under the License.
// ----------------------------------------------------------------------------------

using System.Diagnostics;
Contributor

Move the using directive into the namespace.

Collaborator Author

Done

{

#if NETSTANDARD2_0
private static AsyncLocal<TraceContext> current = new AsyncLocal<TraceContext>();
Contributor

Don't use the private modifier, per the convention in the rest of the repo.

Collaborator Author

Hi @simonporter,
Before I fix this, could you clarify the convention? Should I use internal or the default (no modifier) in this case? Why is the private modifier disallowed? I'd like to understand the reason for future contributions to this repo, and I want to fix this properly.

Collaborator Author

Done. I changed private to the default (no explicit modifier).
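A side note on the two points in this thread, as a minimal sketch rather than the PR's code. In C#, a class member with no access modifier is private, so dropping the keyword does not change accessibility. AsyncLocal<T> carries a value along the async call flow, which is what makes it usable as an ambient "current" context. The AmbientTrace class and the string payload below are hypothetical stand-ins for the PR's TraceContext holder.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical holder, not the PR's class: a string stands in for TraceContext.
static class AmbientTrace
{
    // No explicit modifier: class members are private by default in C#.
    static readonly AsyncLocal<string> current = new AsyncLocal<string>();

    public static string Current
    {
        get => current.Value;
        set => current.Value = value;
    }
}

static class Program
{
    static async Task<string> ReadAfterAwaitAsync()
    {
        await Task.Yield();               // resume on a later continuation
        return AmbientTrace.Current;      // the caller's value flows to here
    }

    static async Task<string> SetInChildAsync()
    {
        AmbientTrace.Current = "child";   // visible inside this async method
        await Task.Yield();
        return AmbientTrace.Current;
    }

    static async Task Main()
    {
        AmbientTrace.Current = "parent";
        Console.WriteLine(await ReadAfterAwaitAsync());
        Console.WriteLine(await SetInChildAsync());
        // A value assigned inside an async callee does not flow back to the caller.
        Console.WriteLine(AmbientTrace.Current);
    }
}
```

The last line of Main is the part worth checking when relying on an ambient context: assignments made further down the async call chain are not seen by the caller.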

@TsuyoshiUshio
Collaborator Author

One note: I'm adding a dependency on the Microsoft.ApplicationInsights namespace to DurableTask.Core. This is only temporary. It is used for ActivityExtension, which enables W3C TraceContext. The Application Insights team is moving the extension to the System.Diagnostics namespace. That PR was merged 15 days ago and will be included in the next version of .NET Core: dotnet/corefx#33207

@TsuyoshiUshio
Collaborator Author

Hi @simonporter,

Does that work for the convention? I'm working with @cgillum on enabling Distributed Tracing for Durable Functions, and we are trying to make it available to other teams that use DurableTask as well. This PR is not the definitive version. If you don't like my implementation, feel free to discard it; the architecture and what I learned might still be helpful to a future implementer. Could you give me 30 minutes to explain this PR in person? I'll come to you (hopefully with @cgillum). This PR might be too big to review here.

@TsuyoshiUshio
Collaborator Author

The CI failures are a CI issue, not a code issue: public PRs lack the required permission.
I'm investigating it with my colleague.

@ShreyasRmsft

@TsuyoshiUshio thank you for bringing this to our notice. We (VsTest team) are working on getting distributed test runs to work with public projects. Will keep you updated.

@TsuyoshiUshio
Collaborator Author

Hi @ShreyasRmsft, thank you for telling me! Sounds exciting. Until then, we will just add PR contributors to some kind of group.

@TsuyoshiUshio
Collaborator Author

@ShreyasRmsft Could you tell me the best place to file this issue? I spent several days on it, and I don't want other people to have the same experience. Also, if it's OK, could you tell me when the fix is going to be published? I'm also an MSFT guy, but I can't identify you from your GitHub account. :)

@ShreyasRmsft

@TsuyoshiUshio
Collaborator Author

Thank you, @ShreyasRmsft , I've done that!

@TsuyoshiUshio
Collaborator Author

Hi @simonporter

This is a sample of the raw message with a two-layered orchestrator TraceContext. Please see the TraceContextStore part. I changed the serialization options only for this part, to support references; I don't want to change the serialization options for the whole message. If you have any ideas for reducing the message size or moving it out of the queue, please let me know. It is around 1500 bytes.

{"$type":"DurableTask.AzureStorage.MessageData","ActivityId":"76185f4d-bb4a-4ed4-857e-3e9eb1e42690","TaskMessage":{"$type":"DurableTask.Core.TaskMessage","Event":{"$type":"DurableTask.Core.History.TaskCompletedEvent","EventType":5,"TaskScheduledId":0,"Result":"\"Hello world with retry\"","EventId":-1,"IsPlayed":false,"Timestamp":"2019-03-21T17:48:43.5263421Z"},"SequenceNumber":0,"OrchestrationInstance":{"$type":"DurableTask.Core.OrchestrationInstance","InstanceId":"67b1704f68b049ae9d415d58aa0169c9:2","ExecutionId":"163b7b9a6efe40d586a68f844f7461b4"}},"CompressedBlobName":null,"SequenceNumber":9,"Episode":1,"TraceContextStore":{"$type":"DurableTask.Core.TraceContextStore","TraceContextJson":"{\"$id\":\"1\",\"$type\":\"DurableTask.Core.TraceContext, DurableTask.Core\",\"StartTime\":\"2019-03-21T17:48:43.526875+00:00\",\"ParentId\":\"|568a23ca-42c37626d2275a9f.568a23cf_568a23fc_568a2451_568a2456_568a2458_\",\"Traceparent\":\"00-c91fda74588ce04693a54a2ce6919c56-bab42c068a951740-02\",\"Tracestate\":null,\"ParentSpanId\":\"0463ebaf65d93843\",\"OrchestrationTraceContexts\":[{\"$id\":\"2\",\"$type\":\"DurableTask.Core.TraceContext, DurableTask.Core\",\"StartTime\":\"2019-03-21T17:47:26.5219035+00:00\",\"ParentId\":\"|568a23ca-42c37626d2275a9f.568a23cf_568a23fc_\",\"Traceparent\":\"00-c91fda74588ce04693a54a2ce6919c56-2158e5eba664fb4a-02\",\"Tracestate\":null,\"ParentSpanId\":\"cf0bcc1ba2f5f34f\",\"OrchestrationTraceContexts\":[{\"$ref\":\"2\"}],\"RootId\":\"00-c91fda74588ce04693a54a2ce6919c56-2158e5eba664fb4a-02\"},{\"$id\":\"3\",\"$type\":\"DurableTask.Core.TraceContext, 
DurableTask.Core\",\"StartTime\":\"2019-03-21T17:48:43.4696347+00:00\",\"ParentId\":\"|568a23ca-42c37626d2275a9f.568a23cf_568a23fc_568a2451_\",\"Traceparent\":\"00-c91fda74588ce04693a54a2ce6919c56-d84f945c3a626642-02\",\"Tracestate\":null,\"ParentSpanId\":\"2158e5eba664fb4a\",\"OrchestrationTraceContexts\":[{\"$ref\":\"2\"},{\"$ref\":\"3\"}],\"RootId\":\"00-c91fda74588ce04693a54a2ce6919c56-d84f945c3a626642-02\"}],\"RootId\":\"00-c91fda74588ce04693a54a2ce6919c56-bab42c068a951740-02\"}"}}
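The Traceparent values in the message above follow the W3C Trace Context layout (version, trace-id, parent-id, flags, separated by hyphens), and all three share the same trace-id. As a rough illustration of how such a value breaks down, here is a minimal parsing sketch. ParsedTraceparent and TraceparentParser are hypothetical helper names, not types from this PR, and the sketch accepts only the version 00 layout.

```csharp
using System;
using System.Globalization;

// Hypothetical helper types for illustration; they are not part of the PR.
sealed class ParsedTraceparent
{
    public string TraceId { get; set; }
    public string ParentId { get; set; }
    public byte Flags { get; set; }

    // Bit 0 of the flags field is the W3C "sampled" flag.
    public bool Sampled => (Flags & 0x01) != 0;
}

static class TraceparentParser
{
    // Accepts only the version 00 layout: 00-<32 hex>-<16 hex>-<2 hex>.
    public static bool TryParse(string value, out ParsedTraceparent result)
    {
        result = null;
        if (string.IsNullOrEmpty(value)) return false;

        string[] parts = value.Split('-');
        if (parts.Length != 4 || parts[0] != "00") return false;
        if (!IsLowerHex(parts[1], 32) || !IsLowerHex(parts[2], 16) || !IsLowerHex(parts[3], 2))
        {
            return false;
        }

        // The specification treats all-zero trace and parent ids as invalid.
        if (parts[1].Trim('0').Length == 0 || parts[2].Trim('0').Length == 0) return false;

        result = new ParsedTraceparent
        {
            TraceId = parts[1],
            ParentId = parts[2],
            Flags = byte.Parse(parts[3], NumberStyles.HexNumber, CultureInfo.InvariantCulture),
        };
        return true;
    }

    static bool IsLowerHex(string s, int length)
    {
        if (s.Length != length) return false;
        foreach (char c in s)
        {
            bool ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
            if (!ok) return false;
        }
        return true;
    }
}

static class Program
{
    static void Main()
    {
        // Outermost Traceparent value copied from the sample message.
        const string sample = "00-c91fda74588ce04693a54a2ce6919c56-bab42c068a951740-02";
        if (TraceparentParser.TryParse(sample, out ParsedTraceparent parsed))
        {
            Console.WriteLine($"trace-id:  {parsed.TraceId}");
            Console.WriteLine($"parent-id: {parsed.ParentId}");
            Console.WriteLine($"sampled:   {parsed.Sampled}");
        }
    }
}
```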

@TsuyoshiUshio
Collaborator Author

Hi @simonporter , @cgillum ,

I implemented the test for ContinueAsNew. With the current behavior, an orchestration created by ContinueAsNew is treated as the same orchestration, because isReplay is true after ContinueAsNew. If you don't like this, we could change it so that ContinueAsNew starts a new orchestration in the trace. However, the current behavior looks fine to me as well.

image

@@ -612,6 +614,7 @@ static TaskHubInfo GetTaskHubInfo(string taskHub, int partitionCount)
session.StartNewLogicalTraceScope();

List<MessageData> outOfOrderMessages = null;
var current = Activity.Current; // Correlation for checking the state of the Activity.Current. TODO remove
Contributor

Is this still needed?

Collaborator Author

Removed

// ----------------------------------------------------------------------------------

namespace DurableTask.Core
{
Contributor

As discussed, could this be refactored to encapsulate Activity, removing the need to choose between the two?

Collaborator Author

I introduced a new structure that encapsulates Activity in the TraceContext object. The new names are:

  • TraceContextBase (abstract class)
  • W3CTraceContext (implementation of TraceContextBase)
  • HttpCorrelationProtocolTraceContext (implementation of TraceContextBase)

To enable this, I created a factory that reads the protocol setting from the configuration.
These two contexts expose a limited set of properties. Some properties are not serialized, to reduce the size of the message. This requires special serialization, which is handled in TraceContextStore.
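The structure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the Protocol enum, the class names ending in Sketch, the HeaderName property, and the factory signature are all hypothetical and do not reproduce the PR's TraceContextBase API. The header names themselves (traceparent for W3C Trace Context, Request-Id for the HTTP Correlation Protocol) come from the respective protocols.

```csharp
using System;

// Hypothetical enum standing in for the PR's protocol configuration setting.
enum Protocol
{
    W3CTraceContext,
    HttpCorrelationProtocol,
}

// Hypothetical base class: callers depend on this type only,
// so they never have to branch on the protocol themselves.
abstract class TraceContextSketchBase
{
    public DateTimeOffset StartTime { get; } = DateTimeOffset.UtcNow;

    // Name of the HTTP header that carries the correlation id for this protocol.
    public abstract string HeaderName { get; }
}

sealed class W3CTraceContextSketch : TraceContextSketchBase
{
    public override string HeaderName => "traceparent";
}

sealed class HttpCorrelationProtocolSketch : TraceContextSketchBase
{
    public override string HeaderName => "Request-Id";
}

// Hypothetical factory: the single place that maps the setting to a subclass.
static class TraceContextSketchFactory
{
    public static TraceContextSketchBase Create(Protocol protocol)
    {
        switch (protocol)
        {
            case Protocol.W3CTraceContext:
                return new W3CTraceContextSketch();
            case Protocol.HttpCorrelationProtocol:
                return new HttpCorrelationProtocolSketch();
            default:
                throw new ArgumentOutOfRangeException(nameof(protocol));
        }
    }
}

static class Program
{
    static void Main()
    {
        foreach (Protocol protocol in (Protocol[])Enum.GetValues(typeof(Protocol)))
        {
            TraceContextSketchBase context = TraceContextSketchFactory.Create(protocol);
            Console.WriteLine($"{protocol}: {context.HeaderName}");
        }
    }
}
```

Keeping the protocol decision inside one factory is what addresses the reviewer's request: the rest of the code works against the base type and does not need to choose between the two protocols.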

{
/// <summary>
/// TraceMessages for Distributed Tracing
/// </summary>
Contributor

Should this be a static class? It might also make sense to rename it to TraceConstants.

Collaborator Author

Done

/// For the Newton.json 7.0.1 requires JsonSerializerSettings. It will all so requires DurableTask users.
/// To avoid this breaking change, I provide TraceContextStore which support the serialization/deserialization of the <see cref="TraceContext"/>
/// </summary>
public class TraceContextStore
Contributor

Since this is only used while storing to and retrieving from MessageData, would it make sense to handle this simply as a property and constructor on TraceContext? We could also use a custom JsonConverter attribute on TraceContext, since it seems it needs to be serialized in a special way. Another option is to handle it via a private JsonProperty on MessageData. In case none of these is possible, consider renaming it, maybe to SerializableTraceContext.

Collaborator Author

I removed TraceContextStore and moved its functionality into TraceContextBase, implemented as a property and a static method. I tried using a constructor, but that requires additional logic to copy the restored object into the target object, which could be error prone. Instead, I created a Restore method: pass it the JSON string and it creates the matching TraceContextBase subclass object.

@@ -341,6 +341,11 @@ public void HandleTaskFailedEvent(TaskFailedEvent failedEvent)
var taskFailedException = new TaskFailedException(failedEvent.EventId, taskId, info.Name, info.Version,
failedEvent.Reason, cause);

// correlation
var traceContext = CorrelationTraceContext.Current;
Contributor

Are traceContext and activity used?

Collaborator Author

Removed

if (orchestrationState.OrchestrationStatus == OrchestrationStatus.Completed)
{
CorrelationTraceClient.TrackDepencencyTelemetry(dependencyActivity);
}
Contributor

Why are these all separate ifs instead of a single if ((orchestrationState.OrchestrationStatus == OrchestrationStatus.Completed) || (orchestrationState.OrchestrationStatus == OrchestrationStatus.Failed))?

Collaborator Author

Done

Guid traceActivityId = Guid.NewGuid();
var session = new ActivitySession(this.storageAccountName, this.settings.TaskHubName, message, traceActivityId);
session.StartNewLogicalTraceScope();

// correlation
string name = $"{TraceMessages.Activity} {((TaskScheduledEvent)session.MessageData.TaskMessage.Event)?.Name?.GetTargetClassName()}";
Contributor

@simonporter to confirm here, but I don't believe this custom logic to get the class name would always hold, particularly in the case of a custom NameVersion object manager. Can this be moved one level up, to the TaskOrchestrationManager in Core? There you would have access to the object manager and could get the class name more simply.

Collaborator Author

@adarsh1 Where can I find the TaskOrchestrationManager? I can't find it.

Contributor

Sorry, I meant the TaskActivityDispatcher.

currentTraceContext = currentTraceContext ?? current.CreateTraceContext();

data.TraceContextStore = TraceContextStore.Create(currentTraceContext);
rawContent = await messageManager.SerializeMessageDataAsync(data);
Contributor

It looks like data is the only thing that changes between the if and the else. Should the rawContent initialization simply move out of this if/else structure?

Collaborator Author

Done

@@ -35,4 +35,5 @@
<ItemGroup>
<ProjectReference Include="..\DurableTask.Core\DurableTask.Core.csproj" />
</ItemGroup>

Contributor

Undo if no change.

Collaborator Author

Done

@TsuyoshiUshio
Collaborator Author

Thank you @adarsh1 ! I'll do it as we discussed. I'll start this week. :) I appreciate your review.

@TsuyoshiUshio
Collaborator Author

TsuyoshiUshio commented May 6, 2019

Hi @adarsh1 , @cgillum

I've addressed all of your requests. Please review again. I checked that the scenario tests pass, and I ran the sample and confirmed that the same tracing shows up in the portal.

TODO

  • Use CI to check that there is no effect on the ServiceBus side (even though it doesn't implement Distributed Tracing). (We might add my GitHub e-mail account to the CI to validate.)
  • Rebase. (I haven't rebased onto the current master, to keep the commit log intact for review. Once the review is OK, I can rebase and resolve conflicts, if any.)
  • Library project. (DurableTaskCorrelationTelemetryInitializer and TraceContextBaseExtensions should be shared. However, I'm not sure where to put them, since they reference Application Insights. Maybe DurableTask.Correlation or something?)
  • More unit testing for TraceContextBase and its subclasses.

/// This property is not serialized.
/// </summary>
[JsonIgnore]
public Activity CurrentActivity { get; set; } // TODO make it protected after refactoring is done.
Contributor

make protected, or remove TODO

Collaborator Author

I double-checked the dependencies. This property is used by the subclasses, the factory, and the test class, and I'd like only those three to be able to use it. I made it internal and removed the TODO.

Traceparent = CurrentActivity.GetTraceparent();
Tracestate = CurrentActivity.GetTracestate();
ParentSpanId = CurrentActivity.GetParentSpanId();
// ParentId = activity.Id, // TODO check if it is not necessary
Contributor

Is this still necessary?

Collaborator Author

Removed

public override TimeSpan Duration => CurrentActivity?.Duration ?? DateTimeOffset.UtcNow - StartTime;

/// <inheritdoc />
public override string TelemetryId => throw new NotImplementedException(); // TODO Implement the HTTPCorrelation Protocol
Contributor

Is this safe? Would things fail if HttpCorrelationProtocol is used?

Collaborator Author

HttpCorrelationProtocol is not fully tested. The App Insights team now encourages using W3C, so I started implementing and testing with W3C. I have implemented HttpCorrelationProtocol as well, and it should work, but confirming that requires a scenario test with a TelemetryInitializer for HttpCorrelationProtocol.

public string SerializableTraceContext =>
JsonConvert.SerializeObject(this, new JsonSerializerSettings()
{
TypeNameHandling = TypeNameHandling.Objects,
Contributor

We really should be using a custom JsonConverter here, if we need special handling.
If we don't have the time to do that for now maybe a TODO would be ok?

Collaborator Author

I added a TODO.

Contributor

@adarsh1 adarsh1 left a comment

Had a few more nits around existing TODOs; other than that, looks OK to me.

@TsuyoshiUshio
Collaborator Author

Thank you for your review, @adarsh1. Hi @cgillum, can I add my GitHub account to the Azure DevOps task? I'd like to verify that this PR doesn't break the existing tests.

Collaborator Author

@TsuyoshiUshio TsuyoshiUshio left a comment

Fixed the requested issues.

@TsuyoshiUshio
Collaborator Author

Hi @cgillum ,

I fixed the issues you requested. However, a few need to be discussed with you, so I left them as unresolved comments. Could you have a look at these three?

  1. TelemetryActivator's TelemetryClient should be static; I left a comment.
  2. samples/Correlation.Samples/Program.cs: I left the comments in to show the available demo scenarios.
  3. How? (see the following picture)
    image

@cgillum
Member

cgillum commented Jan 20, 2020

Hi @TsuyoshiUshio - I have reviewed your latest updates and responded to your questions. There are also a few requested fixes that have not been made yet. Please take a look.

Collaborator Author

@TsuyoshiUshio TsuyoshiUshio left a comment

fix issues.

@TsuyoshiUshio
Collaborator Author

Hi @cgillum, I fixed them. Only one concern is left.

  1. I fixed the static TelemetryClient issue and found the root cause. I'd like to hear your opinion on the design decision. Please refer to the comment. (I left it unresolved.)

@TsuyoshiUshio
Collaborator Author

TsuyoshiUshio commented Jan 27, 2020

Hi @cgillum, this is a friendly reminder. :) Please have a look. If you are OK with it, let's merge it.
If not, I'd be happy to fix or refactor some parts.

@TsuyoshiUshio TsuyoshiUshio requested a review from cgillum January 30, 2020 15:30
@jkerak

jkerak commented Jan 30, 2020

Hi. I'm not sure if this is the correct place to ask this question (I can't find a link to some kind of issue or feature request). Is there any kind of limit to number of executions that will be logged (and correlated) to AI? In some cases, I may have tens of thousands of activity function executions within a single orchestrator execution. I'm also curious what an eternal orchestrator would look like. To be clear, I'm really asking about what the limit of the app insights UI might be. I'm sure log analytics would store whatever is sent to it.

Member

@cgillum cgillum left a comment

OK - thanks for your patience. I suggest one more small change and then let's merge this.

@TsuyoshiUshio
Collaborator Author

Hi @cgillum I fixed all of the issues.

@cgillum cgillum merged commit 13b6d4c into Azure:correlation Jan 30, 2020
@cgillum
Member

cgillum commented Jan 30, 2020

Thanks! I will merge this now. Glad to get this first step finally checked in. :)

@TsuyoshiUshio
Collaborator Author

Hi @jkerak

You are using tens of thousands of activity functions in a single orchestrator, right? I'm not sure about the limits on the Durable side, but Application Insights has some quotas: https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-custom-events-metrics#limits This issue might be a good place to talk about correlation: Azure/azure-functions-durable-extension#939

@TsuyoshiUshio
Collaborator Author

Thank you, @cgillum, for your kind support of this PR. I'll move on to implementing the Durable Functions side.

TsuyoshiUshio added a commit to TsuyoshiUshio/durabletask that referenced this pull request Mar 21, 2020
Labels
durable functions This issue is an ask from the Durable Functions team. enhancement
6 participants