Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync and Async children (FOLLOWS_FROM) #65

Closed
tedsuo opened this issue Apr 22, 2019 · 29 comments
Closed

Sync and Async children (FOLLOWS_FROM) #65

tedsuo opened this issue Apr 22, 2019 · 29 comments
Assignees
Labels
area:api Cross language API specification issue area:sampling Related to trace sampling area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA spec:protocol Related to the specification/protocol directory spec:trace Related to the specification/trace directory
Milestone

Comments

@tedsuo
Copy link
Contributor

tedsuo commented Apr 22, 2019

In OpenTracing, we have CHILD_OF and FOLLOWS_FROM. In the new project, we are considering whether to include this concept as a flag on the SpanBuilder option when setting the span parent. The new naming is proposed to be sync and async children, to make the relationship more clear.

Reference PR: open-telemetry/opentelemetry-java#130

Questions:

  • Do we still want this at all? It can be useful for critical path and other types of trace analysis.
  • Do we also need an unknown flag as well?
@tedsuo tedsuo changed the title Boolean field to set a Span as asynchronous Sync and Async children (FOLLOWS_FROM) Apr 22, 2019
@tylerbenson
Copy link
Member

tylerbenson commented May 2, 2019

@tedsuo How would you label server span with a parent propagated from an http request?

Perhaps a better name would be direct/indirect?
(Anything pulled directly from the current context would be direct otherwise indirect)

Perhaps that wouldn't give you the info required for critical path analysis you need... part of the problem is only the caller really knows if they're blocking, where the link/relationship is established by the callee.

{I guess some of this was already discussed in #14}

@rochdev
Copy link
Member

rochdev commented May 22, 2019

I don't think sync and async are the correct semantics for this. My understanding of ChildOf and FollowsFrom is that ChildOf is the previous operation that directly created the new span, and FollowsFrom is any previous operation that indirectly caused the new span. This is unrelated to the references being asynchronous or not. Both potentially have value but they are conceptually different.

An example of this in JavaScript is promises. The ChildOf would generally be the span in the scope where promise.then() was called, and FollowsFrom would be the span where resolve() was called for example.

My take on the questions above:

Do we still want this at all? It can be useful for critical path and other types of trace analysis.

I think both ChildOf/FollowsFrom and sync/async potentially have value for different reasons.

Do we also need an unknown flag as well?

What is the case where this would not be known? I think this is something that is always known in advance.

@carlosalberto
Copy link
Contributor

@rochdev Hey, sorry for the late answer.

Agreed with what you wrote - but what names you think we should use? @tylerbenson already mentioned direct/indirect as options, and if you have something in mind feel free to propose it.

@rochdev
Copy link
Member

rochdev commented Jun 3, 2019

I think ChildOf and FollowsFrom made a lot of sense. To be honest, semantic-wise I think OpenTracing got a lot of things right.

How was this relationship called in OpenCensus?

@bogdandrutu
Copy link
Member

@rochdev I am worried that the understanding that of childOf and followsFrom is different for you than for others. I think what you explained is different than what others explained to me, I am very confused now about what is the correct meaning of childOf vs followsFrom.

@SergeyKanzhelev
Copy link
Member

Moving to the API revision milestone on specification. We need more feedback collected

@SergeyKanzhelev SergeyKanzhelev transferred this issue from open-telemetry/opentelemetry-java Jun 3, 2019
@SergeyKanzhelev SergeyKanzhelev added the area:api Cross language API specification issue label Jun 3, 2019
@SergeyKanzhelev SergeyKanzhelev added this to the API revision: 07-2019 milestone Jun 3, 2019
@rochdev
Copy link
Member

rochdev commented Jun 3, 2019

In Node this concept is very important and is core to how context propagation works in the runtime itself. For this purpose specifically, we call them execution and trigger. In Node you can only have one of each because it's limited to function calls, but from a tracing perspective it makes sense that you could follow from multiple different operations.

Let me give an example specific to Node at the language level:

const promise = new Promise((resolve, reject) => {
  resolve() // execution ID here it 2
})

// execution ID here is 1
promise.then(() => {
  // here, execution ID is 1 and trigger ID is 2
})

The reasoning for the above is that resolve() is what triggered the execution of the callback, but the callback was actually registered when then() was called, so that's its execution parent.

For a case like this, we could then say that the callback is running as a ChildOf the context where then() was called, and FollowsFrom the context where resolve() was called.

It's possible I got this completely wrong and that ChildOf/FollowsFrom has nothing to do with the relationship described above. I think the best person to explain the real meaning of these is probably @tedsuo.

In general, I think the different wordings proposed in this thread make sense, but they don't necessarily map 1:1 with each other.

@yurishkuro
Copy link
Member

The problem here is slightly bigger than what to call these. Kudos @rochdev for thinking OpenTracing got it right, but it didn't, not quite (cf. this blog post). The most fundamental question in analyzing the graph of events is the Lamport's happens-before relationship. In the OpenTracing span model the following holds:

parent.start  happens-before child.start

That's it! Neither child-of nor follows-from imply any further causality. Child-of only means parent depends on the outcome of child, in some way. It doesn't mean the parent is blocked - it can be doing other things (thus sync/async naming isn't quite right). It doesn't mean child completes before parent - it looks this way, but parent (RPC caller) may timeout before child (RPC server). In case of such a timeout, OpenTracing does not have a convention on how parent should record that fact (sad face).

The difference between child-of and follows-from is useful, in practice, for calculating critical path, but strictly speaking that calculation is not possible since the causality is not captured between the ends of spans, so critical path can only be calculated via a heuristic (I would love to be disproven on this!).

Another odd thing about child-of and follows-from is that it's the child span that defines this reference type, even though it talks about parent's dependency on child outcome. If you're a remote server, how do you even know if parent/caller does or does not depend on your outcome? I tend to think of this as the nature of the protocol: producer of a message to Kafka does not respect any response, so the receiver should use follows-from. Sender of HTTP request does expect a response, so the server always uses child-of, even of the sender doesn't care about the outcome - in that case it can internally create a follows-from span first, and then a normal pair of RPC call spans. So it's possible to rationalize this way, but it's still kind of dirty.

Of course, there's always the argument that OpenTelemetry 1.0 is not supposed to improve upon OpenTracing/OpenCensus (convergence is more important than improvements), in which case it doesn't matter much what we call these, because the model would need to be revisited anyway.

@rochdev
Copy link
Member

rochdev commented Jun 4, 2019

Thanks for the clarification @yurishkuro! It sounds like this is a larger discussion then. Would it make more sense to wait then instead of implementing this knowing that it's not necessarily the correct way to handle this relationship?

Of course, this depends on whether users are currently depending on the feature. If that's the case, then I think we should get more information about exactly how it's used which would give us a better understanding of how the currently used feature should be called.

@iredelmeier iredelmeier added spec:context Related to the specification/context directory area:sampling Related to trace sampling labels Jul 30, 2019
@iredelmeier iredelmeier added the area:span-relationships Related to span relationships label Aug 13, 2019
@AlexanderWert
Copy link
Member

I think a flag for sync / async (blocking / non-blocking) spans would be very useful for trace analysis. In this way you could much easier and quicker identify hot spots along the critical path and, thus, the "root causes" for long trace timings.
Such a flag is important because it indicates whether a child span's time is included in the parents time or not. Thus, whether a parent span is slow because of the child span or independently of the child span. Without such a flag, this question cannot be answered.

I agree that in many cases only the caller knows whether a child span is blocking or not, but, in such cases this information could be propagated with the context accordingly, so this information could be added at the child span's side.

@Oberon00
Copy link
Member

Oberon00 commented Sep 9, 2019

To fully capture all (or at least more) of the possible relationships between Spans, in addition to the create(parent) API, we would need APIs that signal that a span begins/ends waiting for a particular (set of) child(ren) and possibly even that it consumes the result of a particular child. Of course we would need to identify a child on the parent side without the child communicating back it's span ID, which is a whole other problem (but can probably be solved elegantly by introducing IDs not only for the nodes but also the edges in the span graph). Heck, you can even wait synchronously (occupying a thread) or asynchronously (by using something like async/await where other operations can be scheduled while a different one waits for I/O).

To show some difficult cases (pseudo C#) :

// Start async (the child has no idea whether we used a blocking API or an async
// one -- nor should it have to know).
var request = myClient.GetAsync("http://example.com/myAPI"); 
var myCalcResult = /* some expensive calculation */; // Could be its own subspan

// Block for child request, but not indefinitely
var maybeResult = await request.withTimeout(500);
if (maybeResult.HasValue) {
  renderCompletion(myCalcResult, maybeResult.Value); // Consume result
} else {
  renderPleaseWait();

  // Another operation/span will consume the result (the handle-response part of the
  // client could become a child of the server Span, or it could be an independent
  // root span with a CONSUMES relation to the server span).
  delayedRequests.AddPending(new PendingInfo(myCalcResult, request)); 
}

@SergeyKanzhelev SergeyKanzhelev removed this from the API revision: 07-2019 milestone Sep 27, 2019
@SergeyKanzhelev
Copy link
Member

Moving to v0.3

@jmacd
Copy link
Contributor

jmacd commented Jan 22, 2020

Closing this as being accomplished through Links in the current spec.

@jmacd jmacd closed this as completed Jan 22, 2020
@yurishkuro
Copy link
Member

Links don't address the issue in this ticket. I suggest to close #86 instead, because there's more discussions here.

@carlosalberto
Copy link
Contributor

After giving this a try via #906, we decided to postpone it in order to develop it properly. Re-labeling it so we can add this feature after GA.

@dotNetDR
Copy link

any update on this?

@h1z3y3
Copy link

h1z3y3 commented Apr 25, 2022

+1

@austinlparker
Copy link
Member

Solved via span links.

carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 21, 2024
…pen-telemetry#65)

OTLP proposal originally used a separate ResultCode enumeration for server
to tell the client whether the failed request can be retried or no.

After discussion here open-telemetry/opentelemetry-proto#47 (comment)
it became clear that the goal can be achieved using gRPC status codes without
a need for custom enumeration.

This change removes the ResultCode and explains how to use gRPC status codes.

Co-authored-by: Yuri Shkuro <[email protected]>
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 23, 2024
…pen-telemetry#65)

OTLP proposal originally used a separate ResultCode enumeration for server
to tell the client whether the failed request can be retried or no.

After discussion here open-telemetry/opentelemetry-proto#47 (comment)
it became clear that the goal can be achieved using gRPC status codes without
a need for custom enumeration.

This change removes the ResultCode and explains how to use gRPC status codes.

Co-authored-by: Yuri Shkuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue area:sampling Related to trace sampling area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA spec:protocol Related to the specification/protocol directory spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.