Questions about span relationships in messaging semantic convention #1085

iNikem · 2020-10-11T06:22:09Z

Problem description

I noticed an inconsistency in how messaging semantic conventions describe parent-child relationships between producers and consumers.

The first example says that consumers should have the message producer as their parent. But the second example says that

the propagated trace and span IDs are not known when the receiving span is started

and thus we don't have a parent-child relation between consumer spans and producers. Also, there is no indication of what parent the "Receive" span should have.

And then the third example contradicts it again and tells receiving spans to have message producers as their parent. Why can't the second example have parents when this one can? And how shall backends handle a convention that essentially says "everything can happen"?

Analysis

Problem with old remote parent

First I would like to start a discussion about having the message producer as the parent span for message consumers in the async case. Although it certainly makes sense in the majority of streaming or event-based platforms, I wonder if it makes sense in situations where we can reasonably expect to process old messages, such as message reprocessing in event-sourced systems or lambda (or kappa) architectures. If there is a gap of several days (or weeks) between the consumer span and the producer span, does it still make sense to have the producer span as a parent?

Pull vs push consumption

I see two different ways to consume messages. Push-based consumption uses some kind of listener mechanism and focuses on message processing. Message retrieval is done by framework and in general of no interest to us. Pull-based consumption has an explicit step of "requesting" or "receiving" the message and then processing it.

Despite the problem with old remote parents, I think that in push-based systems the only sensible relationship between message producing and its processing is the parent-child one. Message processing is done in the context of some retrieve/process cycle of the framework which we don't want to see as the parent.

In pull-based system the question "what span should be the parent of whom" is not that clear to me. Somewhere in my application's code there is something similar to the following pseudo-code:

var messages = queue.pollNext()
for m in messages {
  process(m)
}

I want to see a span for pollNext execution, because it may block while waiting for messages. This span, following semantic convention, will be "receive" span. Depending on API, pollNext can always return a single message or it may return a collection as in the example above. In the latter case it cannot have remote span as the parent. In the former it may. Convention's argument "the propagated trace and span IDs are not known when the receiving span is started" may be relevant or not, because we can delay starting the span until message is received and remote parent is known. If we don't use remote span as a parent for any reason, should we use currently active span (at the moment of pollNext invocation) as a parent? Probably yes?

Then we have zero, one or more process method executions. Each one of them should result in a "process" span. What span should be the parent of those? I argue that pollNext cannot be their parent, because it has already ended (but convention tells us to do exactly this). Should they have remote producer span as a parent? Should they have currently active span as a parent? Although former case seems sensible, it will result in a big chunk of unaccounted time in the currently active span. Or is it enough to link from process spans to the current span?
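To make the "unaccounted time" concern concrete, here is a rough Python sketch. It is a toy model, not the OpenTelemetry API: the `Span` class, the `unaccounted` helper, the span names and all timings are made up for illustration, and child spans are assumed not to overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    start: float                   # seconds, arbitrary clock
    end: float
    parent: Optional[str] = None   # name of the parent span (toy identifier)


def unaccounted(parent: Span, spans: List[Span]) -> float:
    """Parent duration minus the summed duration of its direct children."""
    covered = sum(s.end - s.start for s in spans if s.parent == parent.name)
    return (parent.end - parent.start) - covered


# Hypothetical local span that is active while the poll loop runs.
active = Span("consumer loop", 0.0, 10.0)
receive = Span("receive", 0.0, 1.0, parent="consumer loop")

# Option 1: "process" spans are children of the remote producer span ("send").
remote_parented = [
    receive,
    Span("process", 1.0, 5.0, parent="send"),
    Span("process", 5.0, 9.0, parent="send"),
]

# Option 2: "process" spans are children of the currently active span.
locally_parented = [
    receive,
    Span("process", 1.0, 5.0, parent="consumer loop"),
    Span("process", 5.0, 9.0, parent="consumer loop"),
]

gap_remote = unaccounted(active, remote_parented)
gap_local = unaccounted(active, locally_parented)
```

With these invented timings, only the "receive" second is explained by children of the active span in option 1, while option 2 also attributes the processing time to it.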

Proposal

Spans corresponding to "receive" operation SHOULD always use implicitly selected currently active span as a parent.

Spans corresponding to "process" operation of a single message SHOULD use the producer span as a parent. In case when "process" span corresponds to the processing of several messages at once (batch processing), producer spans SHOULD NOT be used as parents and implicitly selected parent MAY be used.

If possible "process" spans SHOULD link to the corresponding "receive" span(s).
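The three rules above could be sketched roughly as follows. This is only a toy model in Python, not the OpenTelemetry API: `Span`, `receive_span` and `process_span` are hypothetical names, and using the active span as the batch parent is just one reading of "implicitly selected parent MAY be used".

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_ids = count(1)  # toy span-id generator


@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    links: List["Span"] = field(default_factory=list)
    span_id: int = field(default_factory=lambda: next(_ids))


def receive_span(active: Optional[Span]) -> Span:
    # Rule 1: "receive" uses the currently active span as its parent.
    return Span("receive", parent=active)


def process_span(producers: List[Span], receive: Span,
                 active: Optional[Span]) -> Span:
    if len(producers) == 1:
        # Rule 2a: a single message is parented to its producer span.
        parent = producers[0]
    else:
        # Rule 2b: a batch is not parented to any producer; the
        # implicitly selected (active) span is used instead.
        parent = active
    # Rule 3: link "process" back to the corresponding "receive" span.
    return Span("process", parent=parent, links=[receive])


active = Span("poll loop")
producer = Span("send")
recv = receive_span(active)
single = process_span([producer], recv, active)
batch = process_span([Span("send"), Span("send")], recv, active)
```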

arminru · 2020-10-12T15:31:02Z

@iNikem This is related to issue #958. @anuraaga please have a look here.

And then the third example contradicts it again and tells receiving spans to have message producers as their parent. Why can't the second example have parents when this one can?

I don't think the spec contradicts itself here in the examples. It simply states that in the batch receive scenario laid out it does not have access to the parent span context propagated along with the message in order to make it a parent of that span. In the batch processing but individual receive scenario, you might have access to it and if so, you can use it as the parent. Same for the first example with individual processing without separate receive spans.

And how shall backends handle a convention that essentially says "everything can happen"?

Backends will still be able to determine the kind of relationship between the parents by looking at their span kinds and the messaging operation attribute, if any.

Problem with old remote parent

I don't see an issue with that. If a back end can still correlate it with the initial producing span, it should be fine, but it might cause issues in the UI when it tries to display the entire trace at once. If the back end does not store traces long enough or does not look back in time far enough, it will not know that parent and will treat the span accordingly. Would there be any alternative?

If we don't use remote span as a parent for any reason, should we use currently active span (at the moment of pollNext invocation) as a parent? Probably yes?

I'd say yes.

I argue that pollNext cannot be their parent, because it has already ended (but convention tells us to do exactly this).

Child spans can be created after their parent has already ended - this is supported by our data model and API. A message consumer span can also be the child of a message producer span, for example, despite the producer span likely being ended already at the time the message is consumed.
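As a rough illustration of why this works: a parent-child relation is just a reference to the parent's span ID, so nothing requires the parent to still be running. The Python below is a toy model with made-up names and timestamps, not the OpenTelemetry API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    span_id: int
    start: float                     # seconds, arbitrary clock
    end: float
    parent_id: Optional[int] = None  # reference to the parent, by ID only


# The producer span starts and ends when the message is sent.
producer = Span("send", span_id=1, start=0.0, end=0.5)

# A day later the message is consumed. The consumer span still names the
# producer as its parent, using the span ID propagated with the message.
consumer = Span("process", span_id=2, start=86400.0, end=86400.2,
                parent_id=producer.span_id)

starts_after_parent_ended = consumer.start > producer.end
```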

kenfinnigan · 2020-10-12T16:01:58Z

Apologies if this is not related, but it touches on concerns I have had around messaging systems like Kafka and how a span should represent the work.

From my understanding of spans in such a scenario, the link between a PRODUCER and CONSUMER is more a "Follows From" and not "Parent-Child" relationship. Whatever action in a PRODUCER causes a message to be sent can be more of an indirect consequence and not a direct one. For me, "Parent-Child" should only apply when there is a direct relationship.

For instance, if a message is produced to indicate 400 customers have gone overdrawn on their bank account in a day, it's not a direct cause of the 400th customer that went overdrawn, it's an indirect one.

Oberon00 · 2020-10-12T16:24:41Z

The follows-from / child-of distinction is not really supported by OpenTelemetry currently. See #65 and #826.

iNikem · 2020-10-12T17:10:00Z

Child spans can be created after their parent has already ended - this is supported by our data model and API. A message consumer span can also be the child of a message producer span, for example, despite the producer span likely being ended already at the time the message is consumed.

True. Should semantic convention guide us which of two options to choose: pollNext as parent or producer as parent? It may be very important for auto-instrumentations.

kenfinnigan · 2020-10-12T17:19:27Z

That's unfortunate @Oberon00, as I believe it helps resolve the ambiguity about what is and isn't a parent span

justinfoote · 2020-10-16T00:06:45Z

I agree with @arminru; I don't see a contradiction here, though I think there's not enough clarity about the scenarios and use cases being covered.

Here are my thoughts:

Spans corresponding to "receive" operation SHOULD always use implicitly selected currently active span as a parent.

I'm with you so far. Especially if we're expecting that a "receive" operation pulls multiple messages from the queue, but does not include processing time.

Spans corresponding to "process" operation of a single message SHOULD use the producer span as a parent.

Probably... But in your example, I think the currently active span might make more sense than the producer. There are some use cases that would benefit from producer-parenting, and other use cases that would be clearer with active-span-parenting.

In case when "process" span corresponds to the processing of several messages at once (batch processing), producer spans SHOULD NOT be used as parents and implicitly selected parent MAY be used.

This may be the right way to distinguish the use cases I mentioned just above, but I'd feel more comfortable if we listed those use cases out explicitly.
And parenting a multi-message process operation to a single producer is problematic.
But perhaps we also should include semantic conventions for a link from this multi-message process operation to all of the producers for those messages.

If possible "process" spans SHOULD link to the corresponding "receive" span(s).

This sounds good in the case that the "process" span does not already parent to the receive span.

The remaining questions (That I see. Have I missed some?):

Should semantic convention guide us which of two options to choose: pollNext as parent or producer as parent?
- I say yes, with a caveat that there won't be a single answer here that covers all use cases. Sometimes the producer should be the parent, sometimes a locally executing span.
It's not explicitly asked, but should the semantic conventions suggest a way of indicating a follows-from relationship?
- I say yes to this too. This could be links with some semantic conventions around their attributes.

Finally

I'd like to add some nuance to the proposal, but some of the detail I think is pending a more complete analysis of use cases.

Define "receive" operation to be pull-based receiving of multiple messages from the queue, without processing time.
For "process" spans that parent to producers, specify semantic conventions for a link to the currently active span.
For "process" spans that parent to the currently active span, specify semantic conventions for a link to the producer span.
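A rough sketch of the last two points, in Python. It is a toy model rather than the OpenTelemetry API, and `Span`, `process_span` and the `parent_is_producer` flag are hypothetical names. The idea is that whichever span is not chosen as the parent is attached as a link, so both relationships stay recoverable.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    links: List["Span"] = field(default_factory=list)


def process_span(producer: Span, active: Span,
                 parent_is_producer: bool) -> Span:
    if parent_is_producer:
        # Parent to the producer, link to the currently active span.
        return Span("process", parent=producer, links=[active])
    # Parent to the currently active span, link to the producer.
    return Span("process", parent=active, links=[producer])


def related(span: Span) -> set:
    # Names of every span reachable via the parent or a link.
    spans = [span.parent, *span.links]
    return {s.name for s in spans if s is not None}


producer = Span("send")
active = Span("poll loop")
by_producer = process_span(producer, active, parent_is_producer=True)
by_active = process_span(producer, active, parent_is_producer=False)
```

Either way the "process" span ends up connected to the same two spans; only the role of each (parent or link) differs.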

kenfinnigan · 2021-07-06T19:57:43Z

I've recently been verifying various messaging scenarios with Kafka and Quarkus, and I must say I'm a bit confused about what is the spec'd approach for how this should look.

I have a system consisting of three services: Ticker -> Processor -> Viewer. Between each service, a message is added to a Kafka topic, but it could equally be any message broker.

The framework creates "send" and "receive" spans when producing and consuming a message respectively. The Processor service creates a Span, named "Process ticks" while processing the message.

Currently, it gives a trace that looks like:

Which I think is correct based on the current spec, but visually looks odd as it implies to me that "receive" on the Processor is separate from the subsequent "send", which is not the case. Without the "receive", there would be no "send".

Which made me think it should actually look like this:

Or am I misinterpreting the spec and it should look like the latter trace?

Hopefully, this is relevant to the discussion at hand.

trask · 2021-07-06T20:39:57Z

hey @kenfinnigan, looking at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md#apache-kafka-example and I'm confused also.

I would expect "Span Prod2" in that example to parent "Span Proc1" (and not have any links).

And I would expect the desired picture for your example above to be:

yurishkuro · 2021-07-07T01:17:47Z

Sounds like a bug in the instrumentation.

iNikem added the spec:trace Related to the specification/trace directory label Oct 11, 2020

iNikem mentioned this issue Oct 11, 2020

Messaging convention reviewed open-telemetry/opentelemetry-java-instrumentation#1297

Merged

arminru added the area:semantic-conventions Related to semantic conventions label Oct 12, 2020

Oberon00 added the area:span-relationships Related to span relationships label Oct 12, 2020

andrewhsu added priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 13, 2020

andrewhsu assigned justinfoote Oct 13, 2020

andrewhsu added release:after-ga Not required before GA release, and not going to work on before GA and removed priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 13, 2020

iNikem mentioned this issue Nov 25, 2020

Provision to add current span as links to kafka consumer span open-telemetry/opentelemetry-java-instrumentation#1756

Closed

iNikem mentioned this issue Jan 21, 2021

Kafka Polling - Issue with instrumentation open-telemetry/opentelemetry-java-instrumentation#1890

Closed

trask mentioned this issue Mar 19, 2021

Jaeger Showing 2 Traces With Same MessageId open-telemetry/opentelemetry-java-instrumentation#2598

Closed

kenfinnigan mentioned this issue Jul 7, 2021

Update Kafka messaging example #1799

Closed

mateuszrzeszutek mentioned this issue Aug 26, 2021

Instrument spring-kafka batch message listeners open-telemetry/opentelemetry-java-instrumentation#3922

Merged

tedsuo added the semconv:messaging label Jan 12, 2022

pyohannes moved this to V1 - Stable Semantics in Spec: Messaging Semantics Mar 29, 2022

pyohannes added this to Spec: Messaging Semantics Mar 29, 2022

pyohannes mentioned this issue Mar 10, 2023

Span structure for messaging scenarios open-telemetry/oteps#220

Merged

pyohannes assigned pyohannes and unassigned justinfoote Mar 16, 2023

carlosalberto closed this as completed in open-telemetry/oteps#220 Jun 29, 2023
