
Questions about span relationships in messaging semantic convention #1085

Closed
iNikem opened this issue Oct 11, 2020 · 9 comments · Fixed by open-telemetry/oteps#220
Labels
area:semantic-conventions Related to semantic conventions area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA semconv:messaging spec:trace Related to the specification/trace directory

Comments

@iNikem
Contributor

iNikem commented Oct 11, 2020

Problem description

I noticed an inconsistency in how messaging semantic conventions describe parent-child relationships between producers and consumers.

The first example says that consumers should have the message producer as their parent. But the second example says that

> the propagated trace and span IDs are not known when the receiving span is started

and thus there is no parent-child relation between consumer spans and producers. There is also no indication of what parent the "Receive" span should have.

The third example then contradicts this again and tells receiving spans to have message producers as their parent. Why can't the second example have parents when this one can? And how should backends handle a convention that essentially says "everything can happen"?

Analysis

Problem with old remote parent

First, I would like to start a discussion about using the message producer as the parent span for message consumers in the async case. Although this certainly makes sense in the majority of streaming or event-based platforms, I wonder whether it makes sense in situations where we can reasonably expect to process old messages, such as message reprocessing in event-sourced systems or lambda (or kappa) architectures. If there is a gap of several days (or weeks) between the consumer span and the producer span, does it still make sense to have the producer span as a parent?

Pull vs push consumption

I see two different ways to consume messages. Push-based consumption uses some kind of listener mechanism and focuses on message processing; message retrieval is done by the framework and is generally of no interest to us. Pull-based consumption has an explicit step of "requesting" or "receiving" the message before processing it.

Despite the problem with old remote parents, I think that in push-based systems the only sensible relationship between message producing and its processing is the parent-child one. Message processing is done in the context of some retrieve/process cycle of the framework which we don't want to see as the parent.

In a pull-based system, the question of which span should be the parent of which is not that clear to me. Somewhere in my application's code there is something similar to the following pseudo-code:

var messages = queue.pollNext()
for m in messages {
  process(m)
}

I want to see a span for the pollNext execution, because it may block while waiting for messages. Following the semantic convention, this will be the "receive" span. Depending on the API, pollNext may always return a single message, or it may return a collection as in the example above. In the latter case the span cannot have a remote span as its parent; in the former it may. The convention's argument that "the propagated trace and span IDs are not known when the receiving span is started" may or may not apply, because we can delay starting the span until the message is received and the remote parent is known. If we don't use remote span as a parent for any reason, should we use currently active span (at the moment of pollNext invocation) as a parent? Probably yes?

Then we have zero, one, or more executions of the process method, each of which should result in a "process" span. What span should be their parent? I argue that pollNext cannot be their parent, because it has already ended (but convention tells us to do exactly this). Should they have the remote producer span as a parent, or the currently active span? Although the former seems sensible, it leaves a big chunk of unaccounted time in the currently active span. Or is it enough to link from the process spans to the current span?

Proposal

Spans corresponding to "receive" operation SHOULD always use implicitly selected currently active span as a parent.

Spans corresponding to "process" operation of a single message SHOULD use the producer span as a parent. In case when "process" span corresponds to the processing of several messages at once (batch processing), producer spans SHOULD NOT be used as parents and implicitly selected parent MAY be used.

If possible "process" spans SHOULD link to the corresponding "receive" span(s).
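
The three rules of the proposal can be sketched in a few lines. This is only a toy model under stated assumptions, not the OpenTelemetry API: the `Span` class and the `receive_span` and `process_span` helpers are hypothetical names invented for illustration, and a span is reduced to a name, one optional parent, and a list of links.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_ids = count(1)  # toy span-ID generator (hypothetical)


@dataclass
class Span:
    """Toy span: a name, at most one parent, and zero or more links."""
    name: str
    parent: Optional["Span"] = None
    links: List["Span"] = field(default_factory=list)
    span_id: int = field(default_factory=lambda: next(_ids))


def receive_span(active: Optional[Span]) -> Span:
    # Rule 1: "receive" always uses the currently active span as parent.
    return Span("receive", parent=active)


def process_span(producers: List[Span], receive: Span,
                 active: Optional[Span]) -> Span:
    # Rule 2: a single message is parented to its producer span; a batch
    # is not parented to any producer and falls back to the active span.
    parent = producers[0] if len(producers) == 1 else active
    # Rule 3: "process" links to the corresponding "receive" span.
    return Span("process", parent=parent, links=[receive])


active = Span("poll loop")
recv = receive_span(active)
single = process_span([Span("send A")], recv, active)
batch = process_span([Span("send A"), Span("send B")], recv, active)
print(single.parent.name, batch.parent.name)
```

In this model the batch case drops the producer contexts entirely; whether they should be kept as links is one of the points raised later in the thread.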

@arminru
Member

arminru commented Oct 12, 2020

@iNikem This is related to issue #958. @anuraaga please have a look here.

> The third example then contradicts this again and tells receiving spans to have message producers as their parent. Why can't the second example have parents when this one can?

I don't think the spec contradicts itself in the examples. It simply states that, in the batch receive scenario laid out, the receiving span does not have access to the parent span context propagated along with the message and so cannot use it as its parent. In the scenario with batch processing but individual receives, you might have access to it and, if so, you can use it as the parent. The same goes for the first example, with individual processing and no separate receive spans.

> And how should backends handle a convention that essentially says "everything can happen"?

Backends will still be able to determine the kind of relationship between the parents by looking at their span kinds and the messaging operation attribute, if any.
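
As a rough illustration of that point, a backend could classify a parent/child pair from the two span kinds and the child's messaging operation. The function below is a hypothetical sketch, not part of any real backend: the function name and the returned labels are assumptions made for this example, and it assumes the operation attribute carries the values "receive" or "process".

```python
from typing import Optional


def messaging_relationship(parent_kind: str, child_kind: str,
                           child_operation: Optional[str]) -> str:
    """Classify a parent/child span pair (labels are illustrative only)."""
    if parent_kind == "PRODUCER" and child_kind == "CONSUMER":
        # The consumer span is parented directly to the remote producer.
        if child_operation == "process":
            return "producer -> process (asynchronous hand-off)"
        if child_operation == "receive":
            return "producer -> receive (asynchronous hand-off)"
        return "producer -> consumer (operation not set)"
    if parent_kind == "CONSUMER" and child_operation == "process":
        # Batch-receive scenario: processing hangs off the receive span.
        return "receive -> process (same consumer)"
    return "not a messaging relationship this sketch recognises"


print(messaging_relationship("PRODUCER", "CONSUMER", "process"))
print(messaging_relationship("CONSUMER", "CONSUMER", "process"))
```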

> Problem with old remote parent

I don't see an issue with that. If a back end can still correlate it with the initial producing span, it should be fine, though it might cause issues in the UI when it tries to display the entire trace at once. If the back end does not store traces long enough or does not look back far enough in time, it will not know that parent and will treat the span accordingly. Would there be any alternative?

> If we don't use remote span as a parent for any reason, should we use currently active span (at the moment of pollNext invocation) as a parent? Probably yes?

I'd say yes.

> I argue that pollNext cannot be their parent, because it has already ended (but convention tells us to do exactly this).

Child spans can be created after their parent has already ended - this is supported by our data model and API. A message consumer span can also be the child of a message producer span, for example, despite the producer span likely being ended already at the time the message is consumed.
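
A small toy model may make this concrete. It assumes, for illustration only, that a span refers to its parent purely by ID; the `SpanRecord` class and its fields are hypothetical and are not the OpenTelemetry data model itself. Under that assumption, nothing requires the parent to still be running when the child starts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Toy span record: the parent is referenced only by its ID."""
    name: str
    span_id: int
    parent_id: Optional[int]
    start: float
    end: Optional[float] = None


# The producer span starts and ends when the message is sent.
producer = SpanRecord("send", span_id=1, parent_id=None, start=0.0)
producer.end = 1.0

# Much later, the consumer starts a span that names the producer as its
# parent. Only the ID is needed, so the ended parent is not a problem.
consumer = SpanRecord("process", span_id=2,
                      parent_id=producer.span_id, start=5.0)

print(consumer.parent_id == producer.span_id, consumer.start > producer.end)
```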

@arminru arminru added the area:semantic-conventions Related to semantic conventions label Oct 12, 2020
@Oberon00 Oberon00 added the area:span-relationships Related to span relationships label Oct 12, 2020
@kenfinnigan
Member

Apologies if this isn't relevant, but it gets at concerns I have had around messaging systems like Kafka and how a span should represent the work.

From my understanding of spans in such a scenario, the link between a PRODUCER and CONSUMER is more a "Follows From" and not "Parent-Child" relationship. Whatever action in a PRODUCER causes a message to be sent can be more of an indirect consequence and not a direct one. For me, "Parent-Child" should only apply when there is a direct relationship.

For instance, if a message is produced to indicate 400 customers have gone overdrawn on their bank account in a day, it's not a direct cause of the 400th customer that went overdrawn, it's an indirect one.

@Oberon00
Member

The follows-from / child-of distinction is not really supported by OpenTelemetry currently. See #65 and #826.

@iNikem
Contributor Author

iNikem commented Oct 12, 2020

> Child spans can be created after their parent has already ended - this is supported by our data model and API. A message consumer span can also be the child of a message producer span, for example, despite the producer span likely being ended already at the time the message is consumed.

True. Should semantic convention guide us which of two options to choose: pollNext as parent or producer as parent? It may be very important for auto-instrumentations.

@kenfinnigan
Member

That's unfortunate @Oberon00, as I believe it helps resolve the ambiguity about what is and isn't a parent span

@andrewhsu andrewhsu added priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 13, 2020
@andrewhsu andrewhsu added release:after-ga Not required before GA release, and not going to work on before GA and removed priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 13, 2020
@justinfoote
Member

I agree with @arminru; I don't see a contradiction here, though I think there's not enough clarity about the scenarios and use cases being covered.

Here are my thoughts:

> Spans corresponding to "receive" operation SHOULD always use implicitly selected currently active span as a parent.

I'm with you so far, especially if we're expecting that a "receive" operation pulls multiple messages from the queue but does not include processing time.

> Spans corresponding to "process" operation of a single message SHOULD use the producer span as a parent.

Probably... But in your example, I think the currently active span might make more sense than the producer. Some use cases would benefit from producer-parenting, while others would be clearer with active-span-parenting.

> In case when "process" span corresponds to the processing of several messages at once (batch processing), producer spans SHOULD NOT be used as parents and implicitly selected parent MAY be used.

This may be the right way to distinguish the use cases I mentioned just above, but I'd feel more comfortable if we listed those use cases out explicitly.
And parenting a multi-message process operation to a single producer is problematic.
But perhaps we also should include semantic conventions for a link from this multi-message process operation to all of the producers for those messages.

> If possible "process" spans SHOULD link to the corresponding "receive" span(s).

This sounds good in the case that the "process" span does not already parent to the receive span.

The remaining questions (That I see. Have I missed some?):

  • Should semantic convention guide us which of two options to choose: pollNext as parent or producer as parent?

    • I say yes, with a caveat that there won't be a single answer here that covers all use cases. Sometimes the producer should be the parent, sometimes a locally executing span.
  • It's not explicitly asked, but should the semantic conventions suggest a way of indicating a follows-from relationship?

    • I say yes to this too. This could be links with some semantic conventions around their attributes.

Finally

I'd like to add some nuance to the proposal, but some of the detail I think is pending a more complete analysis of use cases.

  1. Define "receive" operation to be pull-based receiving of multiple messages from the queue, without processing time.

  2. For "process" spans that parent to producers, specify semantic conventions for a link to the currently active span.

  3. For "process" spans that parent to the currently active span, specify semantic conventions for a link to the producer span.
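
Points 2 and 3 are mirror images of each other, which a short sketch can show. The `ProcessSpan` class, the `make_process_span` helper, and the `parent_to_producer` flag are hypothetical names made up for this example; spans are represented by plain strings to keep it minimal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessSpan:
    """Toy "process" span: one parent plus a list of linked spans."""
    parent: str
    links: List[str] = field(default_factory=list)


def make_process_span(producer: str, active: str,
                      parent_to_producer: bool) -> ProcessSpan:
    if parent_to_producer:
        # Point 2: parent is the producer, link to the active span.
        return ProcessSpan(parent=producer, links=[active])
    # Point 3: parent is the active span, link to the producer.
    return ProcessSpan(parent=active, links=[producer])


a = make_process_span("send", "poll loop", parent_to_producer=True)
b = make_process_span("send", "poll loop", parent_to_producer=False)
print(a)
print(b)
```

Either way the span ends up referring to both the producer and the locally active span, so the choice only affects which of the two is treated as the parent.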

@kenfinnigan
Member

I've recently been verifying various messaging scenarios with Kafka and Quarkus, and I must say I'm a bit confused about what is the spec'd approach for how this should look.

I have a system consisting of three services: Ticker -> Processor -> Viewer. Between each service, a message is added to a Kafka topic, but it could equally be any message broker.

The framework creates "send" and "receive" spans when producing and consuming a message respectively. The Processor service creates a Span, named "Process ticks" while processing the message.

Currently, it gives a trace that looks like:
[Image: Screen Shot 2021-07-06 at 2 45 17 PM]

Which I think is correct based on the current spec, but visually looks odd as it implies to me that "receive" on the Processor is separate from the subsequent "send", which is not the case. Without the "receive", there would be no "send".

Which made me think it should actually look like this:
[Image: Screen Shot 2021-07-06 at 3 10 40 PM]

Or am I misinterpreting the spec and it should look like the latter trace?

Hopefully, this is relevant to the discussion at hand.

@trask
Member

trask commented Jul 6, 2021

Hey @kenfinnigan, I'm looking at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md#apache-kafka-example and I'm confused too.

I would expect "Span Prod2" in that example to parent "Span Proc1" (and not have any links).

And I would expect the desired picture for your example above to be:

[Image]

@yurishkuro
Member

Sounds like a bug in the instrumentation.

Projects
Status: V1 - Stable Semantics

10 participants