Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry active message consumption spans are incorrectly classified as transactions #7001

Open
axw opened this issue Jan 6, 2022 · 7 comments

Comments

@axw
Copy link
Member

axw commented Jan 6, 2022

APM Server converts OpenTelemetry spans to either transactions or spans, depending on several criteria:

  • root OTel spans become Elastic APM transactions
  • server and consumer OTel spans become Elastic APM transactions
  • all other OTel spans become Elastic APM spans

The issue with converting all consumer OTel spans to Elastic APM transactions is that not all consumer spans represent service entrypoints. For example, a message may be actively consumed (e.g. with active polling) within another OTel span. In this case, the outer span should become an Elastic APM transaction, and the message consumption span should become an Elastic APM span.

We require additional information to be recorded in OTLP for this: see open-telemetry/oteps#182

@ymotongpoo
Copy link

ymotongpoo commented Nov 28, 2022

Is this still the case? I tried this with a small snippet on v8.5.1 and all traces are recorded as transactions, not traces.
Untitled

image

@axw
Copy link
Member Author

axw commented Nov 29, 2022

Hi @ymotongpoo! Yes, this is still the case.

I tried this with a small snippet on v8.5.1 and all traces are recorded as transactions, not traces.

Can you share the snippet? I can't tell from the screenshot which spans should or should not be considered transactions.

The code relevant to this bug is:

// Message consumption results in either a transaction or a span based
// on whether the consumption is active or passive. Otel spans
// currently do not have the metadata to make this distinction. For
// now, we assume that the majority of consumption is passive, and
// therefore start a transaction whenever span kind == consumer.
name := otelSpan.Name()
spanID := otelSpan.SpanID().HexString()
event := baseEvent
initEventLabels(&event)
event.Timestamp = startTime.Add(timeDelta)
event.Trace.ID = otelSpan.TraceID().HexString()
event.Event.Duration = duration
event.Event.Outcome = spanStatusOutcome(otelSpan.Status())
event.Parent.ID = parentID
if root || otelSpan.Kind() == ptrace.SpanKindServer || otelSpan.Kind() == ptrace.SpanKindConsumer {
event.Processor = model.TransactionProcessor
event.Transaction = &model.Transaction{
ID: spanID,
Name: name,
Sampled: true,
}
TranslateTransaction(otelSpan.Attributes(), otelSpan.Status(), otelLibrary, &event)
} else {
event.Processor = model.SpanProcessor
event.Span = &model.Span{
ID: spanID,
Name: name,
}
TranslateSpan(otelSpan.Kind(), otelSpan.Attributes(), &event)
}

If/when the linked OTEP is accepted and implemented, then instead of checking if root || otelSpan.Kind() == ptrace.SpanKindServer || otelSpan.Kind() == ptrace.SpanKindConsumer { we could check if root || otelSpan.HasRemoteParent() {. Only in that case would we translate to an Elastic APM transaction.

@ymotongpoo
Copy link

ymotongpoo commented Nov 29, 2022

Hi @axw, thanks for the response! Here's the code snippet I used for the demo.

func handler(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	c := q.Get("c")
	n, err := strconv.Atoi(c)
	if err != nil {
		n = 2
	}
	if n > 8 {
		n = 8
	}
	ctx, span := otel.Tracer("handler").Start(r.Context(), "sample.handler.call.foo")
	foo(ctx, n)
	span.End()
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, "requested: %v, executed: %v", c, n)
}

func foo(ctx context.Context, n int) {
	if n == 0 {
		return
	}
	ctx, span := otel.Tracer("handler").Start(ctx, "sample.foo")
	span.SetAttributes(attribute.Key("iteration").Int(n))
	time.Sleep(interval)
	span.AddEvent(fmt.Sprintf("mid-interval-%v", n))
	time.Sleep(interval)
	foo(ctx, n-1)
	span.End()
}

I didn't used OpenTelemetry Collector and directly sent the traces with OTLP exporter from the app.

	// OTLP exporter config for Elastic Observability
	otlpexp, err := otlptracegrpc.New(
		context.Background(),
		otlptracegrpc.WithEndpoint(OTLPEndpoint),
		otlptracegrpc.WithHeaders(OTLPHeaders),
	)
	if err != nil {
		return nil, err
	}

Because this is a standalone demo running on Cloud Run and doesn't have other depending services, so the traces should be either of:

  • root
  • SpanKindServer
  • SpanKindClient

@axw
Copy link
Member Author

axw commented Nov 29, 2022

I see, thanks. This is not exactly the same issue as in the description, but would also be solved by the change I described.

For this case you have a workaround: set the sample.handler.call.foo span kind to SpanKindServer. IIANM, it will otherwise be SpanKindInternal.

@ymotongpoo
Copy link

Thank you for the workaround, @axw. However, neither of SpanKind worked and still all traces are recognized as transactions.

Just in case, I changed the SpanKind of sample.foo as well, but it didn't work. Here's the current code that is instrumented with the workaround.

func handler(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	c := q.Get("c")
	n, err := strconv.Atoi(c)
	if err != nil {
		n = 2
	}
	if n > 8 {
		n = 8
	}
	ctx, span := otel.Tracer("handler").Start(
		r.Context(),
		"sample.handler.call.foo",
		trace.WithSpanKind(trace.SpanKindInternal),
	)
	foo(ctx, n)
	span.End()
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, "requested: %v, executed: %v", c, n)
}

func foo(ctx context.Context, n int) {
	if n == 0 {
		return
	}
	ctx, span := otel.Tracer("handler").Start(
		ctx,
		"sample.foo",
		trace.WithSpanKind(trace.SpanKindInternal),
	)
	span.SetAttributes(attribute.Key("iteration").Int(n))
	time.Sleep(interval)
	span.AddEvent(fmt.Sprintf("mid-interval-%v", n))
	time.Sleep(interval)
	foo(ctx, n-1)
	span.End()
}

@axw
Copy link
Member Author

axw commented Nov 29, 2022

@ymotongpoo sorry, I missed something in your original comment and went off track a bit. I think what you're seeing here is related to Cloud Run injecting a traceparent header. First of all I'll try to clarify some things, and correct some comments I made above.

Transactions are a special kind of span that have additional attributes associated with them. They describe an event captured by an Elastic APM agent instrumenting a service. You can think of transactions as the highest level of work you’re measuring within a service. As an example, a transaction might be a:

  • Request to your server
  • ...

I guess you already know this, but just so we're on the same page:

https://opentelemetry.io/docs/reference/specification/trace/api/#spankind says:

SERVER Indicates that the span covers server-side handling of a synchronous RPC or other remote request. This span is often the child of a remote CLIENT span that was expected to wait for a response.
...
INTERNAL Default value. Indicates that the span represents an internal operation within an application, as opposed to an operations with remote parents or children.

I think I misunderstood your trace before, and thought that sample.handler.call.foo was meant to represent the incoming HTTP request handler. After looking at your screenshots again, I suppose that you are also using otlphttp to wrap your handler, and that is the sample.handler transaction? In that case, sample.handler.call.foo and sample.foo should have the internal span kind, and you don't need to explicitly use WithSpanKind -- sorry about that.

Going back to your initial comment, I overlooked something:

I tried this with a small snippet on v8.5.1 and all traces are recorded as transactions, not traces.

The Traces page is simply listing root transactions, by looking for any transactions without a parent ID. In your example we should expect that sample.handler is listed on the Traces page. Given that it doesn't show up, that implies sample.handler does have a parent ID. You could check that by clicking the sample.handler row and looking for a parent.id field under the metadata.

Assuming that's the case, I think you're hitting something similar to this issue: elastic/apm#609. i.e. I expect Cloud Run is injecting a traceparent header, and your instrumentation is continuing the trace with that trace context. It therefore doesn't think the server transaction is the root, and doesn't show up under Traces. You could confirm this by passing trace.WithNewRoot to the otlphttp instrumentation.

@ymotongpoo
Copy link

@axw Thank you for the detailed answer.

i.e. I expect Cloud Run is injecting a traceparent header, and your instrumentation is continuing the trace with that trace context.

I realized I missed the fact that Cloud Run injects the traceparent automatically.

You could confirm this by passing trace.WithNewRoot to the otlphttp instrumentation.

As you suggested, I tried creating span with trace.WithNewRoot() and now I see sample.handler in the list. (I also tried using trace.WithNewRoot() from sample.handler.call.foo and it was recognized in the trace list as well)

image

The snippet was:

	otelHandler := otelhttp.NewHandler(
		http.HandlerFunc(handler),
		"sample.handler",
		otelhttp.WithSpanOptions(trace.WithNewRoot()),
	)

Given above, as you pointed out, my case seems about elastic/apm#609.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants