Traceparent header overriden by an intermediate service #609

n-e · 2022-03-08T17:18:18Z

In the W3C TraceContext specification it's possible that a system is monitored by multiple tracing tools.

Example: A -> B -> C (Service A calls Service B which calls Service C). A and C are monitored by Elastic APM, B is monitored by another service.

Elastic APM is unable to link the transactions in A and C, since C's parent is a transaction in B (since the third-party service in B writes a traceparent header), which Elastic APM doesn't know about since B is monitored by a third-party service.

If we control B, the problem is easy to fix:

either monitor B with Elastic APM
or don't monitor it and forward the traceparent header

However, Google Cloud Run does its own monitoring, which doesn't look like it can be disabled (in this case A is a downstream service, B is the Cloud Run runtime, and C is the service hosted in Cloud Run).

As a workaround, I have forked the elastic-apm-node agent and made it use the elastic-apm-traceparent header by default instead of the traceparent header. However I have no idea what a good long-term solution would be. Any ideas?

Related: #286

The text was updated successfully, but these errors were encountered:

felixbarny · 2022-03-08T17:32:36Z

We're currently planning the implementation for what's been discussed in #286: #600

But I'm not sure that would cover the scenario you're describing. Are you saying that even when sending a request with a traceparent header to the Google Cloud Run runtime (B), it will create a new trace id when it calls service C?

basepi · 2022-03-08T18:08:57Z

This seems like a bug in Google Cloud Run. It should definitely not be overwriting an incoming traceparent. GCR should be name-spacing in this case, not using the W3C traceparent.

n-e · 2022-03-08T18:36:17Z

Here’s an example below with a (publicly accessible for now) cloud run service that echoes back the http request it has received.

The traceparent header received by the service has its parentId (and flags) changed.

curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'

Request served by localhost

HTTP/1.1 GET /

Host: test-nico-puofkkaada-ew.a.run.app
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*
Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a94c32b3846c1e53-01
X-Cloud-Trace-Context: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12199181237242043987;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https

felixbarny · 2022-03-09T08:24:22Z

At least the trace id is preserved, which is something.

What would help is if our UI would support showing traces that are incomplete. Meaning that if there are services A -> B -> C, and if only the spans from services A and C are in Elasticsearch, it should show these, even if C's parent (B) is not stored in Elasticsearch.

We have discussed these situations in the past already:

Events refering to a non existing parent event are not shown in waterfall. kibana#24827
[APM] Trace timeline: Show message when transactions are missing kibana#25117

However, I wonder if this changes things and whether we should look at ways to make this work.

@sqren do you have thoughts on that?

However, the fact that the sampling flag is overridden by Google Cloud Run is problematic, too.

Other agents have config options that make the agent use a elastic-apm-traceparent header, instead of the standard traceparent header. So (re-)introducing that for the Node.js agent could also be an option.

@elastic/apm-agent-node-js WDYT?

But I agree with @basepi that this seems like a weird behavior from Google Cloud Run at best.

trentm · 2022-03-09T17:50:14Z

What would help is if our UI would support showing traces that are incomplete.

I think we definitely should look at APM UI being able to show traces that (a) have missing spans and/or (b) are missing the root transaction. I would expect this to be a relatively common situation: A customer is migrating a system to Elastic Observability but can't monitor all services. Or a customer is using some load-balancer or proxy that supports W3C trace-context (so it changes the parentId), but Elastic doesn't support getting its trace data. Or a customer has one of their services written in a language for which Elastic doesn't have an APM agent, etc.

Other agents have config options that make the agent use a elastic-apm-traceparent header, instead of the standard traceparent header. So (re-)introducing that for the Node.js agent could also be an option.

@felixbarny Isn't that option about adding elastic-apm-traceparent in addition to traceparent for outgoing requests? My understanding is that even with use_elastic_traceparent_header=true our APM agent will still prefer an incoming traceparent header if it exists. That's what the Node.js APM agent currently does.

The Node.js APM agent currently does have useElasticTraceparentHeader, and it defaults to true. It just isn't currently documented (not sure why).

this seems like a weird behavior from Google Cloud Run at best.

Is it weird that Cloud Run changes the parent-id? Perhaps, from its point of view, it is adding a span to the trace (the span being its Cloud Run dispatcher/load-balancer/whatever-they-call-it service.

@n-e Are you able to look at any of these traces in Google Cloud's trace viewer? https://cloud.google.com/trace/docs/viewing-details I'm curious if they show an added span for the Cloud Run service that routes the incoming HTTP request.

I agree that it is problematic that Cloud Run ignores the incoming "sampled" flag. (They document that here: https://cloud.google.com/run/docs/trace#trace_sampling_rate) However, they are still compliant with the spec, which says "The following are a set of suggestions that vendors SHOULD use to increase vendor interoperability. ..." (https://www.w3.org/TR/trace-context/#sampled-flag).

sorenlouv · 2022-03-10T22:45:02Z

@sqren do you have thoughts on that?

I agree we could do a better job of showing orphaned services. Afaict we won't be able to connect the orphaned service to the tree. WDYT about placing it at the very end of the trace labelling it "orphan" or similar?

Correct trace

Suggested view if an intermediary service is missing

n-e · 2022-03-11T13:31:54Z

@n-e Are you able to look at any of these traces in Google Cloud's trace viewer?

Query:

curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'
Request served by localhost

HTTP/1.1 GET /

Host: test-nico-puofkkaada-ew.a.run.app
Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a7b54b18ba85b13c-01
X-Cloud-Trace-Context: ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12084647744699216188;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*

Cloud Trace (there are the matching traceId and parentTraceId on the right):

knoring1 · 2023-06-09T08:56:06Z

In the W3C Trace Context they use the example of adding the latest vendor specific parent id to the tracestate to allow the vendor to pick up the latest parent id recorded even if the parent id in the traceparent header has been modified by an intermediate system. Wouldn't implementing support for that in the apm agents solve the issue mentioned without the use of the elastic-apm-traceparent header?

ty-elastic · 2023-12-13T13:57:15Z

I ran into this as well. would it make any sense to have the receiving APM client prefer elastic-apm-traceparent to traceparent if it is present? With Cloud Run and Elastic APM instrumenting a node service, it is difficult to strip the modified traceparent header pre-Elastic APM agent, and if its there, elastic-apm-traceparent is ignored...

n-e added the discussion label Mar 8, 2022

axw mentioned this issue Nov 29, 2022

OpenTelemetry active message consumption spans are incorrectly classified as transactions elastic/apm-server#7001

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Traceparent header overriden by an intermediate service #609

Traceparent header overriden by an intermediate service #609

n-e commented Mar 8, 2022

felixbarny commented Mar 8, 2022

basepi commented Mar 8, 2022

n-e commented Mar 8, 2022 •

edited

Loading

felixbarny commented Mar 9, 2022

trentm commented Mar 9, 2022

sorenlouv commented Mar 10, 2022 •

edited

Loading

n-e commented Mar 11, 2022

knoring1 commented Jun 9, 2023

ty-elastic commented Dec 13, 2023

Traceparent header overriden by an intermediate service #609

Traceparent header overriden by an intermediate service #609

Comments

n-e commented Mar 8, 2022

felixbarny commented Mar 8, 2022

basepi commented Mar 8, 2022

n-e commented Mar 8, 2022 • edited Loading

felixbarny commented Mar 9, 2022

trentm commented Mar 9, 2022

sorenlouv commented Mar 10, 2022 • edited Loading

Correct trace

Suggested view if an intermediary service is missing

n-e commented Mar 11, 2022

knoring1 commented Jun 9, 2023

ty-elastic commented Dec 13, 2023

n-e commented Mar 8, 2022 •

edited

Loading

sorenlouv commented Mar 10, 2022 •

edited

Loading