Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Traceparent header overriden by an intermediate service #609

Open
n-e opened this issue Mar 8, 2022 · 9 comments
Open

Traceparent header overriden by an intermediate service #609

n-e opened this issue Mar 8, 2022 · 9 comments

Comments

@n-e
Copy link

n-e commented Mar 8, 2022

In the W3C TraceContext specification it's possible that a system is monitored by multiple tracing tools.

Example: A -> B -> C (Service A calls Service B which calls Service C). A and C are monitored by Elastic APM, B is monitored by another service.

Elastic APM is unable to link the transactions in A and C, since C's parent is a transaction in B (since the third-party service in B writes a traceparent header), which Elastic APM doesn't know about since B is monitored by a third-party service.

If we control B, the problem is easy to fix:

  • either monitor B with Elastic APM
  • or don't monitor it and forward the traceparent header

However, Google Cloud Run does its own monitoring, which doesn't look like it can be disabled (in this case A is a downstream service, B is the Cloud Run runtime, and C is the service hosted in Cloud Run).

As a workaround, I have forked the elastic-apm-node agent and made it use the elastic-apm-traceparent header by default instead of the traceparent header. However I have no idea what a good long-term solution would be. Any ideas?

Related: #286

@n-e n-e added the discussion label Mar 8, 2022
@felixbarny
Copy link
Member

We're currently planning the implementation for what's been discussed in #286: #600

But I'm not sure that would cover the scenario you're describing. Are you saying that even when sending a request with a traceparent header to the Google Cloud Run runtime (B), it will create a new trace id when it calls service C?

@basepi
Copy link
Contributor

basepi commented Mar 8, 2022

This seems like a bug in Google Cloud Run. It should definitely not be overwriting an incoming traceparent. GCR should be name-spacing in this case, not using the W3C traceparent.

@n-e
Copy link
Author

n-e commented Mar 8, 2022

Here’s an example below with a (publicly accessible for now) cloud run service that echoes back the http request it has received.

The traceparent header received by the service has its parentId (and flags) changed.

curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'

Request served by localhost

HTTP/1.1 GET /

Host: test-nico-puofkkaada-ew.a.run.app
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*
Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a94c32b3846c1e53-01
X-Cloud-Trace-Context: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12199181237242043987;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https

@felixbarny
Copy link
Member

At least the trace id is preserved, which is something.

What would help is if our UI would support showing traces that are incomplete. Meaning that if there are services A -> B -> C, and if only the spans from services A and C are in Elasticsearch, it should show these, even if C's parent (B) is not stored in Elasticsearch.

We have discussed these situations in the past already:

However, I wonder if this changes things and whether we should look at ways to make this work.

@sqren do you have thoughts on that?


However, the fact that the sampling flag is overridden by Google Cloud Run is problematic, too.

Other agents have config options that make the agent use a elastic-apm-traceparent header, instead of the standard traceparent header. So (re-)introducing that for the Node.js agent could also be an option.

@elastic/apm-agent-node-js WDYT?


But I agree with @basepi that this seems like a weird behavior from Google Cloud Run at best.

@trentm
Copy link
Member

trentm commented Mar 9, 2022

What would help is if our UI would support showing traces that are incomplete.

I think we definitely should look at APM UI being able to show traces that (a) have missing spans and/or (b) are missing the root transaction. I would expect this to be a relatively common situation: A customer is migrating a system to Elastic Observability but can't monitor all services. Or a customer is using some load-balancer or proxy that supports W3C trace-context (so it changes the parentId), but Elastic doesn't support getting its trace data. Or a customer has one of their services written in a language for which Elastic doesn't have an APM agent, etc.

Other agents have config options that make the agent use a elastic-apm-traceparent header, instead of the standard traceparent header. So (re-)introducing that for the Node.js agent could also be an option.

@felixbarny Isn't that option about adding elastic-apm-traceparent in addition to traceparent for outgoing requests? My understanding is that even with use_elastic_traceparent_header=true our APM agent will still prefer an incoming traceparent header if it exists. That's what the Node.js APM agent currently does.

The Node.js APM agent currently does have useElasticTraceparentHeader, and it defaults to true. It just isn't currently documented (not sure why).

this seems like a weird behavior from Google Cloud Run at best.

Is it weird that Cloud Run changes the parent-id? Perhaps, from its point of view, it is adding a span to the trace (the span being its Cloud Run dispatcher/load-balancer/whatever-they-call-it service.

@n-e Are you able to look at any of these traces in Google Cloud's trace viewer? https://cloud.google.com/trace/docs/viewing-details I'm curious if they show an added span for the Cloud Run service that routes the incoming HTTP request.

I agree that it is problematic that Cloud Run ignores the incoming "sampled" flag. (They document that here: https://cloud.google.com/run/docs/trace#trace_sampling_rate) However, they are still compliant with the spec, which says "The following are a set of suggestions that vendors SHOULD use to increase vendor interoperability. ..." (https://www.w3.org/TR/trace-context/#sampled-flag).

@sorenlouv
Copy link
Member

sorenlouv commented Mar 10, 2022

@sqren do you have thoughts on that?

I agree we could do a better job of showing orphaned services. Afaict we won't be able to connect the orphaned service to the tree. WDYT about placing it at the very end of the trace labelling it "orphan" or similar?

Correct trace

Suggested view if an intermediary service is missing

@n-e
Copy link
Author

n-e commented Mar 11, 2022

@n-e Are you able to look at any of these traces in Google Cloud's trace viewer?

Query:

curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'
Request served by localhost

HTTP/1.1 GET /

Host: test-nico-puofkkaada-ew.a.run.app
Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a7b54b18ba85b13c-01
X-Cloud-Trace-Context: ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12084647744699216188;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*

Cloud Trace (there are the matching traceId and parentTraceId on the right):
image

@knoring1
Copy link

knoring1 commented Jun 9, 2023

In the W3C Trace Context they use the example of adding the latest vendor specific parent id to the tracestate to allow the vendor to pick up the latest parent id recorded even if the parent id in the traceparent header has been modified by an intermediate system. Wouldn't implementing support for that in the apm agents solve the issue mentioned without the use of the elastic-apm-traceparent header?

@ty-elastic
Copy link

I ran into this as well. would it make any sense to have the receiving APM client prefer elastic-apm-traceparent to traceparent if it is present? With Cloud Run and Elastic APM instrumenting a node service, it is difficult to strip the modified traceparent header pre-Elastic APM agent, and if its there, elastic-apm-traceparent is ignored...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants