Traceparent header overridden by an intermediate service #609
Comments
We're currently planning the implementation for what's been discussed in #286: #600 But I'm not sure that would cover the scenario you're describing. Are you saying that even when sending a request with a |
This seems like a bug in Google Cloud Run. It should definitely not be overwriting an incoming traceparent. GCR should be name-spacing in this case, not using the W3C traceparent. |
Here’s an example below with a (publicly accessible for now) cloud run service that echoes back the http request it has received. The traceparent header received by the service has its parentId (and flags) changed.
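To make the difference described above concrete, here is a minimal sketch that splits two `traceparent` headers into their W3C fields and reports which fields differ. The header values are illustrative (taken from the style of the W3C spec examples), not the ones captured from the Cloud Run service, and `parse_traceparent` and `changed_fields` are hypothetical helper names.

```python
from typing import List, NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The least significant bit of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> TraceParent:
    """Split a version-00 traceparent header into its four fields."""
    parts = header.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = parts
    if (len(version), len(trace_id), len(parent_id), len(flags)) != (2, 32, 16, 2):
        raise ValueError(f"unexpected field lengths in traceparent: {header!r}")
    return TraceParent(version, trace_id, parent_id, flags)


def changed_fields(sent: str, received: str) -> List[str]:
    """Return the names of the fields that differ between two headers."""
    a, b = parse_traceparent(sent), parse_traceparent(received)
    return [name for name in TraceParent._fields if getattr(a, name) != getattr(b, name)]


# Illustrative values only: same trace id, different parent id and flags.
sent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
received = "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-00"
print(changed_fields(sent, received))
```

With these inputs the trace id matches while the parent id and the sampled flag do not, which is the pattern reported in this comment.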
At least the trace id is preserved, which is something. What would help is if our UI would support showing traces that are incomplete. Meaning that if there are services A -> B -> C, and if only the spans from services A and C are in Elasticsearch, it should show these, even if C's parent (B) is not stored in Elasticsearch. We have discussed these situations in the past already:
However, I wonder if this changes things and whether we should look at ways to make this work. @sqren do you have thoughts on that? However, the fact that the sampling flag is overridden by Google Cloud Run is problematic, too. Other agents have config options that make the agent use a @elastic/apm-agent-node-js WDYT? But I agree with @basepi that this seems like a weird behavior from Google Cloud Run at best. |
I think we definitely should look at APM UI being able to show traces that (a) have missing spans and/or (b) are missing the root transaction. I would expect this to be a relatively common situation: A customer is migrating a system to Elastic Observability but can't monitor all services. Or a customer is using some load-balancer or proxy that supports W3C trace-context (so it changes the parentId), but Elastic doesn't support getting its trace data. Or a customer has one of their services written in a language for which Elastic doesn't have an APM agent, etc.
@felixbarny Isn't that option about adding The Node.js APM agent currently does have
Is it weird that Cloud Run changes the parent-id? Perhaps, from its point of view, it is adding a span to the trace (the span being its Cloud Run dispatcher/load-balancer/whatever-they-call-it service). @n-e Are you able to look at any of these traces in Google Cloud's trace viewer? https://cloud.google.com/trace/docs/viewing-details I'm curious if they show an added span for the Cloud Run service that routes the incoming HTTP request. I agree that it is problematic that Cloud Run ignores the incoming "sampled" flag. (They document that here: https://cloud.google.com/run/docs/trace#trace_sampling_rate) However, they are still compliant with the spec, which says "The following are a set of suggestions that vendors SHOULD use to increase vendor interoperability. ..." (https://www.w3.org/TR/trace-context/#sampled-flag). |
I agree we could do a better job of showing orphaned services. Afaict we won't be able to connect the orphaned service to the tree. WDYT about placing it at the very end of the trace and labelling it "orphan" or similar? (Screenshots: "Correct trace"; "Suggested view if an intermediary service is missing") |
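The "orphan at the end" idea could be sketched roughly as follows. This is only an illustration of the ordering logic, not how the APM UI is implemented; the `Span` type, the `layout_trace` helper, and the sample ids are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    id: str
    parent_id: Optional[str]
    name: str


def layout_trace(spans: List[Span]) -> List[Tuple[int, str]]:
    """Return (depth, label) rows: rooted trees first, orphaned subtrees last."""
    known = {s.id for s in spans}
    children: Dict[str, List[Span]] = {}
    roots: List[Span] = []
    orphans: List[Span] = []
    for s in spans:
        if s.parent_id is None:
            roots.append(s)
        elif s.parent_id in known:
            children.setdefault(s.parent_id, []).append(s)
        else:
            # The parent was never stored (e.g. it belongs to a third-party tracer).
            orphans.append(s)

    rows: List[Tuple[int, str]] = []

    def walk(span: Span, depth: int, label: str) -> None:
        rows.append((depth, label))
        for child in children.get(span.id, []):
            walk(child, depth + 1, child.name)

    for root in roots:
        walk(root, 0, root.name)
    for orphan in orphans:
        walk(orphan, 0, f"{orphan.name} (orphan)")
    return rows


# A -> B -> C, where B's span "b9" is not stored.
spans = [
    Span("a1", None, "service A"),
    Span("a2", "a1", "GET /b"),
    Span("c1", "b9", "service C"),
    Span("c2", "c1", "SELECT"),
]
for depth, label in layout_trace(spans):
    print("  " * depth + label)
```

The orphaned subtree keeps its own internal structure; only its attachment point in the main tree is unknown, so it is listed after the rooted part of the trace.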
Query:
Cloud Trace (there are the matching traceId and parentTraceId on the right): |
In the W3C Trace Context they use the example of adding the latest vendor specific parent id to the |
I ran into this as well. Would it make any sense to have the receiving APM client prefer |
In the W3C TraceContext specification it's possible that a system is monitored by multiple tracing tools.
Example: A -> B -> C (Service A calls Service B which calls Service C). A and C are monitored by Elastic APM, B is monitored by another service.
Elastic APM is unable to link the transactions in A and C, since C's parent is a transaction in B (since the third-party service in B writes a traceparent header), which Elastic APM doesn't know about since B is monitored by a third-party service.
If we control B, the problem is easy to fix:
traceparent header
However, Google Cloud Run does its own monitoring, which doesn't look like it can be disabled (in this case A is a downstream service, B is the Cloud Run runtime, and C is the service hosted in Cloud Run).
As a workaround, I have forked the elastic-apm-node agent and made it use the elastic-apm-traceparent header by default instead of the traceparent header. However I have no idea what a good long-term solution would be. Any ideas?
Related: #286
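For illustration, the header preference used in the workaround can be sketched as below. This is not the agent's actual code (the fork is in Node.js and is not shown here); `select_traceparent` and its `prefer_vendor` parameter are hypothetical names, and the header values are placeholders.

```python
from typing import Mapping, Optional

VENDOR_HEADER = "elastic-apm-traceparent"
W3C_HEADER = "traceparent"


def select_traceparent(headers: Mapping[str, str], prefer_vendor: bool = True) -> Optional[str]:
    """Pick the trace-context header to continue from, vendor header first by default."""
    # HTTP header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    order = (VENDOR_HEADER, W3C_HEADER) if prefer_vendor else (W3C_HEADER, VENDOR_HEADER)
    for name in order:
        value = lowered.get(name)
        if value:
            return value
    return None


# Placeholder values: the intermediary rewrote traceparent but left the vendor header alone.
incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-00",
    "Elastic-Apm-Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}
print(select_traceparent(incoming))
```

The sketch assumes the intermediary passes unknown headers through unchanged; if the vendor header is absent, it falls back to the W3C header.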