Incorrect span relationship in concurrent scenarios #4639

Open
zmapleshine opened this issue Nov 15, 2021 · 10 comments
Labels
bug Something isn't working needs author feedback Waiting for additional feedback from the author repro provided

Comments

@zmapleshine

Describe the bug

I used the "Spring Cloud Gateway" component to invoke a downstream service, and I found that spans were associated incorrectly. Under concurrent load, I see the following parent-child relationships:

case1:

GATEWAY HTTP GET (io.opentelemetry.netty-4.1:server)
—— GATEWAY FilteringWebHandler.handle (io.opentelemetry.spring-webflux-5.0)
—— GATEWAY HTTP GET (io.opentelemetry.netty-4.1:client)
————SERVICE1 /test-service1 (io.opentelemetry.undertow-1.4:server)
...

I think the 3rd span should be the 2nd span's child, or am I wrong?

case 2 (incorrect):

—— GATEWAY FilteringWebHandler.handle (io.opentelemetry.spring-webflux-5.0)
—— SERVICE1 /test-service1 (io.opentelemetry.undertow-1.4:server)
...

Here the netty spans are missing and the relationship between the remaining spans is incorrect.

Steps to reproduce

Send requests to service1 through Spring Cloud Gateway in rapid succession.

What did you expect to see?
Something like case 1, except that the netty client span's parent should be the spring-webflux span?

What did you see instead?
The two cases shown above.

What version are you using?
1.7.2

Environment
Compiler: jdk 11.0.10
OS: macOS Monterey 12.0.1

@zmapleshine zmapleshine added the bug Something isn't working label Nov 15, 2021
@zmapleshine
Author

zmapleshine commented Nov 15, 2021

A simple project can reproduce this problem:
https://github.com/zmapleshine/otel-instrumentation-reporters-GatewayWebflux

This problem occurs in all historical versions.

@tydhot
Member

tydhot commented Nov 30, 2021

I ran into the same problem.

@tydhot
Member

tydhot commented Nov 30, 2021

@mateuszrzeszutek I can provide some information about the lost spans here. Requests that time out when going through Spring Cloud Gateway consistently lose the netty server and client spans, as in case 2.

@mateuszrzeszutek
Member

Hey @tydhot ,
Sorry, we haven't had enough time to take a stab at this issue. If you have some more info (or another repro scenario) we'd be very grateful for that. Thanks!

@tydhot
Member

tydhot commented Nov 30, 2021

@mateuszrzeszutek I think I've located the cause of the problem.
When Spring Cloud Gateway cancels a request because of a timeout or for some other reason, the filter that processes the response does not implement 'doOnCancel', so the netty client span for that request is lost. This method is implemented in webflux, so only the webflux span is left. I think the fix for this problem is a little complicated, but I'd like to try it. Can you give me some suggestions?
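
To make the suspected mechanism concrete, here is a minimal sketch in plain Java. It is not Reactor, Spring Cloud Gateway, or the OpenTelemetry API: `FakeSpan`, `FakeCall`, `Signal`, and `spanEndedAfter` are hypothetical names made up for illustration. It assumes only what is described above, namely that a span is ended from callbacks attached to the call, and that cancellation is a separate signal that triggers neither the success nor the error callback.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class CancelSpanSketch {

    /** Hypothetical stand-in for a tracing span; NOT the OpenTelemetry API. */
    static final class FakeSpan {
        private boolean ended;
        void end() { ended = true; }
        boolean isEnded() { return ended; }
    }

    /** The three ways a reactive-style call can finish. */
    enum Signal { SUCCESS, ERROR, CANCEL }

    /** Hypothetical stand-in for a decorated publisher: callbacks keyed by signal. */
    static final class FakeCall {
        private final Map<Signal, List<Runnable>> hooks = new EnumMap<>(Signal.class);

        FakeCall on(Signal signal, Runnable hook) {
            hooks.computeIfAbsent(signal, k -> new ArrayList<>()).add(hook);
            return this;
        }

        /** Runs only the callbacks registered for the given signal. */
        void finish(Signal signal) {
            hooks.getOrDefault(signal, List.of()).forEach(Runnable::run);
        }
    }

    /** Returns whether the span was ended once the call finished with the given signal. */
    static boolean spanEndedAfter(Signal signal, boolean handleCancel) {
        FakeSpan span = new FakeSpan();
        FakeCall call = new FakeCall()
                .on(Signal.SUCCESS, span::end)
                .on(Signal.ERROR, span::end);
        if (handleCancel) {
            // The piece that is assumed to be missing in the gateway filter path.
            call.on(Signal.CANCEL, span::end);
        }
        call.finish(signal);
        return span.isEnded();
    }

    public static void main(String[] args) {
        System.out.println(spanEndedAfter(Signal.SUCCESS, false)); // true
        System.out.println(spanEndedAfter(Signal.CANCEL, false));  // false: span never ended
        System.out.println(spanEndedAfter(Signal.CANCEL, true));   // true
    }
}
```

In this toy model, a span that is never ended on cancel is simply never reported, which would match the missing netty spans in case 2. Whether the real instrumentation behaves this way still needs to be confirmed against the actual code.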

@mateuszrzeszutek
Member

I see - you can probably take a look at DispatcherHandlerInstrumentation first; for some reason we're calling .doOnSuccess(...).doOnError(...).doOnCancel(...) there instead of HandlerAdapterInstrumentation. Maybe the DispatcherHandler call happens a bit too late and if the returned mono was decorated in HandlerAdapter then it'd work?

CC @trask

@tydhot
Member

tydhot commented Nov 30, 2021

@mateuszrzeszutek thanks, I'll give it a try in the next few days.

@PhilHardwick

PhilHardwick commented Jul 27, 2022

For anyone who comes here later on: we saw this in our Spring Cloud Gateway service, and it was also causing incoming requests to be associated with the unclosed trace (so we had massive traces that never ended and couldn't be easily sampled). We ended up turning off the spring-webflux instrumentation for the Spring Cloud Gateway service, and we didn't really lose anything, since the webflux instrumentation wasn't providing much value on a gateway. Hope this helps!

FWIW I tried to replicate this in a test and apply the fix you suggested, but I couldn't replicate it.

@123liuziming
Contributor

I ran into the same problem as case 1. The issue is #9495.

@breedx-splk
Contributor

Does #9572 also address this issue?

@trask trask added the needs author feedback Waiting for additional feedback from the author label Dec 22, 2024