Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebClient toBodilessEntity loses tracing context in Spring WebFlux 3.0.5 #30273

Closed
davidmelia opened this issue Apr 3, 2023 · 14 comments
Closed
Assignees
Labels
for: external-project Needs a fix in external project in: web Issues in web modules (web, webmvc, webflux, websocket) status: superseded An issue that has been superseded by another theme: observability An issue related to observability and tracing

Comments

@davidmelia
Copy link

Hi,

I'm diagnosing a production Spring WebFlux 3.0.5 bug where certain requests are losing their tracing context and I've narrowed this down to the following (example https://github.com/davidmelia/spring-cloud-function-zipkin/tree/bodiless_entity_bug_

  @GetMapping("/notWork")
  Mono<String> notWork() {
    log.info("1) this has trace id");
    return webClient.head().uri("https://httpbin.org/status/200").retrieve().toBodilessEntity().map(r -> {
      log.info("2) this does not have a trace id {}",r.getStatusCode());
      return r;
    }).thenReturn("OK");
  }

...

INFO [aid=-,tid=642af7c399283af5b17037a829cb556a,sid=b17037a829cb556a] 56133 --- [ctor-http-nio-3] example.DemoController                   : 1) this has trace id
INFO [aid=-,tid=,sid=] 56133 --- [ctor-http-nio-3] example.DemoController                   : 2) this does not have a trace id 200 OK

where toBodilessEntity loses the tracing context. N.B. I am taking advantage of Hooks.enableAutomaticContextPropagation();

The work around is

  @GetMapping("/works")
  Mono<String> works() {
    log.info("3) this has trace id");
    return webClient.head().uri("https://httpbin.org/status/200").retrieve().toEntity(String.class).map(r -> {
      log.info("4) this has trace id {}",r.getStatusCode());
      return r;
    }).thenReturn("OK");
  }

I assume toBodilessEntity should keep the trace context?

Thanks

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Apr 3, 2023
@bclozel bclozel transferred this issue from spring-projects/spring-boot Apr 3, 2023
@bclozel
Copy link
Member

bclozel commented Apr 3, 2023

Hello @davidmelia , thanks for raising this issue.
The tracing context is made available during the observation, not after. In this case, the toBodilessEntity (and other "final" methods in the fluent API) returns the result of the HTTP client operation, here a Mono<ResponseEntity<Void>>. We assume that when you're given the actual response, the client observation is finished and the metric/trace is recorded. Extending the tracing context while you're map-ing on the result would sound like a bug: the time spent on transforming the response into another object should not count against the HTTP client observation.

The workaround you're sharing is not a workaround in my opinion: this merely enables the context propagation for ThreadLocal - here you should see the tracing context for the HTTP server observation, which is still ongoing at that stage since this is being executed within a web controller.

Am I missing something?

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Apr 3, 2023
@davidmelia
Copy link
Author

davidmelia commented Apr 3, 2023

I don't quite follow you but we use trace IDs in production to log trace requests. I expect the server trace ID to be propagated through the stack. Using Spring Cloud Sleuth this was the case and the above example preserved the 'tid'. After upgrading to Spring Webflux 3 I can see trace IDs disappearing and I have narrowed down to the use of toBodilessEntity.

This is either a bug OR WebFlux users who use enableAutomaticContextPropagation should not use toBodilessEntity as it wipes out log trace IDs.

@bclozel
Copy link
Member

bclozel commented Apr 3, 2023

Sorry, I misunderstood the issue - I thought this was about the client trace, not about the server trace. Let me have a look into this.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Apr 3, 2023
@bclozel bclozel self-assigned this Apr 3, 2023
@bclozel
Copy link
Member

bclozel commented Apr 7, 2023

I had a deeper look and I did reproduce the problem.
I've checked the reactive pipeline for the toBodilessEntity method and it doesn't seem that we're disconnecting the chain in any way. The main difference with other operators could be here.

@davidmelia I've modified your sample application to ensure that the current Observation is in the reactor context right after the toBodilessEntity call and it is the case. At this stage, I was suspecting a subtle bug with the automatic context propagation feature. Upgrading to Spring Boot 3.0.6-SNAPSHOT (and reactor-core 3.5.5-SNAPSHOT) did solve the problem. I'm now seeing:

2023-04-07T11:22:46.381+02:00  INFO [aid=spring-cloud-function-zipkin,tid=642fe0e6e6e02ea934d44e45df0d661e,sid=34d44e45df0d661e] 23300 --- [ctor-http-nio-3] example.DemoController                   : 1) this has trace id
2023-04-07T11:22:47.156+02:00  INFO [aid=spring-cloud-function-zipkin,tid=,sid=d193b749e4691a89] 23300 --- [ctor-http-nio-3] example.DemoController                   : current observation name='http.server.requests', contextualName='null', error='null', lowCardinalityKeyValues=[exception='none', method='GET', outcome='SUCCESS', status='200', uri='UNKNOWN'], highCardinalityKeyValues=[http.url='/notWork'], map=[class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=642fe0e6e6e02ea934d44e45df0d661e/34d44e45df0d661e}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=0.795679716, duration(nanos)=7.95679716E8, startTimeNanos=5854912936959}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@646c1e62'], parentObservation=null
2023-04-07T11:22:47.158+02:00  INFO [aid=spring-cloud-function-zipkin,tid=642fe0e6e6e02ea934d44e45df0d661e,sid=34d44e45df0d661e] 23300 --- [ctor-http-nio-3] example.DemoController                   : 2) this does not have a trace id 200 OK

This change is available in my fork.
I'm closing this issue as it's probably been fixed in reactor-core directly.

Tagging @chemicL to let him know 👍

@bclozel bclozel closed this as not planned Won't fix, can't repro, duplicate, stale Apr 7, 2023
@bclozel bclozel added status: duplicate A duplicate of another issue for: external-project Needs a fix in external project theme: observability An issue related to observability and tracing and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Apr 7, 2023
@chemicL
Copy link
Member

chemicL commented Apr 7, 2023

@bclozel thanks for letting me know.

Actually, the latest changes in Reactor don't really fix the reported issue. The fact that you introduced tap operators to the chains make a difference, as now they are considered as Context-altering barriers, in which case the ThreadLocal values do get restored. The same as if a contextWrite(Function.identity()) happened before the map operator.

This helps as the server Context with an Observation is attached to the entire chain by the server. However, the response is delivered from Netty on the event loop. If a contextWrite or tap operator is not along the way, there's nothing to restore in the eyes of reactor-core.

The difference between toEntity and toBodilessEntity is that in case of the latter, the response body Mono is not executed, it is simply ignored:

handleBodyMono(response, response.bodyToMono(bodyClass)

vs

handleBodyMono(response, Mono.<Void>empty())

I'm not yet certain why delivering the entity via the body processing passes the contextWrite operator and in the case of an empty body it does not, but this needs investigating. I would leave the issue open until we understand what is happening. For now, obviously, the workaround is:

toBodilessEntity().contextWrite(Function.identity())

@bclozel
Copy link
Member

bclozel commented Apr 7, 2023

Reopening so we can investigate. This could be a Framework, Reactor Core or Reactor Netty issue.

@bclozel bclozel reopened this Apr 7, 2023
@bclozel bclozel added status: waiting-for-triage An issue we've not yet triaged or decided on and removed status: duplicate A duplicate of another issue for: external-project Needs a fix in external project labels Apr 7, 2023
@chemicL
Copy link
Member

chemicL commented Apr 7, 2023

The previous suggestion is incorrect, it has nothing to do with using Mono.empty() in the handleBodyToMono argument.

The reason for the different behaviour is the fact that upon termination, in case of toEntity, the Publisher completion signal travels via responseMono's contextWrite operator defined in the exchange() method, and the completion signal triggers calling the user's code:

	@Override
	public <T> Mono<ResponseEntity<T>> toEntity(Class<T> bodyClass) {
		return this.responseMono.flatMap(response ->
				WebClientUtils.mapToEntity(response,
						handleBodyMono(response, response.bodyToMono(bodyClass))));
	}

The case in toBodilessEntity is different, because the completion signal doesn't enter the responseMono any more - there is a new Mono left to subscribe to after the response is finished in this case, and it's the Mono.just(entity) created within the last line of the method in thenReturn(entity):

	@Override
	public Mono<ResponseEntity<Void>> toBodilessEntity() {
		return this.responseMono.flatMap(response ->
				WebClientUtils.mapToEntity(response, handleBodyMono(response, Mono.<Void>empty()))
						.flatMap(entity -> response.releaseBody() // this is the
						                           // difference
								.onErrorResume(WebClientUtils.WRAP_EXCEPTION_PREDICATE, exceptionWrappingFunction(response))
								.thenReturn(entity))
		);
	}

My current understanding is that the signals delivered from reactor-netty need to always pass a context restoring Subscriber. Whether reactor-netty should add such protection is debatable. I'm also unsure whether reactor-core should do it - we recently introduced protection against foreign Publishers that are wrapped using Flux.from(). In case signals are delivered from a different Thread, a dedicated Subscriber restores ThreadLocal values from the Subscriber Context.

The signal to the WebClient's Mono is delivered from reactor.netty.channel.FluxReceive#onInboundComplete. There should be a Subscriber that restores ThreadLocals directly after any signal delivered by reactor-netty. Where it should be handled is a matter for discussion.

@rstoyanchev rstoyanchev added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Apr 17, 2023
@chemicL
Copy link
Member

chemicL commented Apr 25, 2023

Here's a minimal reproducer for the issue:

@SpringBootApplication
public class WebFluxApplication {

    public static void main(String[] args) {
        SpringApplication.run(WebFluxApplication.class, args);
        Hooks.enableAutomaticContextPropagation();
    }

}

@RestController
class WebFluxController {

    Logger log = LoggerFactory.getLogger(WebFluxController.class);

    final WebClient webClient;

    WebFluxController(WebClient.Builder webClientBuilder) {
        this.webClient =
                webClientBuilder
//                        .clientConnector(new JettyClientHttpConnector())
//                        .clientConnector(new JdkClientHttpConnector())
                        .baseUrl("https://example.com")
                        .build();
    }

    @GetMapping("/webClient")
    public Mono<String> webClient() {
        log.info("Inside controller");
        return this.webClient
                .get()
                .retrieve()
                .toBodilessEntity()
                .doOnNext(r -> log.info("Missing trace-id"))
                .map(r -> r.getStatusCode().toString());
    }
}

With the application.yaml containing:

logging:
    pattern:
        level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

In case of the commented lines (JdkClientHttpConnector|JettyClientHttpConnector) the problem does not expose itself with reactor-core 3.5.5 thanks to reactor/reactor-core#3418.

For JdkClientHttpConnector, the resulting response is delivered as a CompletionStage wrapped into a Mono:
https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/JdkClientHttpConnector.java#L107

In this case, we treat the CompletionStage as a foreign Thread-changing environment and restore the ThreadLocal values.

For JettyClientHttpConnector, we're dealing with a foreign Publisher, which is also wrapped into a Mono and the ThreadLocal values are restored by calling Mono.fromDirect():
https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/JettyClientHttpConnector.java#L129

The default connector for reactor-netty deals with reactor types, which assume to be aware of ThreadLocal restoration, but they are not in case of FluxReceive: https://github.com/reactor/reactor-netty/blob/1.0.x/reactor-netty-core/src/main/java/reactor/netty/channel/FluxReceive.java#L413

When any signal is delivered as a result of Netty handling, it should be handling the ThreadLocal restoration.

One way to do it would be to implement the restoration in reactor-netty. Another would be to wrap the result with a contextWrite(Function.identity()) in spring-framework codebase, e.g. here: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/ReactorClientHttpConnector.java#L119.

IMO, reactor-netty should implement this restoration. If someone deals with reactor-netty's HttpClient directly, the same issue would need to be handled. If the restoration is performed on reactor-netty side if the context-propagation library is available and automatic context propagation is enabled, any downstream users wouldn't need to worry about this concern.

@davidmelia
Copy link
Author

Hi guys, any more movement on this?

Is this something we will just live with by adding contextWrite(Function.identity()) or is it a bug?

Thanks

@chemicL
Copy link
Member

chemicL commented May 26, 2023

@davidmelia we brainstormed on this and the course of action is that we intend do handle this in reactor-core: reactor/reactor-core#3478. Once this work is ready, depending on whether other projects need to do anything special, we'll create more issues. reactor/reactor-core#3468 contains the current roadmap for the context-propagation effort related to the Hooks.enableAutomaticContextPropagation feature.

@bclozel
Copy link
Member

bclozel commented Jul 5, 2023

I'm closing this issue in favor of the reactor issues listed above. We'll reopen this if something needs to be done here.

@bclozel bclozel closed this as not planned Won't fix, can't repro, duplicate, stale Jul 5, 2023
@bclozel bclozel added status: superseded An issue that has been superseded by another for: external-project Needs a fix in external project and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Jul 5, 2023
@agorbachenko
Copy link

So .tap() is a workaround to restore context in the case of .releaseBody():

.exchangeToMono(response -> response.releaseBody()
                .tap(contextView -> new DefaultSignalListener<>() {
                })

I have tried this but with no success: cc @chemicL

One way to do it would be to implement the restoration in reactor-netty. Another would be to wrap the result with a contextWrite(Function.identity()) in spring-framework codebase, e.g. here: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/ReactorClientHttpConnector.java#L119.

@chemicL
Copy link
Member

chemicL commented Aug 3, 2023

@agorbachenko I don't have the context about what you are doing in your application around this code. Please create a minimal reproducer and I can try to offer my help via the Micrometer Slack, where this conversation started.

Using tap the way you did is not the same as contextWrite(Function.identity()). However, where I tested it, it does the job for my scenario (tap is also a barrier for Context, as this operator can alter it). The flow can depend on the surrounding operators, so let's have a concrete example to discuss.

@chemicL
Copy link
Member

chemicL commented Dec 15, 2023

For those coming to this issue, reactor-core 3.6.0 brings a solution to it. The problem is described in our blog post.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for: external-project Needs a fix in external project in: web Issues in web modules (web, webmvc, webflux, websocket) status: superseded An issue that has been superseded by another theme: observability An issue related to observability and tracing
Projects
None yet
Development

No branches or pull requests

6 participants