Remove Observation.Context from WebClient request #32198

Closed
erimolNordnet opened this issue Feb 5, 2024 · 24 comments

erimolNordnet commented Feb 5, 2024

We have had an app running in production for 3 years without any issue. We experienced a memory leak when bumping Spring dependencies and have now traced it down to a single library upgrade: spring-webflux 6.1.1 -> 6.1.2 (no other code change).

It seems like there is a circular reference between ClientRequestObservationContext and ClientRequest.Builder that cannot be garbage collected. The behaviour is deterministic in the sense that I managed to recreate the issue multiple times; it takes a couple of hours before running out of heap space. One instance of the application is quite heavily used, with 50-100 requests/s and lots of downstream calls to other microservices via WebClient, which is where the issue seems to be.

Changes that I think could be related:

Please let me know if I can provide more information; this is my first time posting a bug report here.

// Erik

I have taken heap dumps multiple times and the retained memory keeps increasing for the mentioned types. Attaching screenshots of the heap dump:

[Screenshot: heap dump]

[Screenshot: heap before update]

[Screenshot: heap after update]

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Feb 5, 2024
@jhoeller jhoeller added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Feb 5, 2024
@bclozel bclozel self-assigned this Feb 5, 2024
bclozel (Member) commented Feb 5, 2024

Thanks for the report.
Unfortunately, I can't reproduce this issue locally. I'm using Spring Boot 3.2.2 and I've locked the Spring Framework version to ext['spring-framework.version'] = "6.1.2".

I'm using the following controller to run many requests with the WebClient:

package com.example.webclientleak;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;

@RestController
public class TestController {

    private final WebClient webClient;

    public TestController(WebClient.Builder builder) {
        this.webClient = builder.baseUrl("http://localhost:8080").build();
    }

    @GetMapping("/test")
    public String test() {
        return "TEST";
    }

    @GetMapping("/")
    public Mono<String> index(@RequestParam int count) {
        return Flux.range(1, count)
                .flatMap(i -> this.webClient.get().uri("/test").retrieve().bodyToMono(String.class))
                .last();
    }
}

I ran many queries against the main endpoint (so millions of WebClient requests) and I don't see any memory leak in my tools. The Java heap is not growing and I can consistently bring it down by triggering manual GCs.

I don't understand why #31702 would change anything, as this commit doesn't change the nature of this "cycle": the observation context was already pointing to the request builder and the request (which itself contains the context as an attribute). This change merely changes the ordering of things. I don't think cycles like this cause memory leaks in the first place, as they're not reachable from GC roots once the request is sent. Do you have a different perspective on this?
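To illustrate that last point, a minimal sketch (two hypothetical classes, not the Spring ones): a mutual reference cycle is collected as soon as neither object is reachable from a GC root, so a cycle on its own does not explain a leak.

import java.lang.ref.WeakReference;

public class CycleDemo {

    static class A { B b; }
    static class B { A a; }

    public static void main(String[] args) throws InterruptedException {
        A a = new A();
        B b = new B();
        a.b = b;   // A -> B
        b.a = a;   // B -> A, forming a reference cycle

        WeakReference<A> ref = new WeakReference<>(a);
        a = null;
        b = null;  // the cycle is no longer reachable from any GC root

        System.gc();        // only a hint; collection is not strictly guaranteed
        Thread.sleep(100);
        System.out.println("cycle collected: " + (ref.get() == null));
    }
}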

It would be interesting to find out why those context instances are still reachable in your case and what's holding a reference to them. In my case, I only had a couple of them in memory, as they were held by the observation registry, waiting to be sent/processed.

Additionally, any indication of how to reproduce this locally would be very welcome.

Thanks!

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Feb 5, 2024
erimolNordnet (Author) commented Feb 5, 2024

Not really sure either, but the heap dump indicates some memory is retained, and from what I can see the change happened to those classes. That was my only lead.
From 6.1.2:

ClientRequestObservationContext observationContext = new ClientRequestObservationContext(requestBuilder);

and before in 6.1.1 it was created without the request:

ClientRequestObservationContext observationContext = new ClientRequestObservationContext();

Also, this is in 6.1.2:

ClientRequest request = requestBuilder
        .attribute(ClientRequestObservationContext.CURRENT_OBSERVATION_CONTEXT_ATTRIBUTE, observationContext)
        .build();

At least it feels related, given the version bump and what I can see from the dump.

Will give it a try locally in my integration test to see if I can hammer one endpoint; I haven't been able to reproduce it locally so far either.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 5, 2024
erimolNordnet (Author) commented:

I'll give you the trace from the heap dump as well:
[Screenshot 2024-02-05 at 14:20:48]

[Screenshot 2024-02-05 at 14:22:32]

bclozel (Member) commented Feb 5, 2024

Did the reactor-netty dependency version change at all between the two?
Can you replicate that on a smaller sample?

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Feb 5, 2024
erimolNordnet (Author) commented:

The only change was webflux; I did a lot of tests to isolate which change caused the issue. I explicitly added this dependency, and the only change was from 6.1.1 to 6.1.2:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-webflux</artifactId>
    <version>6.1.2</version>
</dependency>

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 5, 2024
@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Feb 5, 2024
erimolNordnet (Author) commented:

> Can you replicate that on a smaller sample?

How do you mean? Like a test app, or the amount of traffic?

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 5, 2024
bclozel (Member) commented Feb 5, 2024

I created a small project on start.spring.io and added the controller I shared in a comment above. This doesn't reproduce the memory leak after millions of requests. I'm thinking that I'm probably missing some part of the setup. Maybe a specific Reactor version is required, or a specific configuration property? Or maybe the server part needs to be using MVC, not WebFlux, which I was using in my sample? Maybe this only happens on a particular JVM version? Maybe an actual observation registry implementation needs to be involved?

I'm not saying this is not a bug on our side, but it's hard to fix a memory leak without reproducing it and seeing where the allocation happens. If you can reproduce the issue on a smaller sample, I'd be happy to have a look and work on a fix. Without that, my best bet would be to revert the change I introduced in 6.1.2, but this would come with zero guarantees (you could try it yourself in your application by using spring-webflux 6.1.2 and adding the former ClientRequestObservationContext implementation to your application to override the changes).

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Feb 5, 2024
erimolNordnet (Author) commented:

Totally understand. I had the ambition to recreate it locally as well, without success, so creating a smaller sample is not that easy; I have only seen this in live traffic after 2-3 hours of running.

We are also using WebFlux in the server app. I could try your suggestion to include the former impl, but even if that works, it's not the best approach going forward...

bclozel (Member) commented Feb 5, 2024

Your previous screenshot points to HttpClientConfig and org.springframework.http.client.reactive.ReactorClientHttpConnector#connect being GC roots and keeping references to requests around (unless this screenshot is not about GC roots?).

> We are also using WebFlux in the server app. I could try your suggestion to include the former impl, but even if that works, it's not the best approach going forward...

I'm not suggesting that as a solution, but rather as another step to confirm that this change is the culprit, even though we can't reproduce it locally. To be honest, I'm not sure how the change in 6.1.2 could create a problem like this, as the "reference cycle" existed already.

@spring-projects-issues spring-projects-issues removed the status: waiting-for-feedback We need additional information before we can continue label Feb 5, 2024
erimolNordnet (Author) commented:

> I created a small project on start.spring.io and added the controller I shared in a comment above. This doesn't reproduce the memory leak after millions of requests. I'm thinking that I'm probably missing some part of the setup. Maybe a specific Reactor version is required, or a specific configuration property?

I have run with lots of different versions of the other dependencies (Reactor, Netty, and so on), but it seems like this patch version is the one that triggers the behaviour. I guess it could be some other component that causes the issue; hard to tell.

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage An issue we've not yet triaged or decided on and removed status: feedback-provided Feedback has been provided labels Feb 5, 2024
@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 5, 2024
@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Feb 5, 2024
@snicoll snicoll changed the title from "Bug: Memory leak ClientRequestObservationContext when bumping spring-webflux 6.1.1 -> 6.1.2" to "Memory leak ClientRequestObservationContext when bumping spring-webflux 6.1.1 -> 6.1.2" Feb 5, 2024
erimolNordnet (Author) commented:

Some update:

We have managed to avoid the memory leak by using JettyClientHttpConnector instead of ReactorClientHttpConnector (a configuration sketch follows the list below). To summarise, it feels like a combination of 3 different things:

  • Upgrading from version 6.1.1 -> 6.1.2
  • Using the default ReactorClientHttpConnector
  • Our application flow (one thing that stood out was that we had quite a lot of cancellations on outgoing HTTP calls; this component serves as a kind of BFF and does multiple parallel requests). We got a lot of errors in our log once we shifted to the Jetty connector. (Will try to look into that as well later on and try to recreate it.)
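For reference, a minimal sketch of this kind of connector swap (illustrative only, not our exact config; the bean name is made up, and it also requires the Jetty reactive client on the classpath, e.g. org.eclipse.jetty:jetty-reactive-httpclient):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.JettyClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;

@Configuration
public class JettyWebClientConfig {

    @Bean
    public WebClient jettyWebClient(WebClient.Builder builder) {
        // Swap the default Reactor Netty connector for the Jetty-based one
        return builder
                .clientConnector(new JettyClientHttpConnector())
                .build();
    }
}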

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 12, 2024

arana3 commented Feb 16, 2024

@erimolNordnet -

In the snippet you shared here, it does not seem like you are using Spring Boot's managed WebClient builder. Per the reference documentation, the OOTB managed builder has the observation component pre-attached:

https://docs.spring.io/spring-framework/reference/integration/observability.html#observability.http-client.webclient

It's a long shot, but I wonder whether the issue is reproducible if you use the OOTB Spring-managed builder (a rough sketch of the distinction follows below). Of course, the other challenge is that there is no reproducer app (yet).
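A sketch of the distinction I mean, assuming a Spring Boot app (class and field names are made up for illustration):

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;

@Service
public class DownstreamClient {

    private final WebClient managed;
    private final WebClient unmanaged;

    public DownstreamClient(WebClient.Builder builder) {
        // Boot's auto-configured builder: observation instrumentation is pre-attached
        this.managed = builder.baseUrl("http://localhost:8080").build();

        // A manually created client: no observation instrumentation attached
        this.unmanaged = WebClient.create("http://localhost:8080");
    }
}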

bclozel (Member) commented Feb 16, 2024

I think this currently points to a "connection leak", or at least a setup where many connections are cancelled or closed and the connection pool is taking some time to invalidate and clean up resources. The change in Spring Framework is likely to make this behavior more obvious by attaching more memory to the connection attribute?

I'll close this issue for now since I still haven't found a way to reproduce this. We'll reopen as soon as we have a reproducer so we can investigate.

@bclozel bclozel closed this as not planned Feb 16, 2024
@bclozel bclozel added status: invalid An issue that we don't feel is valid and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Feb 16, 2024
konner91 commented:

Hi all,

I upgraded our application from Spring Boot 3.0.13 to 3.2.2 and I see the same issue as @erimolNordnet reported. At the moment the issue appears only in our background job, where we make a lot of HTTP requests and the CPU usage is very high. Before we make a request we load some data, which can be multiple megabytes. In the screenshot the CancelableContinuationImpl contains the data that was loaded before. It looks like the Netty client underneath is not releasing the memory after the request. Our WebClient config uses a connection pool with the lifo setting, and evictInBackground is set to 45 seconds (a configuration sketch follows below). In our API instance we couldn't see this issue. Maybe this helps to get an idea of why this is happening. Our quick fix is to pin spring-webflux to 6.1.1.
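Roughly how such a pool configuration looks, as a sketch using Reactor Netty's ConnectionProvider (the 45-second eviction matches the description above; the pool name and the rest are illustrative, not our exact config):

import java.time.Duration;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;

public class PooledClientConfig {

    public WebClient pooledWebClient(WebClient.Builder builder) {
        ConnectionProvider provider = ConnectionProvider.builder("custom-pool")
                .lifo()                                       // last-in-first-out connection reuse
                .evictInBackground(Duration.ofSeconds(45))    // background eviction interval
                .build();
        return builder
                .clientConnector(new ReactorClientHttpConnector(HttpClient.create(provider)))
                .build();
    }
}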

Best,

Konstantin

[Screenshot: heap dump]

bclozel (Member) commented Feb 29, 2024

Can you provide a minimal sample application that demonstrates the problem?

You seem to imply that the connection pool does not release the connection but this is not managed by Spring so I don't see the link. I think that the leak might already be there and that reverting the Spring version might just make the problem less visible.

I'll revisit this anyway to check whether we can improve things. But just so you know, we have not found any proof of a leak in Spring Framework so far.

konner91 commented:

Unfortunately, I was not able to reproduce it locally, but I keep trying. Do you know whether there is a place where the request attributes are cleared before the connection is released to the pool?

bclozel (Member) commented Feb 29, 2024

The request attributes were never cleared, as the request object instance gets GC'd.

Since this bothers people that much, we might remove the observation as a request attribute and just let people get it from the observation context. This probably won't solve the actual memory management issues, but it might make them less visible.
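For context, a sketch of what reading that attribute looks like today, e.g. from an ExchangeFilterFunction (illustrative only; the attribute key is the one shown earlier in this thread, the class and method names are made up):

import reactor.core.publisher.Mono;

import org.springframework.web.reactive.function.client.ClientRequestObservationContext;
import org.springframework.web.reactive.function.client.ExchangeFilterFunction;

public class ObservationContextAccess {

    // A filter that reads the observation context back from the outgoing request's attributes
    static ExchangeFilterFunction logObservationContext() {
        return ExchangeFilterFunction.ofRequestProcessor(request -> {
            request.attribute(ClientRequestObservationContext.CURRENT_OBSERVATION_CONTEXT_ATTRIBUTE)
                    .ifPresent(context -> System.out.println("observation context: " + context));
            return Mono.just(request);
        });
    }
}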

@snicoll snicoll added status: waiting-for-triage An issue we've not yet triaged or decided on and removed status: invalid An issue that we don't feel is valid labels Mar 12, 2024
@bclozel bclozel added type: bug A general bug and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Mar 12, 2024
@bclozel bclozel added this to the 6.1.5 milestone Mar 12, 2024
@bclozel bclozel changed the title from "Memory leak ClientRequestObservationContext when bumping spring-webflux 6.1.1 -> 6.1.2" to "Remove Observation.Context from WebClient request" Mar 12, 2024