High percent of failures (timeout) under load from server-side WebClient requests [SPR-15584] #20143

Closed
spring-projects-issues opened this issue May 24, 2017 · 18 comments
Labels
type: bug A general bug

Comments

@spring-projects-issues

spring-projects-issues commented May 24, 2017

Jakub Rutkowski opened SPR-15584 and commented

I have just tested some blocking and non-blocking solutions in a simple, common scenario, using a sample PoC project.

Scenario:

  • There is a blocking REST endpoint that is quite slow: each request takes 200 ms.
  • There is another application, the client, which calls this slow endpoint.
    I have tested the current (blocking) Spring Boot client (Tomcat), Spring Boot 2.0 (Netty) with WebFlux WebClient, Ratpack, and Lagom. In each case I stressed the client application with a simple Gatling test scenario (100-1000 users / second).

I tested Ratpack and Lagom as reference non-blocking I/O servers, to compare their results with Spring Boot (blocking and non-blocking).

In all cases I got the expected results, except for the Spring Boot 2.0 test. It works only at low load levels, and even then with high latency. As the load level rises, all requests time out.
(see attachments)

WebClient usage:

@RestController
public class NonBlockingClientController {
    private WebClient client = WebClient.create("http://localhost:9000");

    @GetMapping("/client")
    public Mono<String> getData() {
        return client.get()
                .uri("/routing")
                .accept(TEXT_PLAIN)
                .exchange().timeout(Duration.ofSeconds(30))
                .flatMap(clientResponse -> clientResponse.bodyToMono(String.class));
    }
}

I have no idea what is going wrong, or whether the current M1 version simply behaves this way.

All sources published at https://github.com/rutkowskij/blocking-non-blocking-poc

blocking-service - slow blocking endpoint
non-blocking-client - Spring Boot 2.0M1 and WebClient based client

I have asked for this problem on


Affects: 5.0 RC1, 5.0 RC2, 5.0 RC3

Attachments:

Issue Links:

0 votes, 5 watchers

@spring-projects-issues

Jakub Rutkowski commented

I've tested on Spring Boot 2.0.0.M2 (spring-webflux, spring-core 5.0.0.RC2) and the issue still exists (there is minimal progress, but requests are still failing); see attachment.

@spring-projects-issues

Rossen Stoyanchev commented

I've updated the title (originally "Spring WebFlux WebClient resilience and performance") to reflect the concrete issue to investigate.

The larger question of resilience and performance is valid too but we can't discuss much until we figure out the cause for the high failure count.

@spring-projects-issues

Rossen Stoyanchev commented

Can you please provide basic instructions for your test repository? Also, the sample is currently at Boot 2.0 M1 (RC1), and a lot has happened since (we're at RC3 as of today).

@spring-projects-issues

Jakub Rutkowski commented

I have updated the dependencies to Boot 2.0 M2 and pushed the changes.

Steps to reproduce:

  1. Run blocking-service/BlockingServiceApplication (it exposes the http://localhost:9000/routing endpoint, which sleeps 200 ms on each request)
  2. Run non-blocking-client/NonBlockingClientApplication (it exposes the http://localhost:8000/client endpoint, which calls the blocking service above)
  3. Run the Gatling test: gatling-load-tests/mvn gatling:test

After the test scenario finishes (~2 min), a test report is generated:

Please open the following file: PATH_TO_REPORT
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
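
For quick local experiments, the slow endpoint from step 1 can be approximated with JDK classes alone. The following is only a minimal sketch, not the Spring Boot application from the repository: the class name `SlowEndpointSketch`, its helper methods, the response body, and the pool size of 50 are all assumptions made for illustration. Only the `/routing` path and the 200 ms delay come from the steps above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

// Hypothetical stand-in for blocking-service; names and sizes are illustrative only.
final class SlowEndpointSketch {

    // Starts a server whose /routing handler blocks for delayMillis before replying.
    // Pass 9000 to mirror the thread, or 0 to let the OS pick a free port.
    static HttpServer start(int port, long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/routing", exchange -> {
            try {
                Thread.sleep(delayMillis); // simulate slow, blocking work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        // Daemon worker threads, so a sleeping request does not block other
        // requests and the JVM can still exit after the server is stopped.
        server.setExecutor(Executors.newFixedThreadPool(50, runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        }));
        server.start();
        return server;
    }

    // Sends one GET to /routing and returns the elapsed time in milliseconds.
    static long timeOneRequest(int port) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://localhost:" + port + "/routing"))
                .build();
        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("unexpected status " + response.statusCode());
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Starts the server on a free port, times a single request, then stops the server.
    static long measureOnce(long delayMillis) {
        try {
            HttpServer server = start(0, delayMillis);
            try {
                return timeOneRequest(server.getAddress().getPort());
            } finally {
                server.stop(0);
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("elapsed ms: " + measureOnce(200));
    }
}
```

A single request should take a little over 200 ms. This stand-in says nothing about how the real service behaves under Gatling load; it only reproduces the fixed per-request delay.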

@spring-projects-issues

Jakub Rutkowski commented

I saw that the new RC3 was released today, so I wanted to test on it, but I need to wait for Boot 2.0 M3 (which will use Spring Framework RC3, I guess).

@spring-projects-issues

Jakub Rutkowski commented

I have just tested on Boot 2.0.0-SNAPSHOT, which uses RC3, and it still happens.
There are many TimeoutExceptions in the logs, and some log entries reference reactor/reactor-netty#138.

@spring-projects-issues

Jakub Rutkowski commented

I've attached a screenshot of resource utilization during the test. In my opinion, it looks like a fixed thread pool is causing starvation... like in the classic blocking approach.

@spring-projects-issues

Rossen Stoyanchev commented

Sorry but what are you basing that opinion on?

@spring-projects-issues

Rossen Stoyanchev commented

> There are many TimeoutExceptions in the logs, and some log entries reference reactor/reactor-netty#138.

Okay so that is a more likely explanation for the error count. We need to have that fixed first.

@spring-projects-issues

Jakub Rutkowski commented

> Sorry but what are you basing that opinion on?

  • It looks like the default Netty thread pool contains approximately 10 threads
  • It works fine for single requests and low load
  • There is no high CPU load
  • There are many timeouts when the load rises. The results are what you would expect if you ran the same test on, for example, Tomcat with a pool of 10 threads
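
The comparison with a 10-thread Tomcat pool rests on simple arithmetic, which can be sketched as below. This is only an illustration under one stated assumption: every request pins a thread for the full 200 ms, which is the classic blocking model and not necessarily what happens in the WebFlux client. The class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper; it models only the "one thread per in-flight request" case.
final class PoolThroughputSketch {

    // Upper bound on requests per second when each request holds one thread
    // for its whole latency.
    static double maxBlockingThroughput(int threads, Duration latencyPerRequest) {
        return threads * 1_000_000_000.0 / latencyPerRequest.toNanos();
    }

    // Threads needed to sustain a target rate under the same assumption
    // (Little's law: in-flight requests = arrival rate * latency).
    static long threadsNeeded(double requestsPerSecond, Duration latencyPerRequest) {
        return (long) Math.ceil(requestsPerSecond * latencyPerRequest.toNanos() / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        Duration latency = Duration.ofMillis(200);
        System.out.println(maxBlockingThroughput(10, latency)); // prints 50.0
        System.out.println(threadsNeeded(1000, latency));       // prints 200
    }
}
```

Under that assumption, 10 threads cap out at 50 requests per second, far below the 100-1000 users per second in the test. Whether the assumption applies to the non-blocking client is exactly what the thread goes on to question.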

@spring-projects-issues

Rossen Stoyanchev commented

The problem is not with the thread pool size, and with non-blocking code there is no need for extra threads to handle concurrency. There is something else at play here.
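
The point about threads can be illustrated with a JDK-only sketch. It uses a single scheduler thread and timers as a stand-in for non-blocking I/O, so it is not Netty or Reactor code, and the class and method names are hypothetical. What it shows is that many concurrent waits can be in flight while only one thread does any work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: timers stand in for non-blocking network waits.
final class NonBlockingWaitSketch {

    // Starts `requests` simulated calls that each complete after delayMillis,
    // all driven by ONE thread, and returns how many of them completed.
    static int runConcurrentWaits(int requests, long delayMillis) {
        ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger completed = new AtomicInteger();
        try {
            CompletableFuture<?>[] pending = new CompletableFuture<?>[requests];
            for (int i = 0; i < requests; i++) {
                CompletableFuture<Void> done = new CompletableFuture<>();
                // Registering a timer returns immediately; no thread sleeps per request.
                loop.schedule(() -> {
                    completed.incrementAndGet();
                    done.complete(null);
                }, delayMillis, TimeUnit.MILLISECONDS);
                pending[i] = done;
            }
            // Wait until every simulated call has completed.
            CompletableFuture.allOf(pending).join();
            return completed.get();
        } finally {
            loop.shutdown();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int completed = runConcurrentWaits(1000, 200);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(completed + " waits finished in about " + elapsedMs + " ms on one thread");
    }
}
```

On a typical machine the 1000 waits finish in roughly the time of one, because the waiting itself costs no thread. A blocking design would need one thread per in-flight request to achieve the same.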

I've been able to reproduce the problem when the load goes up to 1000 concurrent users (works with 500). When the load goes high enough, initially I see a few "Connection reset by peer" exceptions, then a few seconds later a flood of timeouts. Superficial observation is that something goes wrong and then all requests begin to time out. I can also confirm that with httpClientOptions.disablePool() it runs successfully at half the throughput and that a Servlet / Spring MVC server runs without any issues.

We'll probably need to wait for the investigation of reactor/reactor-netty#138. Either way the fact that disabling the connection pool makes a difference points strongly to an issue at the level of the Reactor Netty client (/cc smaldini, Violeta Georgieva).

Note also that testing a scenario like this with 3 tiers on a single machine is likely to lead to strange issues. That said, there is likely something more going on here, so I'm scheduling this for resolution one way or another.

@spring-projects-issues

Brian Clozel commented

Hello Jakub Rutkowski

Violeta Georgieva has changed a few things around the connection pool configuration, and some related issues are gone.
Could you rerun your benchmark to compare?

For that, you'll need to be on 100% SNAPSHOTs (living dangerously):

  • use Spring Boot 2.0.0.BUILD-SNAPSHOT
  • in your pom.xml, override two maven properties with <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version> and <reactor.version>Bismuth.BUILD-SNAPSHOT</reactor.version>
  • in case your app is reporting strange ClassNotFoundExceptions, don't hesitate to clean your snapshots with mvn dependency:purge-local-repository

Let us know if you don't have time - the next Framework Milestone is around the corner.

Thanks!

@spring-projects-issues

Jakub Rutkowski commented

Hi Brian
I've run the test again on the snapshot version, but the results are the same as before.
There are still a lot of exceptions that reference reactor/reactor-netty#138.

snapshot libraries:
spring-boot-2.0.0.BUILD-20170905.071138-928.jar
spring-context-5.0.0.BUILD-20170904.142806-524.jar

@spring-projects-issues

Brian Clozel commented

Thanks a lot Jakub Rutkowski, this really helps.

@spring-projects-issues

Rossen Stoyanchev commented

Note that we ran into some issues with reactor-netty not being fully up-to-date with the latest reactor-core. This was just fixed and it might have impacted the testing. We can also give it another try on our side as well.

@spring-projects-issues

Rossen Stoyanchev commented

Using the latest snapshot in non-blocking-client and running the performance test, I no longer get any errors.

@spring-projects-issues

Jakub Rutkowski commented

I confirm: it looks like everything is OK now (see attachment).

@spring-projects-issues

Brian Clozel commented

Nice! I'm closing this issue now - we can still improve performance overall, but this particular problem is now gone.

Don't hesitate to keep an eye on your benchmarks and let us know - this is really useful.
Thanks Jakub Rutkowski for all the hard work.
