High percent of failures (timeout) under load from server-side WebClient requests [SPR-15584] #20143

spring-projects-issues · 2017-05-24T12:12:53Z

Jakub Rutkowski opened SPR-15584 and commented

I have used a sample PoC project to test several blocking and non-blocking solutions in a simple, common scenario.

Scenario:

There is a blocking REST endpoint that is quite slow: each request takes 200 ms.
There is a second, client application that calls this slow endpoint.
I have tested the current (blocking) Spring Boot client (Tomcat), Spring Boot 2.0 (Netty) with WebFlux WebClient, Ratpack, and Lagom. In each case I stressed the client application with a simple Gatling scenario (100-1000 users / second).

I tested Ratpack and Lagom as reference non-blocking I/O servers, to compare their results with Spring Boot (blocking and non-blocking).

In all cases the results are as expected, except for the Spring Boot 2.0 test. It works only at low load levels, and even then with high latency. As the load rises, all requests time out.
(see attachments)

WebClient usage :

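import static org.springframework.http.MediaType.TEXT_PLAIN;

import java.time.Duration;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@RestController
public class NonBlockingClientController {
    // Client for the slow blocking service described in the scenario
    private final WebClient client = WebClient.create("http://localhost:9000");

    @GetMapping("/client")
    public Mono<String> getData() {
        return client.get()
                .uri("/routing")
                .accept(TEXT_PLAIN)
                // The timeout applies to the exchange() Mono, not to reading the body in flatMap
                .exchange().timeout(Duration.ofSeconds(30))
                .flatMap(clientResponse -> clientResponse.bodyToMono(String.class));
    }
}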

I have no idea whether something is wrong in my setup or whether the current M1 version simply behaves this way.

All sources published at https://github.com/rutkowskij/blocking-non-blocking-poc

blocking-service - slow blocking endpoint
non-blocking-client - Spring Boot 2.0M1 and WebClient based client

I have asked about this problem on

StackOverflow - https://stackoverflow.com/questions/43128467/spring-webflux-webclient-resilience-and-performance
SpringBoot Gitter
but nobody has answered.

Affects: 5.0 RC1, 5.0 RC2, 5.0 RC3

Attachments:

springBoot2.png (63.85 kB)
springBoot2M2-Netty.png (67.49 kB)
SpringBoot2-Netty.png (65.16 kB)
spring-boot2-nonblocking-poc-resources.png (58.22 kB)
SpringBootRatpack.png (61.87 kB)
SpringBoot-Tomcat-1000threads.png (60.87 kB)

Issue Links:

Spring webflux app consumes more resources than non-reactive equivalent app implementation [SPR-15783] #20338

0 votes, 5 watchers

spring-projects-issues · 2017-06-19T11:03:14Z

Jakub Rutkowski commented

I've tested on Spring Boot 2.0.0.M2 (spring-webflux, spring-core 5.0.0.RC2) and the issue still exists (there is minimal progress, but requests are still failing); see attachment.

spring-projects-issues · 2017-07-24T13:31:46Z

Rossen Stoyanchev commented

I've updated the title (originally "Spring WebFlux WebClient resilience and performance") to reflect the concrete issue to investigate.

The larger question of resilience and performance is valid too but we can't discuss much until we figure out the cause for the high failure count.

spring-projects-issues · 2017-07-24T13:39:41Z

Rossen Stoyanchev commented

Can you please provide basic instructions for your test repository? Also the sample is currently at Boot 2.0 M1 (RC1) and a lot has happened since (we're RC3 as of today).

spring-projects-issues · 2017-07-24T14:10:23Z

Jakub Rutkowski commented

I have updated and pushed dependencies to Boot 2.0 M2

Steps to reproduce:

Run blocking-service/BlockingServiceApplication (it exposes the http://localhost:9000/routing endpoint, which sleeps 200 ms on each request)
Run non-blocking-client/NonBlockingClientApplication (it exposes the http://localhost:8000/client endpoint, which calls the blocking service above)
Run the Gatling test: in gatling-load-tests, run mvn gatling:test

After the test scenario finishes (~2 min), a test report is generated:

Please open the following file: PATH_TO_REPORT
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
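For readers without the repository at hand, the slow endpoint from step 1 can be approximated with the JDK's built-in HTTP server. This is only a sketch under stated assumptions, not the code from blocking-service: the class name `SlowEndpointSketch`, the `fetch` helper, the response body "done", and the worker-thread count are all made up for illustration; only the /routing path, port 9000, and the 200 ms delay come from the steps above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

final class SlowEndpointSketch {

    /** Starts a server whose /routing handler blocks for delayMillis. Port 0 picks a free port. */
    static HttpServer start(int port, long delayMillis, int workerThreads) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
            // Daemon workers so that server.stop(0) is enough to let the JVM exit.
            server.setExecutor(Executors.newFixedThreadPool(workerThreads, r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            }));
            server.createContext("/routing", exchange -> {
                try {
                    Thread.sleep(delayMillis); // the worker thread is held for the whole delay
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "done".getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "text/plain");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Blocking GET against the local /routing endpoint (hypothetical helper for a quick check). */
    static String fetch(int port) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create("http://localhost:" + port + "/routing").toURL().openConnection();
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = start(9000, 200, 200);
        System.out.println("Listening on " + server.getAddress());
    }
}
```

Because every request holds a worker thread for the full delay, the number of worker threads caps how many requests this stand-in can serve at once, which matters when interpreting load-test results against it.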

spring-projects-issues · 2017-07-24T14:14:49Z

Jakub Rutkowski commented

I saw that the new RC3 was released today, so I wanted to test against it, but I need to wait for Boot 2.0 M3 (which will use Spring Framework RC3, I guess).

spring-projects-issues · 2017-07-24T14:28:24Z

Jakub Rutkowski commented

I have just tested on Boot 2.0.0-SNAPSHOT, which uses RC3, and it still happens.
There are many TimeoutExceptions in the logs, and some log entries contain a reference to reactor/reactor-netty#138.

spring-projects-issues · 2017-07-25T07:44:03Z

Jakub Rutkowski commented

I've attached a screenshot of resource utilization during the test. In my opinion it looks like a fixed thread pool is causing starvation, as in the classic blocking approach.

spring-projects-issues · 2017-07-25T08:00:26Z

Rossen Stoyanchev commented

Sorry but what are you basing that opinion on?

spring-projects-issues · 2017-07-25T08:01:42Z

Rossen Stoyanchev commented

> There are many TimeoutExceptions in logs and some logs contains reference to reactor/reactor-netty#138

Okay so that is a more likely explanation for the error count. We need to have that fixed first.

spring-projects-issues · 2017-07-25T08:16:48Z

Jakub Rutkowski commented

> Sorry but what are you basing that opinion on?

It looks like the default Netty thread pool contains approximately 10 threads.
It works fine for single requests and low load.
There is no high CPU load.
There are many timeouts when the load rises. The results are what you would expect if you ran the same test on, for example, Tomcat with a pool of 10 threads.
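The comparison with a 10-thread blocking pool can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes every request holds one thread for the full 200 ms and that arrivals are steady, and the class and method names are invented for illustration.

```java
final class BlockingPoolEstimate {

    /** Upper bound on requests/second when each request holds one thread for serviceMillis. */
    static double maxThroughputPerSecond(int threads, long serviceMillis) {
        return threads * 1000.0 / serviceMillis;
    }

    /** Requests still waiting after `seconds` of steady arrivals; 0 if the pool keeps up. */
    static double backlogAfter(double arrivalsPerSecond, int threads, long serviceMillis, double seconds) {
        double surplus = arrivalsPerSecond - maxThroughputPerSecond(threads, serviceMillis);
        return Math.max(0.0, surplus * seconds);
    }

    public static void main(String[] args) {
        // 10 threads, 200 ms per request
        System.out.println(maxThroughputPerSecond(10, 200));  // prints 50.0
        // 1000 arrivals/second for 30 seconds against that pool
        System.out.println(backlogAfter(1000, 10, 200, 30));  // prints 28500.0
    }
}
```

Under these assumptions a blocking pool of 10 threads tops out at 50 requests per second, so a load of 1000 users per second builds a queue that no 30-second timeout can absorb. This describes how a blocking pool would behave; whether the same mechanism applies to the non-blocking client is exactly what the thread goes on to question.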

spring-projects-issues · 2017-07-31T09:22:48Z

Rossen Stoyanchev commented

The problem is not the thread pool size; with non-blocking code there is no need for extra threads to handle concurrency. There is something else at play here.

I've been able to reproduce the problem when the load goes up to 1000 concurrent users (works with 500). When the load goes high enough, initially I see a few "Connection reset by peer" exceptions, then a few seconds later a flood of timeouts. Superficial observation is that something goes wrong and then all requests begin to time out. I can also confirm that with httpClientOptions.disablePool() it runs successfully at half the throughput and that a Servlet / Spring MVC server runs without any issues.

We'll probably need to wait for the investigation of reactor/reactor-netty#138. Either way the fact that disabling the connection pool makes a difference points strongly to an issue at the level of the Reactor Netty client (/cc smaldini, Violeta Georgieva).

Note also that testing a scenario like this with 3 tiers on a single machine is likely to lead to strange issues. That said, there is likely something more going on here, so I'm scheduling this for resolution one way or another.
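The point that non-blocking code needs no extra threads for concurrency can be illustrated with plain JDK classes. This is a simplified stand-in and not how Reactor Netty is implemented: a single-threaded scheduler plays the role of the event loop, a timer plays the role of the remote service answering 200 ms later, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class SingleThreadConcurrency {

    /** Starts `requests` simulated calls that each complete after delayMillis; returns elapsed ms. */
    static long runSimulatedCalls(int requests, long delayMillis) {
        // One thread handles every completion; nothing sleeps while waiting.
        ScheduledExecutorService eventLoop = Executors.newSingleThreadScheduledExecutor();
        try {
            long start = System.nanoTime();
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                CompletableFuture<String> response = new CompletableFuture<>();
                // The "remote service" answers later; the thread is free in the meantime.
                eventLoop.schedule(() -> { response.complete("ok"); }, delayMillis, TimeUnit.MILLISECONDS);
                pending.add(response);
            }
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
                    .get(30, TimeUnit.SECONDS);
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (Exception e) {
            throw new IllegalStateException("simulated calls did not complete", e);
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long elapsed = runSimulatedCalls(1000, 200);
        System.out.println("1000 concurrent 200 ms waits took " + elapsed + " ms on one thread");
    }
}
```

With blocking calls, 1000 waits of 200 ms on one thread would take 200 seconds; here they overlap, so the total stays close to a single delay. That is why a small event-loop pool by itself does not explain the timeouts.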

spring-projects-issues · 2017-09-04T19:24:57Z

Brian Clozel commented

Hello Jakub Rutkowski

Violeta Georgieva has changed a few things around the connection pool configuration, and some related issues are gone.
Could you rerun your benchmark to compare?

For that, you'll need to be on 100% SNAPSHOTs (living dangerously):

use Spring Boot 2.0.0.BUILD-SNAPSHOT
in your pom.xml, override two maven properties with <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version> and <reactor.version>Bismuth.BUILD-SNAPSHOT</reactor.version>
in case your app is reporting strange ClassNotFoundExceptions, don't hesitate to clean your snapshots with mvn dependency:purge-local-repository

Let us know if you don't have time - the next Framework Milestone is around the corner.

Thanks!

spring-projects-issues · 2017-09-05T11:49:07Z

Jakub Rutkowski commented

Hi Brian
I've run the test again on the snapshot versions, but the results are the same as before.
There are still a lot of exceptions referencing reactor/reactor-netty#138.

snapshot libraries:
spring-boot-2.0.0.BUILD-20170905.071138-928.jar
spring-context-5.0.0.BUILD-20170904.142806-524.jar

spring-projects-issues · 2017-09-05T11:53:42Z

Brian Clozel commented

Thanks a lot Jakub Rutkowski, this really helps.

spring-projects-issues · 2017-09-05T20:48:50Z

Rossen Stoyanchev commented

Note that we ran into some issues with reactor-netty not being fully up-to-date with the latest reactor-core. This was just fixed and it might have impacted the testing. We can also give it another try on our side as well.

spring-projects-issues · 2017-09-06T18:30:53Z

Rossen Stoyanchev commented

Using the latest snapshot in non-blocking-client and running the performance test, I no longer get any errors.

spring-projects-issues · 2017-09-08T07:52:01Z

Jakub Rutkowski commented

I can confirm: everything looks OK now (see attachment).

spring-projects-issues · 2017-09-08T08:22:58Z

Brian Clozel commented

Nice! I'm closing this issue now - we can still improve performance overall, but this particular problem is now gone.

Don't hesitate to keep an eye on your benchmarks and let us know - this is really useful.
Thanks Jakub Rutkowski for all the hard work.

spring-projects-issues closed this as completed Sep 11, 2017

spring-projects-issues added the type: bug label Jan 11, 2019

spring-projects-issues added this to the 5.0 RC4 milestone Jan 11, 2019

spring-projects-issues assigned bclozel Jan 11, 2019

This was referenced Jan 11, 2019

Spring webflux app consumes more resources than non-reactive equivalent app implementation [SPR-15783] #20338

Closed

BLOCKED "reactor-http-nio-*" threads under load [SPR-15874] #20429

Closed
