High percent of failures (timeout) under load from server-side WebClient requests [SPR-15584] #20143
Comments
Jakub Rutkowski commented I've tested on Spring Boot 2.0.0.M2 (spring-webflux, spring-core 5.0.0.RC2) and the issue still exists (there is minimal progress, but requests are still failing) - see attachment
Rossen Stoyanchev commented I've updated the title (originally "Spring WebFlux WebClient resilience and performance") to reflect the concrete issue to investigate. The larger question of resilience and performance is valid too, but we can't discuss it much until we figure out the cause of the high failure count.
Rossen Stoyanchev commented Can you please provide basic instructions for your test repository? Also, the sample is currently at Boot 2.0 M1 (RC1) and a lot has happened since (we're at RC3 as of today).
Jakub Rutkowski commented I have updated and pushed the dependencies to Boot 2.0 M2. Steps to reproduce:
After the test scenario (~2 min) you have a generated test report: "Please open the following file: PATH_TO_REPORT"
Jakub Rutkowski commented I have seen that a new RC3 was released today, so I wanted to test on it, but I need to wait for Boot 2.0 M3 (which will use SF RC3, I guess)
Jakub Rutkowski commented I have just tested on Boot 2.0.0-SNAPSHOT, which uses RC3 - and it still happens.
Jakub Rutkowski commented I've attached a screenshot of resource utilization during the test. In my opinion it seems like the fixed thread pool causes starvation... like in the classic blocking approach
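The starvation hypothesis above is easy to demonstrate with plain blocking code (this is a sketch of the general effect with illustrative pool and task sizes, not a claim about what is actually happening in this issue): when blocking tasks outnumber the fixed pool's threads, the excess tasks queue and total latency grows in steps.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolStarvationDemo {

    // Runs `tasks` blocking calls (each sleeping `taskMillis`) on a pool of
    // `threads` threads and returns total elapsed wall-clock time in millis.
    // With tasks > threads, excess tasks wait in the queue, so total time
    // grows in "waves" of taskMillis - classic blocking-IO starvation.
    public static long elapsedMillis(int threads, int tasks, long taskMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(tasks);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(taskMillis); // stands in for a slow blocking HTTP call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // 4 tasks of 100 ms on 2 threads need two "waves", so >= ~200 ms total.
        System.out.println("elapsed: " + elapsedMillis(2, 4, 100) + " ms");
    }
}
```

Whether this mechanism explains the failures observed here is exactly what the discussion that follows digs into.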
Rossen Stoyanchev commented Sorry, but what are you basing that opinion on?
Rossen Stoyanchev commented Okay, so that is a more likely explanation for the error count. We need to have that fixed first.
Jakub Rutkowski commented
Rossen Stoyanchev commented The problem is not with the thread pool size; with non-blocking code there is no need for extra threads to handle concurrency. There is something else at play here. I've been able to reproduce the problem when the load goes up to 1000 concurrent users (it works with 500). When the load goes high enough, initially I see a few "Connection reset by peer" exceptions, then a few seconds later a flood of timeouts. The superficial observation is that something goes wrong and then all requests begin to time out. I can also confirm that with the connection pool disabled the errors go away. We'll probably need to wait for the investigation of reactor/reactor-netty#138. Either way, the fact that disabling the connection pool makes a difference points strongly to an issue at the level of the Reactor Netty client (/cc smaldini, Violeta Georgieva). Note also that testing a scenario like this with 3 tiers on a single machine is likely to lead to strange issues. That said, there is likely something more going on here, so I'm scheduling this for resolution one way or another.
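For reference, an A/B test like the one described (pooled vs. non-pooled connections) could be set up when building the WebClient. This is a rough sketch against the Spring 5.0 / reactor-netty 0.6-era connector API; the `disablePool()` builder option and the consumer-based constructor are assumptions about that milestone's exact API, and the snippet needs spring-webflux and reactor-netty on the classpath.

```java
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;

public class NoPoolWebClient {

    // Builds a WebClient whose Reactor Netty connector opens a fresh
    // connection per request instead of using the shared pool - useful
    // for isolating pool-related failures like the ones described above.
    public static WebClient create(String baseUrl) {
        ReactorClientHttpConnector connector =
                new ReactorClientHttpConnector(options -> options.disablePool());
        return WebClient.builder()
                .baseUrl(baseUrl)
                .clientConnector(connector)
                .build();
    }
}
```

Comparing error rates with and without this connector under the same Gatling load is what separates "pool bug" from "client can't keep up".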
Brian Clozel commented Hello Jakub Rutkowski, Violeta Georgieva has changed a few things around the connection pool configuration, and some related issues are gone. For that, you'll need to be on 100% SNAPSHOTs (living dangerously):
Let us know if you don't have time - the next Framework milestone is around the corner. Thanks!
Jakub Rutkowski commented Hi Brian, snapshot libraries:
Brian Clozel commented Thanks a lot Jakub Rutkowski, this really helps.
Rossen Stoyanchev commented Note that we ran into some issues with reactor-netty not being fully up to date with the latest reactor-core. This was just fixed, and it might have impacted the testing. We can give it another try on our side as well.
Rossen Stoyanchev commented Using the latest snapshot in non-blocking-client and running the performance test, I no longer get any errors.
Jakub Rutkowski commented I confirm - it looks like everything is OK now (see attachment).
Brian Clozel commented Nice! I'm closing this issue now - we can still improve performance overall, but this particular problem is now gone. Don't hesitate to keep an eye on your benchmarks and let us know - this is really useful.
Jakub Rutkowski opened SPR-15584 and commented
I have just tested, with a sample PoC project, some blocking / non-blocking solutions in a simple, common scenario.
Scenario:
I have tested the current (blocking) Spring Boot client (Tomcat), Spring Boot 2.0 (Netty) with WebFlux / WebClient, Ratpack, and Lagom. In each case I have stressed the client application with a simple Gatling test scenario (100-1000 users / second).
I have tested Ratpack and Lagom as reference non-blocking IO servers to compare results against Spring Boot (blocking and non-blocking).
In all cases I got results as expected, except for the Spring Boot 2.0 test. It works only at small load levels, and even then with high latency. If the load level rises, all requests time out.
(see attachments)
WebClient usage:
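The code snippet that originally appeared here did not survive extraction. As a stand-in, here is a minimal sketch of the kind of WebClient call being benchmarked; the `/slow` path and class name are hypothetical (the real code is in the linked repository), and the snippet needs spring-webflux and reactor-core on the classpath.

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class SlowServiceClient {

    private final WebClient webClient;

    public SlowServiceClient(String baseUrl) {
        this.webClient = WebClient.create(baseUrl);
    }

    // A non-blocking GET against the slow blocking-service endpoint.
    // The returned Mono completes when the remote response body arrives;
    // no thread is held while waiting.
    public Mono<String> fetchSlow() {
        return webClient.get()
                .uri("/slow") // hypothetical path for illustration
                .retrieve()
                .bodyToMono(String.class);
    }
}
```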
I have no idea what goes wrong, or whether the current M1 version just works that way.
All sources published at https://github.com/rutkowskij/blocking-non-blocking-poc
blocking-service - slow blocking endpoint
non-blocking-client - Spring Boot 2.0M1 and WebClient based client
I have asked about this problem on
but nobody has answered.
Affects: 5.0 RC1, 5.0 RC2, 5.0 RC3
Attachments:
Issue Links: