Thread blocks and multiple vert.x context issues when using ksqldb in multi-node cluster #5600
Comments
I think the problem here is that a new client instance and a new Vertx instance are being created for every inter-node pull query request. Creating a whole client plus Vert.x is quite heavyweight, and it seems it can't keep up with the rate of requests. I think what we need to do is cache the client between subsequent requests. We should also reuse the Vert.x instance from KsqlRestApplication and pass that in when creating KsqlClient, instead of the client maintaining its own.

In more detail: for each forwarded pull query a new KsqlClient, and accordingly a new Vertx instance, is created. The Vertx instance owns a bunch of threads. The Vertx instance is closed when the request completes, but the close is asynchronous and the owned thread pools only get shut down some time later. It appears the system was being driven with a high level of concurrency when the issue happened, as implied by the large number of worker threads seen in the fastthread report. In this situation, if the Vert.x thread pools are closed more slowly than new requests are handled, the overall number of threads will increase without bound; a very large number of Vert.x event loop threads can be seen in the fastthread report. Eventually there are too many threads for the system to handle and everything dies.
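A minimal sketch of that fix, using Vert.x's generic WebClient as a stand-in for ksqlDB's KsqlClient (the holder class and names here are illustrative, not the actual ksqlDB API): create the Vert.x instance and the client once, and reuse them for every forwarded request.

```java
import io.vertx.core.Vertx;
import io.vertx.ext.web.client.WebClient;

// Illustrative sketch only: WebClient stands in for KsqlClient, and this
// holder class is hypothetical. The point is the lifecycle, not the API.
public final class SharedClientHolder {

  // Per-request Vertx.vertx() is the anti-pattern: each call spins up its
  // own event-loop and worker thread pools, and vertx.close() is
  // asynchronous, so under sustained load pools are released more slowly
  // than new ones are created and the thread count grows without bound.

  // The fix: one shared Vertx (e.g. the one owned by KsqlRestApplication)
  // and one cached client, reused across all forwarded pull queries.
  private static final Vertx VERTX = Vertx.vertx();
  private static final WebClient CLIENT = WebClient.create(VERTX);

  private SharedClientHolder() {
  }

  public static WebClient client() {
    return CLIENT;
  }
}
```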
Could you tell me the level of concurrency (how many concurrent requests) you were driving the system at from the load test tool when this occurred?
Hey @purplefox, we tested with a minimum of 200 and a maximum of 6000 concurrency for the test.
Thanks, can you tell me how many connections you used?
Hey @purplefox, by our understanding the Locust client has a 1:1 concurrency-to-connection mapping, so the number of connections will be the same: 200-6000.
Describe the bug
The vert.x client created for inter-node communication in ksqlDB seems to have a leak, where KSQL creates multiple vert.x clients as more requests come in. We think the problem here is that a new client instance and a new Vertx instance are being created for every inter-node pull query request.
To Reproduce
Steps to reproduce the behavior, include:
SELECT * FROM table WHERE key=123;
Pull queries, run over a period of time and randomly going to one of the two servers, work initially; they are sent to the /query endpoint (an example request is shown below).
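For reference, a request of this shape can be sent with curl against the /query REST endpoint (host and port are placeholders; the table name matches the statement above):

```
curl -X "POST" "http://<ksqldb-host>:8088/query" \
     -H "Content-Type: application/vnd.ksql.v1+json; charset=utf-8" \
     -d '{
  "ksql": "SELECT * FROM table WHERE key=123;",
  "streamsProperties": {}
}'
```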
Expected behavior
Since the advertised.listeners property is set, both servers should be able to serve the requests with no errors. As ksql.query.pull.enable.standby.reads is disabled, in case the data is unavailable on that particular ksqlDB node, it should fetch the data from another node and serve it.
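For context, a hedged sketch of the relevant settings on each node (hostnames are placeholders, and the attached ksqldb-server.properties.txt below has the actual values; note that in recent ksqlDB versions the advertised listener config is spelled ksql.advertised.listener):

```
listeners=http://0.0.0.0:8088
ksql.advertised.listener=http://node-1.example.internal:8088

# Standby reads disabled: a node that does not host the data should
# forward the pull query to the active node instead of serving it locally.
ksql.query.pull.enable.standby.reads=false
```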
Actual behaviour
Using a ksqlDB Docker image built from the latest master branch, I see a good 2000-3000 rps when running a single-node ksqlDB cluster with a load tester. But running the same image as a multi-node ksqlDB cluster, with 2 or more nodes (separate EC2 servers, so no CPU hogging), I see the throughput drop within just 5-10 minutes, and ksqlDB gives the warnings and errors below.
This log starts as soon as KSQL is run on the second instance and has started serving requests.
5-10 minutes later,
Additional context
Here are the server settings from KSQL:
ksqldb-server.properties.txt
Thread dumps from KSQLDB:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMDYvMTEvLS1rc3FsLXRocmVhZGR1bXAudGFyLmd6LS0xMy0xNC01MA==&