
Thread blocks and multiple vert.x context issues when using ksqldb in multi-node cluster #5600

Closed
the-cybersapien opened this issue Jun 11, 2020 · 5 comments · Fixed by #5742
Labels: bug, P0 (denotes must-have for a given milestone)

@the-cybersapien

Describe the bug
The Vert.x client created for inter-node communication in ksqlDB appears to have a leak: ksqlDB creates more and more Vert.x clients as requests come in. We think the problem is that a new client instance and a new Vertx instance are being created for every inter-node pull query request.
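
A minimal sketch of the suspected pattern, with hypothetical names (this illustrates the behavior described above, not the actual ksqlDB code path):

import io.vertx.core.Vertx;

// Hypothetical sketch of the suspected leak: if every forwarded pull query
// runs something like this, each request spins up a fresh Vert.x instance
// (with its own event-loop and worker thread pools) just to make one call.
public final class PerRequestClientSketch {

  static void forwardPullQuery(String sql) {
    Vertx vertx = Vertx.vertx(); // heavyweight: allocates new thread pools
    try {
      // ... build an HTTP client from this vertx and forward the query ...
    } finally {
      vertx.close(); // asynchronous: the pools are torn down some time later
    }
  }
}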

To Reproduce
Steps to reproduce the behavior:

  1. Set up a ksqlDB server on a single node with the attached settings.
  2. Add another server to the ksqlDB cluster with the correct advertised listener. Let the server become ready to serve requests.
  3. Start a load test using a load-testing tool such as Locust or JMeter. Any simple pull query, such as SELECT * FROM table WHERE key=123;, sent to the /query endpoint and routed randomly to one of the two servers over a period of time, will do (see the request sketch after this list).
  4. As soon as queries start, the “You’re already on a Vert.x context” logs start coming in.
  5. After a while (depending on instance size), the thread block issues start appearing as well. We began noticing them within 15 minutes.
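
For step 3, a single request of the shape the load test sends can be issued with plain Java; the host name below is a placeholder for one of the two nodes, and the body assumes the standard JSON payload for the /query endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sends one pull query to a ksqlDB node's /query endpoint, the same shape
// of request the load test issues at high concurrency.
public final class PullQueryProbe {

  public static void main(String[] args) throws Exception {
    String body =
        "{\"ksql\": \"SELECT * FROM table WHERE key=123;\", \"streamsProperties\": {}}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://ksqldb-node-1:8088/query")) // placeholder host
        .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + ": " + response.body());
  }
}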

Expected behavior
Since the advertised listener property is set, both servers should be able to serve requests with no errors. As ksql.query.pull.enable.standby.reads is disabled, if the data is unavailable on the ksqlDB node that receives the query, that node should fetch the data from the node that has it and serve the response.
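
For reference, a minimal sketch of the settings involved; the addresses are placeholders, and the exact property names should be checked against the attached ksqldb-server.properties.txt:

# Each node advertises an address the other nodes can reach it on
listeners=http://0.0.0.0:8088
ksql.advertised.listener=http://node1.internal:8088

# Standby reads disabled: misses must be forwarded to the active host
ksql.query.pull.enable.standby.reads=false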

Actual behavior

Using a ksqlDB Docker image built from the latest master branch, I see a good 2000-3000 rps when running a single-node ksqlDB cluster with a load tester. But running the same image in a multi-node ksqlDB cluster with 2 or more nodes (separate EC2 servers, so no CPU hogging), throughput drops within just 5-10 minutes and ksqlDB emits the warnings and errors below.
These logs start as soon as ksqlDB is run on the second instance and has started serving requests.

Jun 11, 2020 6:28:14 AM io.vertx.core.impl.VertxImpl
WARNING: You're already on a Vert.x context, are you sure you want to create a new Vertx instance?
Jun 11, 2020 6:28:14 AM io.vertx.core.impl.VertxImpl
WARNING: You're already on a Vert.x context, are you sure you want to create a new Vertx instance?
Jun 11, 2020 6:28:14 AM io.vertx.core.impl.VertxImpl
WARNING: You're already on a Vert.x context, are you sure you want to create a new Vertx instance?

5-10 minutes later,

Jun 11, 2020 6:34:30 AM io.vertx.core.impl.BlockedThreadChecker
WARNING: Thread Thread[vert.x-eventloop-thread-5,5,main]=Thread[vert.x-eventloop-thread-5,5,main] has been blocked for 14128 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at io.netty.util.concurrent.ThreadPerTaskExecutor.execute(ThreadPerTaskExecutor.java:32)
	at io.netty.util.internal.ThreadExecutorMap$1.execute(ThreadExecutorMap.java:57)
	at io.netty.util.concurrent.SingleThreadEventExecutor.doStartThread(SingleThreadEventExecutor.java:978)
	at io.netty.util.concurrent.SingleThreadEventExecutor.ensureThreadStarted(SingleThreadEventExecutor.java:961)
	at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:663)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163)
	at io.vertx.core.impl.VertxImpl.lambda$deleteCacheDirAndShutdown$29(VertxImpl.java:843)
	at io.vertx.core.impl.VertxImpl$$Lambda$1402/987370749.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:330)
	at io.vertx.core.impl.ContextImpl$$Lambda$1316/1941309664.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:369)
	at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38)
	at io.vertx.core.impl.EventLoopContext$$Lambda$382/28145535.run(Unknown Source)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

Vert.x threads go into a blocked state as soon as we add a second node to the ksqlDB cluster. The issue goes away on its own after removing the second instance and restarting the first one.

Additional context
Here are the server settings from KSQL:
ksqldb-server.properties.txt
Thread dumps from KSQLDB:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMDYvMTEvLS1rc3FsLXRocmVhZGR1bXAudGFyLmd6LS0xMy0xNC01MA==&

@purplefox commented Jun 11, 2020

I think the problem here is that a new client instance and a new Vertx instance are being created for every inter-node pull query request. Creating a whole client plus a Vert.x instance is quite heavyweight, and it seems the server can't keep up with the rate of requests.

I think what we need to do is cache the client between subsequent requests. We should also reuse the Vert.x instance from KsqlRestApplication and pass it in when creating the KsqlClient, instead of the client maintaining its own one.
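
A rough sketch of that shape, using Vert.x's WebClient as a stand-in for ksqlDB's internal KsqlClient (whose constructor isn't shown here):

import io.vertx.core.Vertx;
import io.vertx.ext.web.client.WebClient;

// Sketch of the proposed fix: one Vertx and one client, created at startup
// (in ksqlDB's case by KsqlRestApplication) and reused for every
// inter-node request, instead of one Vertx per forwarded pull query.
public class SharedClientSketch {

  private final Vertx vertx = Vertx.vertx();                // created once
  private final WebClient client = WebClient.create(vertx); // cached client

  WebClient clientForInterNodeRequests() {
    return client; // no per-request Vertx creation, no per-request close
  }

  void shutdown() {
    client.close();
    vertx.close(); // only on server shutdown
  }
}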

In more detail:

For each forwarded pull query, a new KsqlClient and accordingly a new Vertx instance is being created. The Vertx instance owns a bunch of threads. The Vertx instance is closed when the request completes, but the close is asynchronous and the owned thread pools are only shut down some time later.

It appears the system was being driven with a high level of concurrency when the issue happened, as implied by the large number of worker threads seen in fastthread. In this situation, if the Vert.x thread pools are closed more slowly than new requests are handled, the overall number of threads will increase without bound; a very large number of Vert.x event loop threads can be seen in fastthread. Eventually there are too many threads for the system to handle and everything dies.
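
The asynchronous teardown is visible in the Vert.x API itself: close() returns immediately, and the threads are only released once the completion handler fires.

import io.vertx.core.Vertx;

public final class AsyncCloseSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    // ... handle one forwarded request ...
    vertx.close(ar -> {
      // The event-loop and worker threads are released only around this
      // point, some time after close() returned. Under sustained load, new
      // requests may already have created more Vertx instances by now, so
      // the total thread count keeps growing.
      System.out.println("Vertx closed: " + ar.succeeded());
    });
  }
}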

@purplefox

Hi @the-cybersapien

Could you tell me the level of concurrency (how many concurrent requests) you were driving the system at from the load test tool when this occurred?

@the-cybersapien

Hey @purplefox

We tested with concurrency ranging from a minimum of 200 to a maximum of 6000.
Results were similar across the range, with latency increasing almost proportionally with concurrency.

@purplefox

Thanks, can you tell me how many connections you used?

@the-cybersapien

Hey @purplefox ,

By our understanding, the Locust client has a 1:1 mapping of concurrency to connections, so the number of connections was the same: 200-6000.
