
Thread blocks and multiple vert.x context issues when using ksqldb in multi-node cluster #5600

Closed
the-cybersapien opened this issue Jun 11, 2020 · 5 comments · Fixed by #5742
Labels: bug, P0 (denotes must-have for a given milestone)

@the-cybersapien

Describe the bug
The Vert.x client created for inter-node communication in ksqlDB appears to have a leak: ksqlDB creates more and more Vert.x clients as requests come in. We think the problem is that a new client instance and a new Vertx instance are being created for every inter-node pull query request.
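
A minimal sketch of the suspected pattern, with hypothetical names (this illustrates the behavior described above, not the actual ksqlDB code path):

import io.vertx.core.Vertx;

// Hypothetical sketch of the suspected leak: if every forwarded pull query
// runs something like this, each request spins up a fresh Vert.x instance
// (with its own event-loop and worker thread pools) just to make one call.
public final class PerRequestClientSketch {

  static void forwardPullQuery(String sql) {
    Vertx vertx = Vertx.vertx(); // heavyweight: allocates new thread pools
    try {
      // ... build an HTTP client from this vertx and forward the query ...
    } finally {
      vertx.close(); // asynchronous: the pools are torn down some time later
    }
  }
}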

To Reproduce
Steps to reproduce the behavior:

  1. Set up a ksqlDB server on a single node with the attached settings.
  2. Add another server to the ksqlDB cluster with the correct advertised listener. Let the server become ready to serve requests.
  3. Start a load test using a load-testing tool such as Locust or JMeter. Any simple pull query, such as SELECT * FROM table WHERE key=123;, sent to the /query endpoint and routed randomly to one of the two servers over a period of time, will do (see the request sketch after this list).
  4. As soon as queries start, the “You’re already on a Vert.x context” logs start coming in.
  5. After a while (depending on instance size), the thread block issues start appearing as well. We began noticing them within 15 minutes.
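
For step 3, a single request of the shape the load test sends can be issued with plain Java; the host name below is a placeholder for one of the two nodes, and the body assumes the standard JSON payload for the /query endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sends one pull query to a ksqlDB node's /query endpoint, the same shape
// of request the load test issues at high concurrency.
public final class PullQueryProbe {

  public static void main(String[] args) throws Exception {
    String body =
        "{\"ksql\": \"SELECT * FROM table WHERE key=123;\", \"streamsProperties\": {}}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://ksqldb-node-1:8088/query")) // placeholder host
        .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + ": " + response.body());
  }
}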

Expected behavior
Since the advertised listener property is set, both servers should be able to serve requests with no errors. As ksql.query.pull.enable.standby.reads is disabled, if the data is unavailable on the ksqlDB node that receives the query, that node should fetch the data from the node that has it and serve the response.
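
For reference, a minimal sketch of the settings involved; the addresses are placeholders, and the exact property names should be checked against the attached ksqldb-server.properties.txt:

# Each node advertises an address the other nodes can reach it on
listeners=http://0.0.0.0:8088
ksql.advertised.listener=http://node1.internal:8088

# Standby reads disabled: misses must be forwarded to the active host
ksql.query.pull.enable.standby.reads=false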

Actual behavior

Using a ksqlDB Docker image built from the latest master branch, I see a good 2000-3000 rps when running a single-node ksqlDB cluster with a load tester. But running the same image in a multi-node ksqlDB cluster with 2 or more nodes (separate EC2 servers, so no CPU hogging), throughput drops within just 5-10 minutes and ksqlDB emits the warnings and errors below.
These logs start as soon as ksqlDB is run on the second instance and has started serving requests.

Jun 11, 2020 6:28:14 AM io.vertx.core.impl.VertxImpl
WARNING: You're already on a Vert.x context, are you sure you want to create a new Vertx instance?
Jun 11, 2020 6:28:14 AM io.vertx.core.impl.VertxImpl
WARNING: You're already on a Vert.x context, are you sure you want to create a new Vertx instance?
Jun 11, 2020 6:28:14 AM io.vertx.core.impl.VertxImpl
WARNING: You're already on a Vert.x context, are you sure you want to create a new Vertx instance?

5-10 minutes later,

Jun 11, 2020 6:34:30 AM io.vertx.core.impl.BlockedThreadChecker
WARNING: Thread Thread[vert.x-eventloop-thread-5,5,main]=Thread[vert.x-eventloop-thread-5,5,main] has been blocked for 14128 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at io.netty.util.concurrent.ThreadPerTaskExecutor.execute(ThreadPerTaskExecutor.java:32)
	at io.netty.util.internal.ThreadExecutorMap$1.execute(ThreadExecutorMap.java:57)
	at io.netty.util.concurrent.SingleThreadEventExecutor.doStartThread(SingleThreadEventExecutor.java:978)
	at io.netty.util.concurrent.SingleThreadEventExecutor.ensureThreadStarted(SingleThreadEventExecutor.java:961)
	at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:663)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163)
	at io.vertx.core.impl.VertxImpl.lambda$deleteCacheDirAndShutdown$29(VertxImpl.java:843)
	at io.vertx.core.impl.VertxImpl$$Lambda$1402/987370749.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:330)
	at io.vertx.core.impl.ContextImpl$$Lambda$1316/1941309664.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:369)
	at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38)
	at io.vertx.core.impl.EventLoopContext$$Lambda$382/28145535.run(Unknown Source)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

Vert.x threads go into a blocked state as soon as we add a second node to the ksqlDB cluster. The issue goes away on its own after removing the second instance and restarting the first one.

Additional context
Here are the server settings from KSQL:
ksqldb-server.properties.txt
Thread dumps from KSQLDB:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMDYvMTEvLS1rc3FsLXRocmVhZGR1bXAudGFyLmd6LS0xMy0xNC01MA==&

@purplefox commented Jun 11, 2020

I think the problem here is that a new client instance and a new Vertx instance are being created for every inter-node pull query request. Creating a whole client plus a Vert.x instance is quite heavyweight, and it seems the server can't keep up with the rate of requests.

I think what we need to do is cache the client between subsequent requests. We should also reuse the Vert.x instance from KsqlRestApplication and pass it in when creating the KsqlClient, instead of the client maintaining its own one.
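
A rough sketch of that shape, using Vert.x's WebClient as a stand-in for ksqlDB's internal KsqlClient (whose constructor isn't shown here):

import io.vertx.core.Vertx;
import io.vertx.ext.web.client.WebClient;

// Sketch of the proposed fix: one Vertx and one client, created at startup
// (in ksqlDB's case by KsqlRestApplication) and reused for every
// inter-node request, instead of one Vertx per forwarded pull query.
public class SharedClientSketch {

  private final Vertx vertx = Vertx.vertx();                // created once
  private final WebClient client = WebClient.create(vertx); // cached client

  WebClient clientForInterNodeRequests() {
    return client; // no per-request Vertx creation, no per-request close
  }

  void shutdown() {
    client.close();
    vertx.close(); // only on server shutdown
  }
}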

In more detail:

For each forwarded pull query, a new KsqlClient and accordingly a new Vertx instance is being created. The Vertx instance owns a bunch of threads. The Vertx instance is closed when the request completes, but the close is asynchronous and the owned thread pools are only shut down some time later.

It appears the system was being driven with a high level of concurrency when the issue happened, as implied by the large number of worker threads seen in fastthread. In this situation, if the Vert.x thread pools are closed more slowly than new requests are handled, the overall number of threads will increase without bound; a very large number of Vert.x event loop threads can be seen in fastthread. Eventually there are too many threads for the system to handle and everything dies.
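
The asynchronous teardown is visible in the Vert.x API itself: close() returns immediately, and the threads are only released once the completion handler fires.

import io.vertx.core.Vertx;

public final class AsyncCloseSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    // ... handle one forwarded request ...
    vertx.close(ar -> {
      // The event-loop and worker threads are released only around this
      // point, some time after close() returned. Under sustained load, new
      // requests may already have created more Vertx instances by now, so
      // the total thread count keeps growing.
      System.out.println("Vertx closed: " + ar.succeeded());
    });
  }
}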

@purplefox

Hi @the-cybersapien

Could you tell me the level of concurrency (how many concurrent requests) you were driving the system at from the load test tool when this occurred?

@the-cybersapien

Hey @purplefox

We tested with concurrency ranging from a minimum of 200 to a maximum of 6000.
Results were similar across the range, with latency increasing almost proportionally with concurrency.

@purplefox

Thanks, can you tell me how many connections you used?

@the-cybersapien

Hey @purplefox ,

By our understanding, the Locust client has a 1:1 mapping of concurrency to connections, so the number of connections was the same: 200-6000.
