Consumer timeout issue? #2634
Comments
It would be very useful to have timestamps in the debug logs, since this is a timing-related issue.
My apologies, I accidentally stripped out the timestamps from the logs earlier. Logs with timestamps below:
And the second one:
As soon as you see a line like this: .. the messages should be available for the application to consume, and any blocking consume/poll call should return a message. The two reasons I can think of that could cause the long delay would be:
Hmm, the first case cannot be true, as we use 1200 as the max msg count:
When the consumer is consuming 1200 msgs/s, the time is very fast, under 50ms. The above only happens when the traffic is medium to low (in the two cases above, fewer than 30 messages were returned). And because it is not high traffic, I can also confirm the CPU is under 30% utilised. But your statement that the messages should be available after CONSUME is very helpful; I will track this further.
Since this is in a Node.js environment, out of curiosity, what is the minimum number of threads required by librdkafka? I have seen discussions about UV_THREADPOOL_SIZE in node-rdkafka.
I don't know about Node, but librdkafka will want at least 3 CPU (v)cores to be reasonably performant.
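For context, libuv's thread pool size in Node.js is controlled by the UV_THREADPOOL_SIZE environment variable (default 4). Below is a minimal sketch of one way to raise it, assuming it is applied before any thread-pool work is queued; the value 8 is only an example:

```js
// UV_THREADPOOL_SIZE controls the size of libuv's thread pool (default 4).
// It must take effect before the pool is first used, so it is usually set in
// the environment (e.g. UV_THREADPOOL_SIZE=8 node app.js) or assigned at the
// very top of the entry file, before loading node-rdkafka.
process.env.UV_THREADPOOL_SIZE = '8'; // example value, not a recommendation

const Kafka = require('node-rdkafka'); // loaded after the pool size is set
```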
OK, I have tracked down where the issue is; below is the code in node-rdkafka. Basically, it uses the 1s as the timeout for each poll into librdkafka, not as the total timeout. So, if a small number of messages trickle in, say 2 in 500ms, 5 in 800ms, 4 in 300ms, etc., the method will happily keep consuming until it hits 0 messages in 1s. This also explains why a 100ms timeout has a smaller variation in return time, because the chance of timing out (for low traffic) within 100ms is greater than within 1s. It also explains why at high traffic we see the method return in 50ms. This is not great for us, because the method is not flexible enough to cater for both high- and low-throughput conditions.
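The node-rdkafka snippet itself is not reproduced here; the following is only a minimal JavaScript sketch of the behaviour described above, where pollOne is a hypothetical stand-in for a single blocking poll into librdkafka, not the actual node-rdkafka worker code:

```js
// Sketch of the per-poll timeout behaviour described above.
// pollOne(timeoutMs) is a hypothetical helper: it blocks for up to
// timeoutMs and returns one message, or null if the timeout expires.
function consumeUpTo(pollOne, maxMessages, timeoutMs) {
  const messages = [];
  for (let i = 0; i < maxMessages; i++) {
    const msg = pollOne(timeoutMs); // timeout applies to EACH poll
    if (msg === null) break;        // loop only stops after a full
    messages.push(msg);             // timeoutMs window with no message
  }
  return messages;
}
// With messages trickling in slowly, each iteration can take just under
// timeoutMs, so the total time is bounded by maxMessages * timeoutMs
// rather than by timeoutMs.
```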
Just to confirm that this method is causing the issue, I tried this modification locally:
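The exact diff is not shown here; the idea, sketched below with the same hypothetical pollOne helper, is to enforce a total timeout across all polls rather than resetting it on every poll:

```js
// Sketch of the idea behind the local modification: stop looping once a
// total timeout has elapsed. Each poll can still block for up to timeoutMs,
// so the worst case is just under 2 x timeoutMs.
function consumeUpToWithTotalTimeout(pollOne, maxMessages, timeoutMs) {
  const start = Date.now();
  const messages = [];
  for (let i = 0; i < maxMessages; i++) {
    if (Date.now() - start >= timeoutMs) break; // total budget exhausted
    const msg = pollOne(timeoutMs);
    if (msg === null) break;
    messages.push(msg);
  }
  return messages;
}
```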
After this change, consume() always returns in less than 2 x timeout_ms, which is what we expect. Thank you so much for helping me track this down, much appreciated!
After doing more testing, and re-reading your original 2 points, I just wanted to say that you were absolutely spot on. Your first point is correct and helped me track down the issue; I have now implemented a total timeout option, which consume() must return by, and it works well for low/medium traffic. Your second point is also correct: after doing a load test, I can confirm that the return time will be longer due to CPU starvation, but this can be controlled by throttling the consumption rate (sketched below). Thank you once again for your time and help.
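One possible shape for that throttling, as a rough sketch; the batch size, pause length, and the consumeThrottled helper are illustrative choices, not part of node-rdkafka or the actual fix:

```js
// Throttle consumption by capping the batch size and pausing between
// consume() calls so the event loop and CPU get breathing room under load.
const BATCH_SIZE = 300; // placeholder value
const PAUSE_MS = 250;   // placeholder value

function consumeThrottled(consumer) {
  consumer.consume(BATCH_SIZE, (err, messages) => {
    if (err) {
      console.error(err);
    }
    // ... process messages ...
    setTimeout(() => consumeThrottled(consumer), PAUSE_MS);
  });
}
```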
Great analysis and fix!
Description
In our node-rdkafka test app, we have a simple 1 second loop calling consume(), and we use setDefaultConsumeTimeout(1000); a sketch of the loop is below.
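The test app is not reproduced here; this is only a minimal sketch of the kind of loop described, assuming a standard node-rdkafka KafkaConsumer (the broker address, group id, and topic name are placeholders):

```js
const Kafka = require('node-rdkafka');

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'localhost:9092', // placeholder broker
  'group.id': 'timeout-test',               // placeholder group id
  'enable.auto.commit': false,
}, {});

consumer.setDefaultConsumeTimeout(1000);

consumer.on('ready', () => {
  consumer.subscribe(['test-topic']); // placeholder topic

  const loop = () => {
    const start = Date.now();
    // Non-flowing mode: fetch up to 1200 messages, then log the elapsed time.
    consumer.consume(1200, (err, messages) => {
      const count = messages ? messages.length : 0;
      console.log(`consume() returned ${count} messages in ${Date.now() - start} ms`);
      setTimeout(loop, 1000); // simple 1 second loop
    });
  };
  loop();
});

consumer.connect();
```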
We can see the minimum consume time is 1000ms, which is expected, but quite often the time is around 3-5s and sometimes up to 8s!
How to reproduce
We use the simple test app above, which prints the elapsed time for consume() in each loop.
Checklist
- librdkafka version: 1.2.2
- Apache Kafka version: 2.1.3
- Client configuration: "enable.auto.commit": false
- Logs (with debug=.. as necessary) from librdkafka: included below

Below are the logs for an expected result of 1000ms:
Below are the logs for a consume() which took 4.7s:
You can see there are 10 x CONSUME calls, making the total time 4.7s.
If we use setDefaultConsumeTimeout(100), we can see the consume() time is always between 100ms and 200ms. The variation is much smaller compared to the above.
Is there a bug with the consume call which does not time out properly?