librdkafka crashes due to seemingly racing condition #1125
Comments
Thanks for your thorough analysis. There have been plenty of bug fixes since 0.9.2, and the metadata handling has been revamped, so I'd like to ask you to try librdkafka v0.9.4 and see if it is reproducible there.
Thanks for the fast response. We will definitely upgrade our product to a later version of librdkafka. Right now we have to hotfix our product in the field, so we have to patch v0.9.2. Could you please comment on whether we can move the read-write lock up a few lines in rdkafka_topic.c? Do you think it is a reasonable way to ensure exclusive access to rkt_topic in this function? From my inspection, no additional lock is taken in the few lines above the original read-write lock, so I don't think the change risks a deadlock. Thanks.
void rd_kafka_topic_destroy_final (rd_kafka_itopic_t *rkt) {
move lock below to here ====>
move this line up ====> rd_kafka_wrlock(rkt->rkt_rk);
Your analysis looks sound; moving the wrlock up there should help. This looks like it is still an issue on v0.9.4.
When the crash occurs, at which line is your application thread executing?
At line 112 of the code below, in the call to rd_kafka_destroy(...):
108 //get_msg_p(rk, rkt, tinfo);
Identified and fixed by @benli123 #Changelog
Here's the fix I just pushed to master:
Thanks for a great bug report (and fix!)
Thanks for the quick fix. I will patch our 0.9.2 accordingly. BTW, I noticed that you still keep the following lines outside the read-write lock and just moved them after it. Could you please let me know the reason? I can imagine why it works, but I am not sure. Thanks.
if (rkt->rkt_topic)
As soon as the rkt is taken off the rk_topics list (which is done inside the lock scope), no other code can reach it. After that unlock we are free to do anything with the rkt, since it is only accessible by the current thread.
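The unlink-then-free reasoning above can be sketched with a toy model. This is not the librdkafka source: `demo_handle`, `demo_topic`, and every function below are hypothetical stand-ins, and the sketch assumes the list is the only way other threads can reach a topic and that readers keep the read lock for as long as they use an entry.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for rkt and rk; not librdkafka types. */
struct demo_topic { char *name; struct demo_topic *next; };
struct demo_handle { pthread_rwlock_t lock; struct demo_topic *topics; };

/* Insert a topic at the head of the list under the write lock. */
static struct demo_topic *demo_topic_add(struct demo_handle *h, const char *name) {
    struct demo_topic *t = malloc(sizeof(*t));
    if (!t) return NULL;
    t->name = strdup(name);
    if (!t->name) { free(t); return NULL; }
    pthread_rwlock_wrlock(&h->lock);
    t->next = h->topics;
    h->topics = t;
    pthread_rwlock_unlock(&h->lock);
    return t;
}

/* Readers only ever reach topics through the list, under the read lock. */
static int demo_topic_is_reachable(struct demo_handle *h, const char *name) {
    int found = 0;
    pthread_rwlock_rdlock(&h->lock);
    for (struct demo_topic *t = h->topics; t; t = t->next)
        if (strcmp(t->name, name) == 0) { found = 1; break; }
    pthread_rwlock_unlock(&h->lock);
    return found;
}

/* Unlink inside the lock scope, free after the unlock. */
static void demo_topic_destroy_final(struct demo_handle *h, struct demo_topic *victim) {
    pthread_rwlock_wrlock(&h->lock);
    for (struct demo_topic **pp = &h->topics; *pp; pp = &(*pp)->next)
        if (*pp == victim) { *pp = victim->next; break; }
    pthread_rwlock_unlock(&h->lock);

    /* The topic is off the list, so only this thread can still see it. */
    free(victim->name);
    free(victim);
}

/* Returns 1 if the destroyed topic is gone and the other one is intact. */
static int demo_unlink_then_free(void) {
    struct demo_handle h = { .topics = NULL };
    if (pthread_rwlock_init(&h.lock, NULL) != 0) return -1;
    struct demo_topic *a = demo_topic_add(&h, "topic-a");
    struct demo_topic *b = demo_topic_add(&h, "topic-b");
    if (!a || !b) return -1;
    demo_topic_destroy_final(&h, a);
    int ok = !demo_topic_is_reachable(&h, "topic-a") &&
             demo_topic_is_reachable(&h, "topic-b");
    demo_topic_destroy_final(&h, b);
    pthread_rwlock_destroy(&h.lock);
    return ok;
}
```

Freeing outside the lock keeps the write-lock hold time short; it is only safe under the stated assumption that no reader retains a pointer to the entry after releasing the read lock.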
Description
We are using librdkafka (git tag 0.9.2-RC1) for our product development and have run into a crash inside the library. When accessing rkt->rkt_topic in rdkafka_request.c:rd_kafka_MetadataRequest0(...), we found that rkt_topic.len is corrupted with an unreasonably large value.
Due to the requirements of our product, we use librdkafka as a Kafka consumer that runs continuously in a loop. In each iteration, we create an rk, process messages for a few seconds, and then destroy the rk.
After some research, I was able to reproduce this problem with a simple Kafka consumer application. There seems to be a race condition between terminating the rk and processing a metadata request (please see the stack traces of both threads at the end). In the crashing case, one thread in rdkafka_request.c:rd_kafka_MetadataRequest0(...) holds the read lock on rk and starts to use one of the topics in rk's topic list. Meanwhile, another thread runs rdkafka_topic.c:rd_kafka_topic_destroy_final(...), which calls rd_kafkap_str_destroy(rkt->rkt_topic); (rdkafka_topic.c:107) to destroy the same topic, and only then blocks on rd_kafka_wrlock(rkt->rkt_rk) (rdkafka_topic.c:112), waiting for rd_kafka_MetadataRequest0(...) to finish.
It seems that rkt_topic is being accessed here without protection. Is it possible to solve this problem by moving the rd_kafka_wrlock line before the rd_kafkap_str_destroy call in rdkafka_topic.c? Do you have any suggestions?
Thanks
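To illustrate the proposed ordering, here is a minimal pthread sketch in which the destroyer takes the write lock before it frees the name. It is a toy model rather than librdkafka code: `toy_handle`, `toy_topic`, and the functions below are hypothetical, with `name` standing in for rkt_topic and `lock` for the rk read-write lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct toy_handle;

/* Hypothetical stand-in for rkt. */
struct toy_topic {
    char *name;               /* plays the role of rkt_topic */
    struct toy_handle *owner; /* plays the role of rkt_rk */
};

/* Hypothetical stand-in for rk, with a one-element "topic list". */
struct toy_handle {
    pthread_rwlock_t lock;
    struct toy_topic *topic;
};

/* Destroyer thread: write lock first, then free and unlink. */
static void *toy_topic_destroy_final(void *arg) {
    struct toy_topic *t = arg;
    struct toy_handle *h = t->owner;

    pthread_rwlock_wrlock(&h->lock); /* blocks while any reader holds the lock */
    free(t->name);
    t->name = NULL;
    h->topic = NULL;
    pthread_rwlock_unlock(&h->lock);

    free(t);
    return NULL;
}

/* Returns the name length a reader sees while the destroyer is running,
 * or -1 on setup failure. */
static long toy_run_demo(void) {
    struct toy_handle h;
    pthread_t destroyer;
    long seen;

    if (pthread_rwlock_init(&h.lock, NULL) != 0) return -1;
    struct toy_topic *t = malloc(sizeof(*t));
    if (!t) return -1;
    t->name = strdup("orders");
    if (!t->name) { free(t); return -1; }
    t->owner = &h;
    h.topic = t;

    /* Reader side: hold the read lock across the whole access. */
    pthread_rwlock_rdlock(&h.lock);
    if (pthread_create(&destroyer, NULL, toy_topic_destroy_final, t) != 0) {
        pthread_rwlock_unlock(&h.lock);
        free(t->name);
        free(t);
        return -1;
    }
    /* The destroyer cannot free the name until the read lock is released. */
    seen = (long)strlen(h.topic->name);
    pthread_rwlock_unlock(&h.lock);

    pthread_join(destroyer, NULL);
    assert(h.topic == NULL); /* the topic was unlinked under the write lock */
    pthread_rwlock_destroy(&h.lock);
    return seen;
}
```

With the free placed before the write lock instead, the `strlen` call could run on freed memory, which matches the corrupted rkt_topic.len described above.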
How to reproduce
The problem is hard to reproduce. It normally takes a few weeks to show up in our production environment. I wrote a simpler standalone program to reproduce it, shown below:
========= reproducer code snippet =====
========= stack trace ===========
----------- thread causing crash ----------
---------- end of stack trace ----------------
Checklist
Please provide the following information:
git tag: 0.9.2-RC1
kafka_2.11-0.10.1.1
default consumer configuration
CentOS 6.5
Yes
No
Logs (with debug=.. as necessary) from librdkafka: No logs
No
Yes.