hang on rd_kafka_destroy() in repetitive, quick succession #747
Comments
Thanks for providing a reproducible program! You call start on 6 (0-5) partitions but you only call stop for the first one (0); adding the corresponding number of stop() calls solves the hang.
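For illustration, a minimal sketch of the pairing described above, using librdkafka's legacy (simple) consumer C API. The function name, topic handle, and partition count are placeholders, not code from the attached reproducer:

```cpp
#include <librdkafka/rdkafka.h>

/* Sketch only: every rd_kafka_consume_start() needs a matching
 * rd_kafka_consume_stop() before rd_kafka_destroy(); otherwise the
 * destroy can block waiting on still-active partition consumers. */
static void consume_window(rd_kafka_topic_t *rkt, int partition_cnt) {
    for (int p = 0; p < partition_cnt; p++)
        rd_kafka_consume_start(rkt, p, RD_KAFKA_OFFSET_BEGINNING);

    /* ... consume for a while ... */

    for (int p = 0; p < partition_cnt; p++)   /* stop ALL partitions, */
        rd_kafka_consume_stop(rkt, p);        /* not just partition 0 */
}
```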
Whoops! Sorry for the oversight -- however, the issue still occurs within our actual product, which definitely does close everything. I tried to pull everything rdkafka-related that we do into the reproducer, but it is of course not adequate. Let me see if I can cause the problem again while closing everything.
I was able to reproduce again after solving the close issue. This time I did 20 threads, each consuming from 10 different topics (with 1 partition each). Check out the newly attached source to see if I made another oversight... I also checked the threads: they get a variety of backtraces when hung, but 2 common ones. I realize the use case is completely far-fetched/crazy, but it's the easiest way to cause the issue to show up.
Awesome! (in one way) Can you also provide ...? Thanks
At this point in time, worker-13 and -14 were not making progress.
I managed to reproduce this pretty quickly, I thought, but it turned out to be a missing lock around the iteration_counter; with that in place I can't reproduce this with reproducer.cpp any more.
Can I see what you added? Each thread should be writing to its own index in the counter, and the reader thread doesn't care about dirty reads (locks shouldn't be necessary unless there were some stale processor-cache reads happening). It's very possible I missed something, of course; I was just scrambling this together.
I don't have access to it now; I will upload it later, but it was mainly a lock around all iteration_counter reads and writes, as well as zero-initialization of that array.
I was able to reproduce after removing the reader thread completely; in fact, I only added it to make it easier to tell when the hang was happening. Even if the writes were getting corrupted, threads should not be hanging. I suspect adding locks to every read/write operation causes sleeps/yields/etc. that allow whatever underlying locking/ref-counting is happening to proceed normally. Just a conjecture, though.
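As an aside, one lock-free shape for such counters (a hypothetical sketch, not the actual reproducer.cpp or the uploaded modification) is a per-thread slot of std::atomic with relaxed ordering; that avoids the data race for the reader without introducing the extra sleeps/yields a mutex might add:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t kMaxThreads = 64;   // hypothetical upper bound on workers

// Zero-initialized, per-thread iteration counters; no mutex needed.
std::array<std::atomic<long>, kMaxThreads> iteration_counter{};

// Writer side: each worker bumps only its own slot.
void bump(std::size_t thread_idx) {
    iteration_counter[thread_idx].fetch_add(1, std::memory_order_relaxed);
}

// Reader side: values may be slightly stale but are well-defined (no tearing).
long read_counter(std::size_t thread_idx) {
    return iteration_counter[thread_idx].load(std::memory_order_relaxed);
}
```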
I'm running your unmodified reproducer.cpp (well, I made sure to initialize the iteration_counters to zero) on 10 topics with 10 million messages in each, without hanging. (This is quite a good tool for pushing the brokers; the broker processes are soaring at around 250% CPU usage.)
Interesting... it usually only takes me 100 or so iterations for a thread or two to halt progress. My brokers are on separate machines, if that matters. I can do it regularly with 20 threads (the amount I usually try).
I just tested the reproducer on master and it doesn't appear to happen there... then went back to my 0.9.1 branch and it happens. Perhaps it has been fixed somehow recently? Have you made changes that could've affected this? I will keep checking to make sure.
Latest status: we switched to the master branch for our latest tests, using our system and a customer's settings to reproduce their issue, and it ran successfully for a while (the hang did not occur). The bad news is that we got a core after 6 hours of running instead, which looks similar to this fixed issue: #511. Found this line: *** rdkafka_buf.c:140:rd_kafka_buf_grow: assert: rkbuf->rkbuf_flags & RD_KAFKA_OP_F_FREE *** and a backtrace. It did look like we were having difficulty connecting to brokers a minute or so prior to the crash. Thoughts?
I just performed a somewhat large merge to master; what exact sha1 are you on?
Not sure exactly, but it was probably a week old or so. I will try the latest today.
This might have caused some hang-on-termination issues (#747)
This might have caused hang-on-termination too (#747)
After several more tests we've determined that we cannot reproduce the hang with the latest master branch so far, which is good! But we are indeed now running into the issue I mentioned above pretty readily (the same way we used to reproduce the hang). Should I create a new issue for that or keep this one open to track it?
Great! Yes please, file a new issue.
Description
I'm pulling data from a 3-node Kafka cluster using the legacy consumer: 1 topic, 6 partitions, starting at the beginning for 1 second, then cycling over the whole procedure over and over.
This is just a reproducer of one of our customers' issues. Using the librdkafka 0.9.1 release.
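For context, a rough sketch of the cycle described above, using the librdkafka legacy consumer C API. Broker/topic names, the polling loop, and the minimal error handling are assumptions, not the contents of the attached reproducer.cpp:

```cpp
#include <librdkafka/rdkafka.h>
#include <ctime>

/* One iteration of the cycle: create a consumer, start all partitions at the
 * beginning, consume for ~1 second, stop, destroy, repeat from the caller.
 * Error checking is omitted for brevity. */
static void one_iteration(const char *brokers, const char *topic,
                          int partition_cnt) {
    char errstr[512];

    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_conf_set(conf, "metadata.broker.list", brokers,
                      errstr, sizeof(errstr));

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                  errstr, sizeof(errstr));
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, topic, NULL);

    for (int p = 0; p < partition_cnt; p++)
        rd_kafka_consume_start(rkt, p, RD_KAFKA_OFFSET_BEGINNING);

    std::time_t end = std::time(nullptr) + 1;     /* ~1 second of consuming */
    while (std::time(nullptr) < end) {
        for (int p = 0; p < partition_cnt; p++) {
            rd_kafka_message_t *msg = rd_kafka_consume(rkt, p, 100);
            if (msg)
                rd_kafka_message_destroy(msg);
        }
    }

    for (int p = 0; p < partition_cnt; p++)
        rd_kafka_consume_stop(rkt, p);

    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);                         /* where the hang shows up */
}
```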
Apache Kafka 0.9
CentOS 7
Let me know if you need any more information or if the reproducer isn't working for you (though I won't be answering until Monday: vacation).
Thanks!
How to reproduce
Attached C++ file:
reproducer.zip
I can reproduce easily in my environment with this by running ./reproducer 10 1000 1000 (10 threads, 1-second runtime, 1000 iterations). Note: each thread is doing the same work. You'll want to change the hard-coded topic and broker names.
Checklist
Please provide the following information:
Provide logs (with debug=.. as necessary) from librdkafka