
hang on rd_kafka_destroy() in repetitive, quick succession #747

Closed
5 of 9 tasks
trthulhu opened this issue Aug 3, 2016 · 17 comments

@trthulhu

trthulhu commented Aug 3, 2016

Description

I'm pulling data from a 3-node Kafka cluster using the legacy consumer: 1 topic, 6 partitions, starting at the beginning and consuming for 1 second, then repeating the whole procedure over and over.

This is just a reproducer of one of our customer issues. Using librdkafka 0.9.1 release.

Apache Kafka 0.9
CentOS 7

Let me know if you need any more information or if the reproducer isn't working for you (though I won't be answering until Monday: vacation).

Thanks!

How to reproduce

Attached C++ file:
reproducer.zip

I can reproduce this easily in my environment with ./reproducer 10 1000 1000 (10 threads, 1 second runtime, 1000 iterations). Note: each thread does the same work. You'll want to change the hard-coded topic and broker names.
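In outline, each worker thread does something like the following (a rough sketch only; the real code is in the attached reproducer.zip, and the broker address, topic name and partition count here are placeholders):

```cpp
#include <librdkafka/rdkafka.h>

#include <chrono>
#include <thread>
#include <vector>

// One consume cycle: create a handle, consume briefly with the legacy
// consumer, then tear everything down. Error handling is omitted for brevity.
static void worker(int iterations, int runtime_ms) {
    for (int i = 0; i < iterations; i++) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_conf_set(conf, "metadata.broker.list", "broker1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "test-topic", NULL);

        const int partitions = 6;
        for (int p = 0; p < partitions; p++)
            rd_kafka_consume_start(rkt, p, RD_KAFKA_OFFSET_BEGINNING);

        auto deadline = std::chrono::steady_clock::now() +
                        std::chrono::milliseconds(runtime_ms);
        while (std::chrono::steady_clock::now() < deadline) {
            for (int p = 0; p < partitions; p++) {
                rd_kafka_message_t *msg = rd_kafka_consume(rkt, p, 10);
                if (msg)
                    rd_kafka_message_destroy(msg);
            }
        }

        for (int p = 0; p < partitions; p++)
            rd_kafka_consume_stop(rkt, p);

        rd_kafka_topic_destroy(rkt);
        rd_kafka_destroy(rk);  // this is where the hang is observed
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 10; t++)  // ./reproducer 10 1000 1000
        threads.emplace_back(worker, /*iterations=*/1000, /*runtime_ms=*/1000);
    for (auto &th : threads)
        th.join();
    return 0;
}
```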

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): 0.9.1
  • Apache Kafka version: 0.9
  • librdkafka client configuration:
  • Operating system: CentOS 7
  • Using the legacy Consumer
  • Using the high-level KafkaConsumer
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

edenhill commented Aug 5, 2016

Thanks for providing a reproducible program!

You call start() on 6 partitions (0-5) but only call stop() for the first one (0); adding the corresponding number of stop() calls solves the hang.
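In sketch form (the handle, topic and partition count are assumed here; this is not the reproducer's actual code):

```cpp
#include <librdkafka/rdkafka.h>

// Sketch only: every partition that received consume_start() needs a matching
// consume_stop() before the topic and handle are destroyed. The handle, topic
// and partition count are assumed to come from the surrounding code.
static void stop_and_destroy(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                             int partitions) {
    for (int p = 0; p < partitions; p++)
        rd_kafka_consume_stop(rkt, p);  // not just partition 0

    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);  // no longer hangs once every started partition is stopped
}
```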

@trthulhu
Author

trthulhu commented Aug 8, 2016

Whoops! Sorry for the oversight -- however, the issue still occurs in our actual product, which definitely does close everything. I tried to distill everything we do with rdkafka into the reproducer, but it is clearly not adequate. Let me see if I can cause the problem again while closing everything properly.

@trthulhu
Author

trthulhu commented Aug 9, 2016

I was able to reproduce it again after fixing the close issue. This time I used 20 threads, each consuming from 10 different topics (with 1 partition each). Check the newly attached source to see if I made another oversight... I also inspected the threads: they show a variety of backtraces when hung, but two common ones are:

(gdb) t 68
[Switching to thread 68 (Thread 0x7fdc3ccc8700 (LWP 1662355))]
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
(gdb) bt
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
#1 0x0000000000449703 in thrd_join (thr=, res=0x0) at tinycthread.c:725
#2 0x000000000040af93 in rd_kafka_destroy_internal (rk=rk@entry=0x7fdbd40008c0) at rdkafka.c:550
#3 0x000000000040bf2d in rd_kafka_thread_main (arg=0x7fdbd40008c0) at rdkafka.c:937
#4 0x0000000000449329 in _thrd_wrapper_function (aArg=) at tinycthread.c:596
#5 0x0000003ed8a079d1 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003ed86e88fd in clone () from /lib64/libc.so.6
(gdb) t 80
[Switching to thread 80 (Thread 0x7fdc121fc700 (LWP 1661400))]
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
(gdb) bt
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
#1 0x0000000000449703 in thrd_join (thr=thr@entry=140583889569536, res=res@entry=0x0) at tinycthread.c:725
#2 0x000000000040c878 in rd_kafka_destroy_app (blocking=1, rk=0x7fdbd40008c0) at rdkafka.c:466
#3 rd_kafka_destroy (rk=0x7fdbd40008c0) at rdkafka.c:474
#4 0x0000000000403cba in thread_func(void*) ()
#5 0x0000003ed8a079d1 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003ed86e88fd in clone () from /lib64/libc.so.6

reproducer.zip

I realize the use case is completely far-fetched, but it's the easiest way to make the issue show up.

@edenhill
Contributor

edenhill commented Aug 9, 2016

Awesome! (in one way)

Can you also provide thread apply all bt output from gdb? (Preferably as a text file attachment; it will grow big.)

Thanks

@trthulhu
Author

trthulhu commented Aug 9, 2016

gdb.txt

At this point in time, worker-13 and -14 were not making progress.

@edenhill
Contributor

I thought I had managed to reproduce this pretty quickly, but it turned out to be a missing lock around iteration_counter; with that in place I can't reproduce this with reproducer.cpp any more.

@trthulhu
Author

trthulhu commented Aug 11, 2016

Can I see what you added? Each thread should be writing to its own index in the counter, and the reader thread doesn't care about dirty reads (locks shouldn't be necessary unless stale processor-cache reads were happening). It's very possible I missed something, of course; I was just scrambling this together.

@edenhill
Contributor

I don't have access to it right now and will upload it later, but it was mainly a lock around all iteration_counter reads and writes, as well as zero-initialization of that array.
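Roughly along these lines (a sketch only; the array size, names and surrounding loop are assumptions, not the actual diff):

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static const int kNumThreads = 20;
static int iteration_counter[kNumThreads] = {0};  // zero-initialized
static std::mutex counter_mutex;

static void worker(int idx, int iterations) {
    for (int i = 0; i < iterations; i++) {
        // ... one consume/teardown cycle would run here ...
        std::lock_guard<std::mutex> lock(counter_mutex);
        iteration_counter[idx]++;  // every write goes through the lock
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < kNumThreads; t++)
        threads.emplace_back(worker, t, 1000);
    for (auto &th : threads)
        th.join();

    // In the reproducer the reader is a separate polling thread; either way,
    // every read takes the same lock.
    std::lock_guard<std::mutex> lock(counter_mutex);
    for (int t = 0; t < kNumThreads; t++)
        std::printf("thread %d: %d iterations\n", t, iteration_counter[t]);
    return 0;
}
```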

@trthulhu
Author

I was able to reproduce it with the reader thread removed completely; in fact, I only added that thread to make it easier to tell when the hang was happening. Even if the writes were getting corrupted, threads should not hang. I suspect adding locks to every read/write operation introduces sleeps/yields/etc. that allow whatever underlying locking or ref-counting is happening to proceed normally. Just a conjecture, though.
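For what it's worth, the same per-slot scheme can also be made formally race-free without a mutex by making each slot atomic (a sketch under those assumptions, not code from this thread; the names are invented):

```cpp
#include <array>
#include <atomic>
#include <cstdio>

static const int kNumThreads = 20;
// Static storage, so every slot starts at zero.
static std::array<std::atomic<int>, kNumThreads> iteration_counter;

static void bump(int idx) {
    // Each worker only ever touches its own slot.
    iteration_counter[idx].fetch_add(1, std::memory_order_relaxed);
}

static int peek(int idx) {
    // The reader tolerates slightly stale values.
    return iteration_counter[idx].load(std::memory_order_relaxed);
}

int main() {
    bump(3);
    std::printf("slot 3 = %d\n", peek(3));
    return 0;
}
```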

@edenhill
Contributor

I'm running your unmodified reproducer.cpp (well, I made sure to initialize the iteration_counters to zero) on 10 topics with 10 million messages in each, without hanging.
How long does it typically take you to reproduce this?

(This is quite a good tool for pushing the brokers; the broker processes are soaring at around 250% CPU usage.)

@trthulhu
Author

Interesting... it usually takes me only 100 or so iterations for a thread or two to stop making progress. My brokers are on separate machines, if that matters. I can do it regularly with 20 threads (the count I usually try).

@trthulhu
Author

trthulhu commented Aug 12, 2016

I just tested the reproducer on master and the hang doesn't appear to happen there... I went back to my 0.9.1 branch and it happens again. Perhaps it has been fixed recently? Have you made changes that could have affected this? I will keep checking to make sure.

@trthulhu
Author

Latest status: we switched to the master branch for our latest tests, using our system and a customer's settings to reproduce their issue, and it ran successfully for a while (the hang did not occur). The bad news is that we got a core dump after 6 hours of running instead, which looks similar to this fixed issue: #511

Found this line: *** rdkafka_buf.c:140:rd_kafka_buf_grow: assert: rkbuf->rkbuf_flags & RD_KAFKA_OP_F_FREE ***

and backtrace:
#0 0x00007f99927dd989 in raise () from /lib64/libc.so.6
#1 0x00007f99927df098 in abort () from /lib64/libc.so.6
#2 0x00007f98cd404133 in rd_kafka_crash (file=, line=, function=, rk=0x0, reason=) at rdkafka.c:2389
#3 0x00007f98cd41f9c2 in rd_kafka_buf_grow (rkbuf=0x7f7d058ccec0, needed_len=) at rdkafka_buf.c:140
#4 0x00007f98cd430160 in rd_kafka_buf_write (len=84482498, data=0x7f9860009790, rkbuf=0x7f7d058ccec0) at rdkafka_buf.h:374
#5 rd_kafka_buf_write_kstr (kstr=0x7f9860009780, rkbuf=0x7f7d058ccec0) at rdkafka_buf.h:512
#6 rd_kafka_MetadataRequest0 (rkb=0x7f7e0b666f10, all_topics=, only_rkt=0x0, reason=) at rdkafka_request.c:1495
#7 0x00007f98cd40c64d in rd_kafka_broker_metadata_req_op (rkb=rkb@entry=0x7f7e0b666f10, rko=rko@entry=0x7f8485675670) at rdkafka_broker.c:775
#8 0x00007f98cd4137fb in rd_kafka_broker_op_serve (rkb=rkb@entry=0x7f7e0b666f10, rko=0x7f8485675670) at rdkafka_broker.c:2803
#9 0x00007f98cd4142cb in rd_kafka_broker_serve (rkb=rkb@entry=0x7f7e0b666f10, timeout_ms=timeout_ms@entry=10) at rdkafka_broker.c:3043
#10 0x00007f98cd414631 in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f7e0b666f10, timeout_ms=timeout_ms@entry=0) at rdkafka_broker.c:3103
#11 0x00007f98cd414e7f in rd_kafka_broker_thread_main (arg=arg@entry=0x7f7e0b666f10) at rdkafka_broker.c:4415
#12 0x00007f98cd441db7 in _thrd_wrapper_function (aArg=) at tinycthread.c:613
#13 0x00007f999238fdf3 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f999289e3dd in clone () from /lib64/libc.so.6

It did look like we were having difficulty connecting to brokers a minute or so prior to the crash. Thoughts?

@edenhill
Contributor

I just performed a somewhat large merge to master; what exact sha1 are you on?

@trthulhu
Author

Not sure exactly, but it was probably a week old or so. I will try the latest today.

edenhill added a commit that referenced this issue Sep 5, 2016
This might have caused some hang-on-termination issues (#747)
edenhill added a commit that referenced this issue Sep 5, 2016
This might have caused hang-on-termination too (#747)
@trthulhu
Author

trthulhu commented Sep 7, 2016

After several more tests we've determined that we cannot reproduce the hang with the latest master branch so far, which is good!

But we are indeed now running into the issue I mentioned above pretty readily (the same way we used to reproduce the hang). Should I create a new issue for that, or keep this one open to track it?

@edenhill
Contributor

edenhill commented Sep 7, 2016

Great!

Yes please, file a new issue
