
hang on rd_kafka_destroy() in repetitive, quick succession #747

Closed
5 of 9 tasks
trthulhu opened this issue Aug 3, 2016 · 17 comments

@trthulhu

trthulhu commented Aug 3, 2016

Description

I'm pulling data from a 3-node Kafka cluster using the legacy consumer: 1 topic, 6 partitions, starting at the beginning and consuming for 1 second, then repeating the whole procedure over and over.

This is just a reproducer of one of our customer issues. Using librdkafka 0.9.1 release.

Apache Kafka 0.9
CentOS 7

Let me know if you need any more information or if the reproducer isn't working for you (though I won't be answering until Monday: vacation).

Thanks!

How to reproduce

Attached C++ file:
reproducer.zip

I can reproduce this easily in my environment with ./reproducer 10 1000 1000 (10 threads, 1 second runtime, 1000 iterations). Note: each thread does the same work. You'll want to change the hard-coded topic and broker names.
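In outline, each worker thread does something like the following (a rough sketch only; the real code is in the attached reproducer.zip, and the broker address, topic name and partition count here are placeholders):

```cpp
#include <librdkafka/rdkafka.h>

#include <chrono>
#include <thread>
#include <vector>

// One consume cycle: create a handle, consume briefly with the legacy
// consumer, then tear everything down. Error handling is omitted for brevity.
static void worker(int iterations, int runtime_ms) {
    for (int i = 0; i < iterations; i++) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_conf_set(conf, "metadata.broker.list", "broker1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "test-topic", NULL);

        const int partitions = 6;
        for (int p = 0; p < partitions; p++)
            rd_kafka_consume_start(rkt, p, RD_KAFKA_OFFSET_BEGINNING);

        auto deadline = std::chrono::steady_clock::now() +
                        std::chrono::milliseconds(runtime_ms);
        while (std::chrono::steady_clock::now() < deadline) {
            for (int p = 0; p < partitions; p++) {
                rd_kafka_message_t *msg = rd_kafka_consume(rkt, p, 10);
                if (msg)
                    rd_kafka_message_destroy(msg);
            }
        }

        for (int p = 0; p < partitions; p++)
            rd_kafka_consume_stop(rkt, p);

        rd_kafka_topic_destroy(rkt);
        rd_kafka_destroy(rk);  // this is where the hang is observed
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 10; t++)  // ./reproducer 10 1000 1000
        threads.emplace_back(worker, /*iterations=*/1000, /*runtime_ms=*/1000);
    for (auto &th : threads)
        th.join();
    return 0;
}
```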

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): 0.9.1
  • Apache Kafka version: 0.9
  • librdkafka client configuration:
  • Operating system: CentOS 7
  • Using the legacy Consumer
  • Using the high-level KafkaConsumer
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

edenhill commented Aug 5, 2016

Thanks for providing a reproducible program!

You call start() on 6 partitions (0-5) but only call stop() for the first one (0); adding the corresponding number of stop() calls solves the hang.
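In sketch form (the handle, topic and partition count are assumed here; this is not the reproducer's actual code):

```cpp
#include <librdkafka/rdkafka.h>

// Sketch only: every partition that received consume_start() needs a matching
// consume_stop() before the topic and handle are destroyed. The handle, topic
// and partition count are assumed to come from the surrounding code.
static void stop_and_destroy(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                             int partitions) {
    for (int p = 0; p < partitions; p++)
        rd_kafka_consume_stop(rkt, p);  // not just partition 0

    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);  // no longer hangs once every started partition is stopped
}
```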

@trthulhu
Author

trthulhu commented Aug 8, 2016

Whoops! Sorry for the oversight -- however, the issue still occurs in our actual product, which definitely does close everything. I tried to distill everything we do with rdkafka into the reproducer, but it is clearly not adequate. Let me see if I can cause the problem again while closing everything properly.

@trthulhu
Author

trthulhu commented Aug 9, 2016

I was able to reproduce it again after fixing the close issue. This time I used 20 threads, each consuming from 10 different topics (with 1 partition each). Check the newly attached source to see if I made another oversight... I also inspected the threads: they show a variety of backtraces when hung, but two common ones are:

(gdb) t 68
[Switching to thread 68 (Thread 0x7fdc3ccc8700 (LWP 1662355))]
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
(gdb) bt
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
#1 0x0000000000449703 in thrd_join (thr=, res=0x0) at tinycthread.c:725
#2 0x000000000040af93 in rd_kafka_destroy_internal (rk=rk@entry=0x7fdbd40008c0) at rdkafka.c:550
#3 0x000000000040bf2d in rd_kafka_thread_main (arg=0x7fdbd40008c0) at rdkafka.c:937
#4 0x0000000000449329 in _thrd_wrapper_function (aArg=) at tinycthread.c:596
#5 0x0000003ed8a079d1 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003ed86e88fd in clone () from /lib64/libc.so.6
(gdb) t 80
[Switching to thread 80 (Thread 0x7fdc121fc700 (LWP 1661400))]
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
(gdb) bt
#0 0x0000003ed8a0822d in pthread_join () from /lib64/libpthread.so.0
#1 0x0000000000449703 in thrd_join (thr=thr@entry=140583889569536, res=res@entry=0x0) at tinycthread.c:725
#2 0x000000000040c878 in rd_kafka_destroy_app (blocking=1, rk=0x7fdbd40008c0) at rdkafka.c:466
#3 rd_kafka_destroy (rk=0x7fdbd40008c0) at rdkafka.c:474
#4 0x0000000000403cba in thread_func(void*) ()
#5 0x0000003ed8a079d1 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003ed86e88fd in clone () from /lib64/libc.so.6

reproducer.zip

I realize the use case is completely far-fetched, but it's the easiest way to make the issue show up.

@edenhill
Contributor

edenhill commented Aug 9, 2016

Awesome! (in one way)

Can you also provide thread apply all bt output from gdb? (Preferably as a text file attachment; it will grow big.)

Thanks

@trthulhu
Author

trthulhu commented Aug 9, 2016

gdb.txt

At this point in time, worker-13 and -14 were not making progress.

@edenhill
Contributor

I thought I had managed to reproduce this pretty quickly, but it turned out to be a missing lock around iteration_counter; with that in place I can't reproduce this with reproducer.cpp any more.

@trthulhu
Author

trthulhu commented Aug 11, 2016

Can I see what you added? Each thread should be writing to its own index in the counter, and the reader thread doesn't care about dirty reads (locks shouldn't be necessary unless stale processor-cache reads were happening). It's very possible I missed something, of course; I was just scrambling this together.

@edenhill
Contributor

I don't have access to it right now and will upload it later, but it was mainly a lock around all iteration_counter reads and writes, as well as zero-initialization of that array.
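Roughly along these lines (a sketch only; the array size, names and surrounding loop are assumptions, not the actual diff):

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static const int kNumThreads = 20;
static int iteration_counter[kNumThreads] = {0};  // zero-initialized
static std::mutex counter_mutex;

static void worker(int idx, int iterations) {
    for (int i = 0; i < iterations; i++) {
        // ... one consume/teardown cycle would run here ...
        std::lock_guard<std::mutex> lock(counter_mutex);
        iteration_counter[idx]++;  // every write goes through the lock
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < kNumThreads; t++)
        threads.emplace_back(worker, t, 1000);
    for (auto &th : threads)
        th.join();

    // In the reproducer the reader is a separate polling thread; either way,
    // every read takes the same lock.
    std::lock_guard<std::mutex> lock(counter_mutex);
    for (int t = 0; t < kNumThreads; t++)
        std::printf("thread %d: %d iterations\n", t, iteration_counter[t]);
    return 0;
}
```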

@trthulhu
Author

I was able to reproduce it with the reader thread removed completely; in fact, I only added that thread to make it easier to tell when the hang was happening. Even if the writes were getting corrupted, threads should not hang. I suspect adding locks to every read/write operation introduces sleeps/yields/etc. that allow whatever underlying locking or ref-counting is happening to proceed normally. Just a conjecture, though.
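For what it's worth, the same per-slot scheme can also be made formally race-free without a mutex by making each slot atomic (a sketch under those assumptions, not code from this thread; the names are invented):

```cpp
#include <array>
#include <atomic>
#include <cstdio>

static const int kNumThreads = 20;
// Static storage, so every slot starts at zero.
static std::array<std::atomic<int>, kNumThreads> iteration_counter;

static void bump(int idx) {
    // Each worker only ever touches its own slot.
    iteration_counter[idx].fetch_add(1, std::memory_order_relaxed);
}

static int peek(int idx) {
    // The reader tolerates slightly stale values.
    return iteration_counter[idx].load(std::memory_order_relaxed);
}

int main() {
    bump(3);
    std::printf("slot 3 = %d\n", peek(3));
    return 0;
}
```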

@edenhill
Contributor

I'm running your unmodified reproducer.cpp (well, I made sure to initialize the iteration_counters to zero) on 10 topics with 10 million messages in each, without hanging.
How long does it typically take you to reproduce this?

(This is quite a good tool for pushing the brokers; the broker processes are soaring at around 250% CPU usage.)

@trthulhu
Author

Interesting... it usually takes me only 100 or so iterations for a thread or two to stop making progress. My brokers are on separate machines, if that matters. I can do it regularly with 20 threads (the count I usually try).

@trthulhu
Author

trthulhu commented Aug 12, 2016

I just tested the reproducer on master and the hang doesn't appear to happen there... I went back to my 0.9.1 branch and it happens again. Perhaps it has been fixed recently? Have you made changes that could have affected this? I will keep checking to make sure.

@trthulhu
Author

Latest status: we switched to the master branch for our latest tests, using our system and a customer's settings to reproduce their issue, and it ran successfully for a while (the hang did not occur). The bad news is that we got a core dump after 6 hours of running instead, which looks similar to this fixed issue: #511

Found this line: *** rdkafka_buf.c:140:rd_kafka_buf_grow: assert: rkbuf->rkbuf_flags & RD_KAFKA_OP_F_FREE ***

and backtrace:
#0 0x00007f99927dd989 in raise () from /lib64/libc.so.6
#1 0x00007f99927df098 in abort () from /lib64/libc.so.6
#2 0x00007f98cd404133 in rd_kafka_crash (file=, line=, function=, rk=0x0, reason=) at rdkafka.c:2389
#3 0x00007f98cd41f9c2 in rd_kafka_buf_grow (rkbuf=0x7f7d058ccec0, needed_len=) at rdkafka_buf.c:140
#4 0x00007f98cd430160 in rd_kafka_buf_write (len=84482498, data=0x7f9860009790, rkbuf=0x7f7d058ccec0) at rdkafka_buf.h:374
#5 rd_kafka_buf_write_kstr (kstr=0x7f9860009780, rkbuf=0x7f7d058ccec0) at rdkafka_buf.h:512
#6 rd_kafka_MetadataRequest0 (rkb=0x7f7e0b666f10, all_topics=, only_rkt=0x0, reason=) at rdkafka_request.c:1495
#7 0x00007f98cd40c64d in rd_kafka_broker_metadata_req_op (rkb=rkb@entry=0x7f7e0b666f10, rko=rko@entry=0x7f8485675670) at rdkafka_broker.c:775
#8 0x00007f98cd4137fb in rd_kafka_broker_op_serve (rkb=rkb@entry=0x7f7e0b666f10, rko=0x7f8485675670) at rdkafka_broker.c:2803
#9 0x00007f98cd4142cb in rd_kafka_broker_serve (rkb=rkb@entry=0x7f7e0b666f10, timeout_ms=timeout_ms@entry=10) at rdkafka_broker.c:3043
#10 0x00007f98cd414631 in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f7e0b666f10, timeout_ms=timeout_ms@entry=0) at rdkafka_broker.c:3103
#11 0x00007f98cd414e7f in rd_kafka_broker_thread_main (arg=arg@entry=0x7f7e0b666f10) at rdkafka_broker.c:4415
#12 0x00007f98cd441db7 in _thrd_wrapper_function (aArg=) at tinycthread.c:613
#13 0x00007f999238fdf3 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f999289e3dd in clone () from /lib64/libc.so.6

It did look like we were having difficulty connecting to brokers a minute or so prior to the crash. Thoughts?

@edenhill
Contributor

I just performed a somewhat large merge to master; what exact sha1 are you on?

@trthulhu
Author

Not sure exactly, but it was probably a week old or so. I will try the latest today.

edenhill added a commit that referenced this issue Sep 5, 2016
This might have caused some hang-on-termination issues (#747)
edenhill added a commit that referenced this issue Sep 5, 2016
This might have caused hang-on-termination too (#747)
@trthulhu
Author

trthulhu commented Sep 7, 2016

After several more tests we've determined that we cannot reproduce the hang with the latest master branch so far, which is good!

But we are indeed now running into the issue I mentioned above pretty readily (the same way we used to reproduce the hang). Should I create a new issue for that, or keep this one open to track it?

@edenhill
Contributor

edenhill commented Sep 7, 2016

Great!

Yes please, file a new issue
