librdkafka on IBM machine with native compiler #837
Comments
You did the right thing disabling HAVE_ATOMIC* which makes it fall back to lock-based atomics instead. Can you run your program with helgrind (valgrind) if that's available to see what lock is blocking? |
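For readers unfamiliar with the fallback mentioned above, a lock-based atomic simply guards each read-modify-write with a mutex. The sketch below is illustrative only: the `locked_atomic32_*` names are hypothetical and are not librdkafka's actual types or functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical type for illustration; not librdkafka's real definition. */
typedef struct {
    pthread_mutex_t lock; /* serializes every access to val */
    int32_t val;
} locked_atomic32_t;

static void locked_atomic32_init(locked_atomic32_t *a, int32_t v) {
    pthread_mutex_init(&a->lock, NULL);
    a->val = v;
}

/* Add v and return the new value, mimicking an atomic add-and-fetch. */
static int32_t locked_atomic32_add(locked_atomic32_t *a, int32_t v) {
    int32_t r;
    pthread_mutex_lock(&a->lock);
    a->val += v;
    r = a->val;
    pthread_mutex_unlock(&a->lock);
    return r;
}

/* Read the current value under the same lock. */
static int32_t locked_atomic32_get(locked_atomic32_t *a) {
    int32_t r;
    pthread_mutex_lock(&a->lock);
    r = a->val;
    pthread_mutex_unlock(&a->lock);
    return r;
}
```

This trades speed for portability: it needs only a working pthread mutex, which is why it is a reasonable fallback when compiler atomics are unavailable or suspect.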
I attached GDB and this is what I see: (gdb) where (gdb) p err |
Try reconfiguring without optimization and see if it provides better info: |
I did, and I see the following. I set breakpoints at two locations, the two places that print the "Broker: Unknown topic or partition (actions 0x4)" error I see when I run the tests: (gdb) break rdkafka_partition.c:423 %7|1476394628.101|STATE|0001_multiobj#producer-1| [thrd:10.122.71.85:9092/bootstrap]: 10.122.140.20:9092/bootstrap: Broker changed state INIT -> CONNECT Breakpoint 1, rd_kafka_toppar_new0 (rkt=0xc5300000, partition=0, func=0x1012c6a8 <dbsubn+13452> "0001", line=-258295376) at rdkafka_partition.c:423 |
How did you work out where to put the breakpoints? It should be enough to start your program inside gdb (
OK, here is what I see. No breakpoints this time, only Ctrl-C on the hang and then thread apply all bt: (gdb) thread apply all bt Thread 11 (Thread 2314): Thread 10 (Thread 2057): Thread 9 (Thread 1800): Thread 8 (Thread 1543): Thread 7 (Thread 1286): Thread 6 (Thread 1029): Thread 5 (Thread 772): Thread 4 (Thread 515): Thread 3 (Thread 258): Thread 2 (Thread 1): Thread 1 (process 42533896): |
Weird, there should be a bunch of broker threads there. Btw, did you set up a test.conf that points to your own brokers? |
|
Can you run something simpler, like examples/rdkafka_example -b -L ? |
I truncated the output for readability: $ ./example -L -b 10.122.140.84:9092,10.122.140.20:9092,10.122.71.85:9092
|
Try producing and consuming with rdkafka_example too: consume: You can also try running the tests sequentially (instead of parallelized) if threading is a problem: |
BTW, I was running only one test, using the following command: I will try rdkafka_example and let you know soon |
Simple example works:
|
My apologies, the earlier stack was from the 0.9.1 branch. The following is the correct stack on the master branch when the hang happens:
|
Do this in gdb:
|
Is there a way to look at the members?
|
|
Seems like your toolchain doesn't provide debugging info. |
I'm specifying the following for the xlc compiler to generate gdb-compatible debug info. Another weird thing: after I rebuilt, I see the original stack again, the one I thought was not from the master branch. That is, I don't see any broker threads. I tried multiple times and never see broker threads. Am I missing something obvious?
|
I'm sorry but I don't have any experience with that toolchain |
That's OK... I'm just looking at the relevant code: Question: I understand that it is popping all available ops from a queue and calling the provided
|
That loop waits for an op to become available (be enqueued) on rkq_q, up to timeout_ms |
I did further debugging, and here is what I observe about why the hang is happening:
|
Hey Magnus, any insight you could provide would be great. It does not seem to be an endianness issue: I verified that IBM is big-endian and that the code takes the correct execution path when reading int16, int32 and int64. Moreover, the issue is not consistent. It happens most of the time, but not always. It can't be a broker issue, because the Kafka client on a Sun machine works fine with the same brokers. |
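For context on the endianness check: Kafka's wire protocol encodes integers in big-endian (network) byte order. A minimal sketch of host-independent decoding is shown below. The `read_be*` helpers are hypothetical and are not the routines librdkafka itself uses; they only illustrate what "reading int16, int32 and int64 correctly" means on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: decode big-endian integers byte by byte, so the
 * result is the same on big-endian and little-endian hosts. */
static uint16_t read_be16(const unsigned char *p) {
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

static uint32_t read_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p) {
    return ((uint64_t)read_be32(p) << 32) | (uint64_t)read_be32(p + 4);
}
```

Because these helpers never reinterpret memory as a wider integer, they also avoid unaligned loads, which matters on architectures that trap on misaligned access.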
It turns out that I was ignoring a warning about the __thread keyword not being recognized on IBM. I had to pass the -qtls option when compiling the source code, and that fixed the issue. The tests now pass except for 0017-compression, which fails on both IBM and SUN. |
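A quick way to confirm that a toolchain honors `__thread` is a small standalone check like the sketch below. It is not part of librdkafka or its test suite, and the names are hypothetical. It assumes the failure mode is that the variable ends up shared between threads instead of being per-thread.

```c
#include <assert.h>
#include <pthread.h>

/* Each thread should get its own zero-initialized copy of this variable. */
static __thread int tls_counter = 0;

static void *worker(void *arg) {
    int i;
    for (i = 0; i < 1000; i++)
        tls_counter++;
    *(int *)arg = tls_counter; /* report what this thread ended up with */
    return NULL;
}

/* Returns 1 if __thread yields per-thread storage, 0 otherwise. */
static int tls_is_per_thread(void) {
    pthread_t t;
    int seen = -1;

    tls_counter = 7; /* the calling thread's private value */
    if (pthread_create(&t, NULL, worker, &seen) != 0)
        return 0;
    pthread_join(t, NULL);

    /* With working TLS the worker starts from 0 and the caller's value is
     * untouched. If the variable were shared, the worker would start
     * from 7 and the caller would see the worker's increments. */
    return seen == 1000 && tls_counter == 7;
}
```

Calling `tls_is_per_thread()` from a small `main` and printing the result, once with and once without the `-qtls` option mentioned above, should show whether thread-local storage is in effect for that build.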
@abhisharma Great that you found the issue! Which compression algorithm fails on test 0017? |
I think it is snappy, since the following assertion failure causes the test to crash: |
The Snappy code base is not trivial to grasp, but there have been previous porting problems with that code (for Win32). |
Description
I tried to build and test 32-bit librdkafka on an IBM machine using the native compiler. Here is what I observed:
Librdkafka-0.9.1
Librdkafka-master
386258.376|METADATA|0001_multiobj#producer-4| 10.122.140.84:9092/bootstrap: Topic bib_rnd64b000005215_0001 partition 1 Leader 0
%7|1476386258.376|METADATA|0001_multiobj#producer-4| 10.122.140.84:9092/bootstrap: Topic bib_rnd64b000005215_0001 partition 3 Leader 2
%7|1476386258.376|METADATA|0001_multiobj#producer-4| 10.122.140.84:9092/bootstrap: Topic bib_rnd64b000005215_0001 partition 0 Leader 2
%7|1476386258.376|METADATA|0001_multiobj#producer-4| 10.122.140.84:9092/bootstrap: Requested topic bib_rnd64b000005215_0001 seen in metadata
%7|1476386259.297|PRODUCE|0001_multiobj#producer-4| mrplnjdmrsmr02:9092/2: produce messageset with 1 messages (64 bytes)
%7|1476386259.299|MSGSET|0001_multiobj#producer-4| mrplnjdmrsmr02:9092/2: MessageSet with 1 message(s) delivered
%7|1476386259.299|REQERR|0001_multiobj#producer-4| mrplnjdmrsmr02:9092/2: ProduceRequest failed: Broker: Unknown topic or partition: explicit actions 0x4
%7|1476386259.299|MSGSET|0001_multiobj#producer-4| mrplnjdmrsmr02:9092/2: MessageSet with 1 message(s) encountered error: Broker: Unknown topic or partition (actions 0x4)
%7|1476386259.299|METADATA|0001_multiobj#producer-4| mrplnydmrsmr02:9092/0: Request metadata for bib_rnd64b000005215_0001: leader query: scheduled: not in broker thread
%7|1476386259.377|METADATA|0001_multiobj#producer-4| mrplnydmrsmr02:9092/0: Request metadata for bib_rnd64b000005215_0001: leader query
%7|1476386259.377|METADATA|0001_multiobj#producer-4| mrplnydmrsmr02:9092/0: Request metadata for bib_rnd64b000005215_0001: leader query
%7|1476386259.377|METADATA|0001_multiobj#producer-4| mrplnydmrsmr02:9092/0: ===== Received metadata =====
%7|1476386259.378|METADATA|0001_multiobj#producer-4| mrplnydmrsmr02:9092/0: 3 brokers, 1 topics
I noticed that, in response to issue #319, memory-alignment fixes were made for Solaris. Was the same fix ever tested on IBM machines? Any help will be appreciated.