segfault in rd_kafka_msgq_set_metadata() from /lib64/librdkafka.so.1 (corrupt TAILQ) #3559
Comments
Interesting, never seen that crash before. Is it the last msg_t in that queue that has the invalid next pointer?
IIRC the queue had 36 entries and the invalid next pointer was around entry 26.
Do you have any idea if there are any other events around the same time this happens? Are you doing anything special in your application?
I just realized I may have the wrong timestamp on the logs; give me a few minutes to correct that.
Ok, I corrected the timestamp and included the correct section of broker logs. Our application uses a single Zookeeper instance co-located with a single Kafka broker, on the same machine as this producer. I don't see anything obvious happening around this point in time in the other parts of the system. FWIW, syslog shows this at the time of the crash:
Could it be OOM?
Not an OOM; we crash on OOM. Also, here are the sar memory stats for that system around the time of the segfault:
BTW, I don't know if it matters or is helpful, but the thread that generated the core was named "rdk:broker0".
Additional context @edenhill - the assert that fired was exactly this one:
We have multiple core dumps for this issue (same company as the original poster here - John Cagle). The one I looked at showed the TAILQ pointers as:
You can see here that both first and last pointers look ok, but first points to the same value as rkm:
But last (the last element's next pointer) points to null:
And the address of that pointer is:
which is not in the first element's space:
You can see these are far enough apart in the address space that it's highly likely there's more than one element in the queue. This tells me the queue itself is not necessarily corrupt, but rather that the current message itself got corrupted:
I hope this information is helpful.
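For readers following the pointer values above, here is a minimal sketch of how the head's first/last pointers and each element's next/prev links are laid out, assuming the BSD sys/queue.h-style TAILQ macros the issue title refers to. The struct and field names are the generic ones, not librdkafka's; the real rd_kafka_msg_t carries many more fields.

```c
#include <sys/queue.h>

/* Illustrative stand-in for the message type. */
struct msg {
        TAILQ_ENTRY(msg) link;   /* struct msg *tqe_next;   -> the per-message "next"  */
                                 /* struct msg **tqe_prev;  -> back-link to the
                                  *    previous element's tqe_next field               */
        /* ... payload, offset, timestamps ... */
};

/* Illustrative stand-in for the queue head. */
TAILQ_HEAD(msgq, msg);           /* struct msg *tqh_first;  -> "first"                 */
                                 /* struct msg **tqh_last;  -> points at the last
                                  *    element's tqe_next field, which is NULL in a
                                  *    well-formed queue, as noted above               */
```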
@edenhill - here's some additional insightful info - it occurred to me today that if this is a memory stomp, we might be able to get a clue about where it came from by just dumping the memory of the rkm object in hex byte format:
It kind of looks like a memory stomp to me. The first two lines are:
The first two fields of the object are not even close to the same type, yet the first and second 8-byte words contain very similar data. Too close to be a coincidence (imho). I will admit this could be us doing the memory stomp as much as it could be you. I'm posting it here in case the values look familiar to you.
Is it possible to reproduce this with ASan enabled, or by using Valgrind?
@edenhill - it's very difficult to reproduce it at all. These issues actually occurred on two separate customer sites at around the same time (coincidentally), so it's unlikely we'd be able to get it to happen in the lab under ASAN or Valgrind.
@edenhill - a colleague also looking at the issue pointed out to me that what I earlier referred to as a memory stomp looks more like a use-after-free scenario. The pointer-sized words at the start of the corrupted object look very much like what you'd find in a block that had been freed and re-added to a free-block list on the heap. Take those two 8-byte values, flip their byte order around, and you have:
These look like addresses to me. Assuming this is the problem, this could only be caused by a bug in librdkafka.
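To illustrate the byte-order point (a hedged sketch, not taken from the report; the address value below is made up): on a little-endian machine, a pointer that an allocator has written into freed memory appears byte-reversed when the block is dumped byte by byte, so flipping each 8-byte group recovers a plausible address.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
        /* Pretend this heap block was freed and the allocator stored a
         * free-list pointer in its first 8 bytes (hypothetical address). */
        unsigned char block[16] = {0};
        uintptr_t freelist_ptr = (uintptr_t)0x00007f3a12345678ULL;
        memcpy(block, &freelist_ptr, sizeof freelist_ptr);

        /* A byte-by-byte hexdump shows the pointer "flipped around" on a
         * little-endian machine: 78 56 34 12 3a 7f 00 00 */
        for (size_t i = 0; i < sizeof freelist_ptr; i++)
                printf("%02x ", block[i]);
        printf("\n");
        return 0;
}
```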
Description
I'm seeing this issue on multiple systems; this is just one occurrence (with the same traceback):
Time of occurrence: 2021-08-29 23:02:09 UTC
Core inspection seems to indicate that librdkafka is walking through an internal queue when it hits an invalid pointer:
Following it gives us the invalid pointer that we were looking at above.
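To illustrate what such a walk looks like (a minimal sketch using the generic sys/queue.h macros; this is not the actual rd_kafka_msgq_set_metadata() source, and the struct and field names are made up), a TAILQ traversal dereferences each element's next link, so a single overwritten link is enough to fault partway through the queue:

```c
#include <sys/queue.h>

struct msg {
        TAILQ_ENTRY(msg) link;  /* holds the next/prev pointers */
        long offset;            /* illustrative metadata field  */
};
TAILQ_HEAD(msgq, msg);

/* Stamps metadata on every message in the queue. If any element's
 * link.tqe_next has been overwritten with garbage, the iteration after
 * that element dereferences the garbage pointer and segfaults. */
static void stamp_metadata(struct msgq *q, long base_offset)
{
        struct msg *m;

        TAILQ_FOREACH(m, q, link)
                m->offset = base_offset++;
}
```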
How to reproduce
This doesn't happen very often, so I am unsure how to reproduce it.
Broker log excerpts
Checklist
librdkafka version: v1.7.0
Apache Kafka version: 0.11.0.1
librdkafka client configuration (producer): queue.buffering.max.messages=500000,message.send.max.retries=3,retry.backoff.ms=500 (see the configuration sketch below)
Operating system: CentOS7
Provide logs (with debug=.. as necessary) from librdkafka
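Not part of the original report: a minimal sketch, using the standard librdkafka C API, of how the producer configuration listed in the checklist above would be applied. The property names and values are taken from the report; everything else (bootstrap servers, error handling) is illustrative.

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Producer settings from the issue checklist, plus an illustrative
         * bootstrap.servers value. */
        const char *props[][2] = {
                { "queue.buffering.max.messages", "500000" },
                { "message.send.max.retries",     "3"      },
                { "retry.backoff.ms",             "500"    },
                { "bootstrap.servers",            "localhost:9092" },
        };

        for (size_t i = 0; i < sizeof(props) / sizeof(props[0]); i++) {
                if (rd_kafka_conf_set(conf, props[i][0], props[i][1],
                                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                        fprintf(stderr, "conf_set %s failed: %s\n",
                                props[i][0], errstr);
                        return 1;
                }
        }

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }

        /* ... produce messages, poll, then flush and clean up ... */
        rd_kafka_flush(rk, 5000);
        rd_kafka_destroy(rk);
        return 0;
}
```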