Starting with 8e20e1ee, after broker goes down and back up, rd_kafka_destroy of groupconsumer hangs #4674
Comments
@Quuxplusone thanks for the description of the problem. Will try to reproduce it; maybe it's linked to stopping and restarting the broker. Could you try it without …
With vanilla librdkafka v2.3.0, but with our custom …
@Quuxplusone we're also facing similar issues in our use cases, but it's very sporadic. We're working on a stable reproducer to move further. Do you have a public reproducer that you could share? Currently, I'm trying with the following test setup, but the issue reproduces only very rarely:

Any ideas for improving the reproducer would be very helpful.
Nope, I never came up with a reproducer. But we haven't seen the problem at all since we applied #4667.
Thanks for the inputs @Quuxplusone. Using the steps mentioned in my previous comment, I could sporadically reproduce the issue. I tried collecting …

Is it of any help in debugging the root cause? This memory leak isn't present when the test program exits cleanly. Note that I've also used @Mekk's patch to trigger the memory leak; otherwise it would be very difficult to identify among the dozens of objects in the heap.
@Quuxplusone could you verify these things?
Could you explain how to verify that? (Especially since in this case we have multiple independent …)
@Quuxplusone given 8e20e1e adds a reference from an …

I've created PR #4724 to fix that case I was mentioning in #4667. You can try it to see if it fixes your case.
@Quuxplusone can this be closed after merging the fix? Does #4724 solve your case?
@emasab: I haven't tested #4724. At the moment, it has merge conflicts. Our working hypothesis is still that #4667 fixes a leak of one …

When there's a new release of librdkafka, we'll certainly try upgrading to it, and then we can see whether it's safe for us to revert #4667 or whether we still see the problem. But #4724 isn't even in master yet, let alone in a new release, right?

I suggest that you should at least try to figure out whether #4724 is correct; if it is correct, then merge it; if it's incorrect, don't merge it. This should be done completely independently of whether it happens to fix #4674 or not.
@Quuxplusone #4667 fixes the leak because it removes the reference to the toppar, but that reference was introduced to fix a bug: if the barrier doesn't contain the toppar but only the epoch, the code removes messages with that epoch even if they are from a different partition, and those messages are never delivered. So the fix cannot be reverted.

#4724 is correct in the sense that it fixes the hang on destroy that happened sporadically in test 0113, and it can fix cases similar to that test. I'll also try to reproduce the issue, before or after the fix, using the instructions from @Quuxplusone or @Mrigank11.
I'm pulling this out into a new Issue, because we keep discussing it in different random PRs/issues and it'll help to have just one place to discuss it.
On #4667 I wrote: …
Then in #4669 (comment) I wrote: …
Then @emasab asked: …
I can't get super detailed (because the layers of abstraction between librdkafka and the level of the test itself are all three of confusing/internal/proprietary), but I'm pretty sure we don't know anything about "assignors" so we're doing whatever the default is, there. The structure of our test is more or less:
- We have two `rd_kafka_t`s in the same process: one a groupconsumer on topic T1 and broker B1, and the other a producer on topic T2 on broker B2. (Actually, make that four: we have another (non-group) consumer running, and another producer, too, for other purposes. I think it's just four total.)
- The groupconsumer is configured with `enable.auto.commit=false`, `enable.auto.offset.store=false`, and `auto.offset.reset=smallest`. Also, `rd_kafka_conf_set_rebalance_cb` is set to a callback that looks for `RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS` and remembers how many partitions it saw.
- We call `rd_kafka_consumer_poll` until either it returns NULL or until it has returned `RD_KAFKA_RESP_ERR__PARTITION_EOF` on T1 as many times as there are partitions. (I suspect this second condition, and thus the rebalance callback, is irrelevant; but I don't know.)
- (Offsets are committed with `rd_kafka_commit(rk, toppar, true)`, where `toppar` is freshly created with `rd_kafka_topic_partition_list_new` and destroyed with `rd_kafka_topic_partition_list_destroy` immediately afterward.)
- We call `rd_kafka_consumer_close(rk)`. This seems to return `RD_KAFKA_RESP_ERR_NO_ERROR`.
- We call `rd_kafka_destroy(rk)`. This blocks here... waiting for `rd_kafka_thread_main` to exit, but it's blocked... on `rd_kafka_broker_thread_main`, but I'm not sure which thread. We have a total of 35 threads still extant at this point. Remember that the nongroup consumer and the two producers are all still running.

Anyway, I don't think this gives you enough to go on, but at least we have a central place to talk about the hang now, instead of scattering it over different PRs.
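For concreteness, here is a minimal sketch of the groupconsumer lifecycle described in the list above. It is not the reporter's actual test code: the broker address, topic name, group id, poll timeout, and the exact commit placement are placeholder assumptions, and it additionally assumes `enable.partition.eof=true` (so `RD_KAFKA_RESP_ERR__PARTITION_EOF` is surfaced at all) and the default eager assignor (so `rd_kafka_assign` is the right call in the rebalance callback).

```c
/* Hypothetical sketch only; build with: gcc sketch.c -lrdkafka */
#include <librdkafka/rdkafka.h>
#include <stdio.h>

static int assigned_partition_cnt = 0;

/* Rebalance callback: remember how many partitions were assigned
 * (the "remembers how many partitions it saw" part of the description). */
static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
    (void)opaque;
    if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS) {
        assigned_partition_cnt = partitions->cnt;
        rd_kafka_assign(rk, partitions);
    } else {
        rd_kafka_assign(rk, NULL);
    }
}

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    rd_kafka_conf_set(conf, "bootstrap.servers", "B1:9092", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "group.id", "test-group", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "enable.auto.commit", "false", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "enable.auto.offset.store", "false", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "auto.offset.reset", "smallest", errstr, sizeof(errstr));
    /* Assumed: without this, _PARTITION_EOF is never delivered to the poll loop. */
    rd_kafka_conf_set(conf, "enable.partition.eof", "true", errstr, sizeof(errstr));
    rd_kafka_conf_set_rebalance_cb(conf, rebalance_cb);

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
    rd_kafka_poll_set_consumer(rk);

    rd_kafka_topic_partition_list_t *topics = rd_kafka_topic_partition_list_new(1);
    rd_kafka_topic_partition_list_add(topics, "T1", RD_KAFKA_PARTITION_UA);
    rd_kafka_subscribe(rk, topics);
    rd_kafka_topic_partition_list_destroy(topics);

    /* Poll until a poll returns NULL, or until we have seen EOF on as many
     * partitions as the rebalance callback reported. */
    int eof_cnt = 0;
    for (;;) {
        rd_kafka_message_t *m = rd_kafka_consumer_poll(rk, 1000);
        if (!m)
            break;
        if (m->err == RD_KAFKA_RESP_ERR__PARTITION_EOF) {
            eof_cnt++;
            if (assigned_partition_cnt > 0 && eof_cnt >= assigned_partition_cnt) {
                rd_kafka_message_destroy(m);
                break;
            }
        } else if (m->err == RD_KAFKA_RESP_ERR_NO_ERROR) {
            /* Manual commit with a freshly created, immediately destroyed
             * topic-partition list, as in the report. */
            rd_kafka_topic_partition_list_t *toppar =
                rd_kafka_topic_partition_list_new(1);
            rd_kafka_topic_partition_list_add(
                toppar, rd_kafka_topic_name(m->rkt), m->partition)
                ->offset = m->offset + 1;
            rd_kafka_commit(rk, toppar, 1 /* async */);
            rd_kafka_topic_partition_list_destroy(toppar);
        }
        rd_kafka_message_destroy(m);
    }

    /* Somewhere during the run, the broker is stopped and restarted
     * externally (the condition described in the issue title). */
    fprintf(stderr, "consumer_close: %s\n",
            rd_kafka_err2str(rd_kafka_consumer_close(rk)));
    rd_kafka_destroy(rk); /* <-- the call reported to hang */
    return 0;
}
```

Per the report, `rd_kafka_consumer_close()` returns `RD_KAFKA_RESP_ERR_NO_ERROR`, and the hang is in the final `rd_kafka_destroy()` after the broker has been bounced.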