Demonstrate and try to fix a possible bug with a "lost" fiber #3444
Conversation
I'm inclined to say this is the root of the bug, yes. I'm trying to think where to fix it exactly. The other alternative is to modify
Great work @durban. Thank you so much.
This most likely won't work because
Outstanding work. Genuinely. I'm publishing a snapshot right now to test with the OG reproduction.
The other real alternative solution is to keep
I think the performance impact of that in the happy path would be slightly higher than this approach, which only penalizes
Published this branch as
This fix is great and we most likely should go through with it, pending some more execution time on an EC2 instance, just in case. However, we should probably reexamine whether
Furthermore, the worker thread run loop has gained a few new features, which could make the benchmark win with
The CI failure seems unrelated.
Yeah, we have some sneaky-flaky tests right now. Need to fix.
I swapped the
We're pretty sure this bug caused many of our production services to constantly freeze up. The services were
We've deployed this CE version to multiple consumers and will report back.
@CremboC That definitely sounds like a smoking gun to me! Please do report back with your experiences, positive or (even more importantly!) negative.
@djspiewak unfortunately it seems like it happened again, even with 3.4.8. It just stopped consuming from Kafka and didn't recover (although I have implemented automated restarting if it does get stuck, so maybe it would have recovered...?)
Concerning. There may be another issue here. Can you open a new ticket? The PR definitely fixes this reproducer (and a legit bug). I think the first step would be to try to get a fiber dump while your consumer is hung.
This might be related to #3392, or may be something else entirely (I couldn't reproduce that one, so I can't check).
The problem:
The first 2 commits (f204467 and b39cdd0) demonstrate the possible problem with a quick and dirty program; run it with `coreJVM/Test/run` (it is recommended to set `core.jvm/Test/fork := true` beforehand, because the program needs to be killed). This will print a number every second. In some runs the printed number will always be increasing (this is correct), but sometimes it will always print the same number (this is incorrect).
There is a fiber `fib1` which should increment the number forever. So this probably means that, for some reason, this fiber is not scheduled (any more). A heap dump shows references to this fiber from the main thread (the `fib1` local), from a `_join` (probably the joining fiber in line 33), and through a cycle via (probably) its own `_cancel`.
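For reference, a minimal sketch of what such a reproducer could look like, assuming cats-effect 3. This is not the code from the commits above; the object name, the counter, the timings, and the blocking tasks are all illustrative only.

```scala
import cats.effect.{IO, IOApp}
import scala.concurrent.duration._

object LostFiberRepro extends IOApp.Simple {

  def run: IO[Unit] =
    IO.ref(0L).flatMap { counter =>
      // The fiber that should increment the number forever; on the compute pool
      // it periodically autocedes, which is what fills the worker's cedeBypass slot.
      val incrementForever: IO[Nothing] =
        counter.update(_ + 1L).foreverM

      // Observer: print the current number once per second.
      val printEverySecond: IO[Nothing] =
        (IO.sleep(1.second) *> counter.get.flatMap(n => IO.println(n))).foreverM

      // Blocking tasks make a WorkerThread hand off its data structures to a
      // replacement thread, which is where cedeBypass can get lost.
      val someBlocking: IO[Unit] =
        IO.blocking(Thread.sleep(100L)).replicateA(10).void

      for {
        fib1 <- incrementForever.start
        _    <- printEverySecond.start
        _    <- someBlocking
        _    <- fib1.join // never completes, so the program has to be killed
      } yield ()
    }
}
```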
What I think happens:
After the forever-incrementing fiber autocedes (filling `cedeBypass`), the `WorkerThread` can go to state `0` and dequeue a `blocking` task from the external queue. It runs this `blocking` task and passes its internal structures to a new/cached thread that takes its place. However, its `cedeBypass` is "logically" part of the local queue, but it is not passed to the replacement thread. After finishing the blocking work, the original thread becomes cached, and (in the test, after 3 seconds) it shuts itself down. Thus, its `cedeBypass` becomes forever "lost".
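To make the handoff concrete, here is a deliberately simplified, hypothetical model of the worker's state. It is not the real `WorkerThread`/`LocalQueue` code, just an illustration of where the fiber gets dropped.

```scala
import scala.collection.mutable

// Toy stand-in for the real worker state: a local queue of fibers plus the
// one-element cedeBypass slot that an autoceded fiber is parked in.
final class WorkerState(
    val localQueue: mutable.Queue[Runnable],
    var cedeBypass: Runnable // null when empty, mirroring the real field
) {
  // Buggy handoff: only the local queue (and the rest of the state) travels to
  // the replacement thread; whatever sits in cedeBypass stays behind on a
  // thread that will later become cached and shut down, so that fiber is lost.
  def handOffToReplacement(): WorkerState =
    new WorkerState(localQueue, null)
}
```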
A possible/incomplete fix:
The 3rd and 4th commits (b9dabc5 and 9e106d9) try to fix this problem. Before passing its things to the replacement thread, the `WorkerThread` checks whether there is a `cedeBypass`; if yes, it enqueues it into the local queue, thus safely(?) passing it on to the replacement thread.
This seems to fix the problem in the simple demonstration, but I did not do a lot of testing otherwise.
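In terms of the toy model above, the fix amounts to flushing `cedeBypass` into the local queue right before the handoff. Again, this is only a sketch of the idea, with illustrative names, not the actual patch.

```scala
import scala.collection.mutable

object CedeBypassFlush {
  // Sketch of the fix: before handing the local queue to the replacement
  // thread, a non-null cedeBypass is flushed into the queue, so the
  // replacement thread will eventually run that fiber as well.
  def flushBeforeHandOff(
      localQueue: mutable.Queue[Runnable],
      cedeBypass: Runnable
  ): mutable.Queue[Runnable] = {
    if (cedeBypass ne null)
      localQueue.enqueue(cedeBypass)
    localQueue
  }
}
```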
One possible problem with the fix is that it doesn't give the `cedeBypass` fiber higher priority than the external queue; I don't know whether that's needed.