
Demonstrate and try to fix a possible bug with a "lost" fiber #3444

Merged
merged 8 commits into typelevel:series/3.4.x on Feb 21, 2023

Conversation

@durban (Contributor) commented Feb 20, 2023

This might be related to #3392, or may be something else entirely (I couldn't reproduce that one, so I can't check).

The problem:

The first 2 commits (f204467 and b39cdd0) demonstrate the possible problem with a quick-and-dirty program; run it with coreJVM/Test/run (it's recommended to set core.jvm/Test/fork := true beforehand, because the process needs to be killed). This will print a number every second. In some runs the number is always increasing (this is correct):

[info] running (fork) cats.effect.unsafe.WstpSpec 
[info] Hello!
[info] 9682305
[info] 21700806
[info] 33978091
...

But sometimes it will always print the same number (this is incorrect):

[info] running (fork) cats.effect.unsafe.WstpSpec 
[info] Hello!
[info] 120
[info] 120
[info] 120
[info] 120
[info] 120
[info] 120
...

There is a fiber fib1 which should increment the number forever. So this probably means that, for some reason, this fiber is not scheduled any more. A heap dump shows references to this fiber from the main thread (the fib1 local), from a _join (probably the joining fiber in line 33), and a cycle through (probably) its own _cancel.
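
Roughly, the shape of the reproducer is something like the following hypothetical sketch (the real program is WstpSpec in the commits above; the object name, the AtomicLong counter and the 100 ms blocking loop here are illustrative assumptions, not the actual test code):

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import java.util.concurrent.atomic.AtomicLong

// Rough sketch only: one fiber increments a counter forever (so it will
// eventually autocede), blocking tasks keep forcing worker-thread hand-offs,
// and the main thread prints the counter once per second.
object LostFiberDemo {
  def main(args: Array[String]): Unit = {
    val counter = new AtomicLong(0L)

    // fib1: should keep incrementing forever; if it gets "lost",
    // the printed number stops changing.
    IO(counter.incrementAndGet()).foreverM.unsafeRunAndForget()

    // Keep submitting blocking tasks so that worker threads regularly
    // hand off their internal structures to replacement threads.
    IO.blocking(Thread.sleep(100L)).foreverM.unsafeRunAndForget()

    println("Hello!")
    while (true) {
      Thread.sleep(1000L)
      println(counter.get())
    }
  }
}
```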

What I think happens:

After the forever-incrementing fiber autocedes (filling cedeBypass), the WorkerThread can go to state 0 and dequeue a blocking task from the external queue. It runs this blocking task, and passes its internal structures to a new/cached thread to take its place. However, its cedeBypass is "logically" part of the local queue, yet it is not passed to the replacement thread. After finishing the blocking task, the thread becomes cached, and (in the test, after 3 seconds) it shuts itself down. Thus, its cedeBypass is "lost" forever.
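
To make the scenario concrete, here is a purely illustrative Scala sketch of that hand-off; WorkerThreadSketch, runBlocking and spawnReplacement are made-up names for illustration only and do not correspond to the real WorkerThread internals:

```scala
import scala.collection.mutable.ArrayDeque

// Illustration only: a simplified model of the hand-off described above.
final class WorkerThreadSketch {
  // A fiber stashed by an autocede, to be run on a later iteration.
  var cedeBypass: Runnable = null
  // The thread's local queue of fibers.
  val localQueue: ArrayDeque[Runnable] = ArrayDeque.empty[Runnable]

  // Called when this thread picks up a blocking task: it hands its local
  // queue to a replacement thread and then runs the blocking task itself.
  def runBlocking(task: Runnable, spawnReplacement: ArrayDeque[Runnable] => Unit): Unit = {
    spawnReplacement(localQueue) // note: cedeBypass is NOT handed over
    task.run()
    // Afterwards this thread is cached and may eventually shut down,
    // taking the fiber still referenced by cedeBypass with it.
  }
}
```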

A possible/incomplete fix:

The 3rd and 4th commits (b9dabc5 and 9e106d9) try to fix this problem. Before passing its internal structures to the replacement thread, the WorkerThread checks whether there is a cedeBypass; if there is, it enqueues it into the local queue, thus safely(?) passing it on to the replacement thread.

This seems to fix the problem in the simple demonstration, but I did not do a lot of testing otherwise.

One possible problem with the fix is that it doesn't give the cedeBypass higher priority than the external queue; I don't know whether that's needed.
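
In terms of the illustrative WorkerThreadSketch above, the idea of the fix is roughly the following (again, these names are hypothetical and this is not the actual patch):

```scala
// Illustration of the fix idea: flush cedeBypass into the local queue
// before handing the queue over, so the stashed fiber cannot be dropped.
def runBlocking(task: Runnable, spawnReplacement: ArrayDeque[Runnable] => Unit): Unit = {
  if (cedeBypass ne null) {
    localQueue.append(cedeBypass) // it is "logically" part of the local queue anyway
    cedeBypass = null
  }
  spawnReplacement(localQueue)
  task.run()
}
```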

@vasilmkd (Member)

I'm inclined to say this is the root of the bug, yes. I'm trying to think where to fix it exactly.

The other alternative is to modify WorkerThread#reschedule to also take into consideration the state & ExternalQueueTicksMask value and only mutate the cedeBypass if safe to do so (if we're sure we're not about to inspect the other queues).

@vasilmkd (Member)

Great work @durban. Thank you so much.

@vasilmkd (Member)

> The other alternative is to modify WorkerThread#reschedule to also take into consideration the state & ExternalQueueTicksMask value and only mutate the cedeBypass if safe to do so (if we're sure we're not about to inspect the other queues).

This most likely won't work because state is a local variable in run.

djspiewak previously approved these changes Feb 20, 2023

@djspiewak (Member) left a comment

Outstanding work. Genuinely. I'm publishing a snapshot right now to test with the OG reproduction.

Review comment on core/jvm/src/test/scala/cats/effect/unsafe/WstpSpec.scala (outdated, resolved)
@vasilmkd (Member)

The other realistic alternative solution is to keep reschedule as it is, and explicitly move cedeBypass to the local queue in case 0. All other cases should be covered, because they originate either from a failed dequeue in case _ or from being woken up from sleep.

@djspiewak (Member)

I think the performance impact of that in the happy path would be slightly higher than this approach, which only penalizes blockOn.

@djspiewak (Member) commented Feb 20, 2023

Published this branch as 3.4.7-34-defc3a4

@vasilmkd (Member) commented Feb 20, 2023

This fix is great and we most likely should go through with it, pending some more execution time on an EC2 instance, just in case.

However, we should probably reexamine whether cedeBypass is actually worth the maintenance headaches compared to the small benchmark win it provides in return. I'm personally not comfortable with the fact that it knowingly breaks (my personal) expectations of when it is set and unset.

Furthermore, the worker thread run loop has gained a few new features, which could make the benchmark win with cedeBypass irrelevant.

vasilmkd previously approved these changes Feb 20, 2023
@durban dismissed stale reviews from vasilmkd and djspiewak via 7c6e69a on February 20, 2023, 23:55
@djspiewak (Member)

#3445

@durban (Contributor, Author) commented Feb 21, 2023

The CI failure seems unrelated (IOAppSpec).

@djspiewak (Member) commented Feb 21, 2023

> The CI failure seems unrelated (IOAppSpec).

Yeah we have some sneaky-flaky tests right now. Need to fix.

@djspiewak (Member)

Okay so thanks to @vasilmkd's creative experimentation, we have really good evidence that this bug is indeed happening in #3392. I have a couple EC2 instances churning on reproductions with various patches, but we're pretty certain this is the problem. Outstanding work, @durban!

djspiewak previously approved these changes Feb 21, 2023
@djspiewak (Member)

I swapped the main method for an automated test. It actually wasn't that hard to reproduce in this fashion, though notably, if you formulate the whole thing as an IO, it doesn't work.

@djspiewak marked this pull request as ready for review on February 21, 2023, 02:29
@djspiewak merged commit 5fada62 into typelevel:series/3.4.x on Feb 21, 2023
@CremboC commented Mar 17, 2023

We're pretty sure this bug caused many of our production services to constantly freeze up. The services were fs2-kafka consumers, but notably they would only freeze up after we implemented groupWithin and commitBatch logic. The symptom was a sudden drop to 0% CPU, with no error raised whatsoever. Even Kafka itself reported the consumer as "healthy" (implying heartbeats were still being sent).

We've deployed this CE version to multiple consumers and will report back.

@djspiewak (Member)

@CremboC That definitely sounds like a smoking gun to me! Please do report back with your experiences, positive or (even more importantly!) negative.

@CremboC commented Mar 20, 2023

@djspiewak Unfortunately, it seems like it happened again, even with 3.4.8. It just stopped consuming from Kafka and didn't recover (although I have implemented an automated restart for when it gets stuck, so maybe it would have recovered..?)

@djspiewak (Member)

Concerning. There may be another issue here. Can you open a new ticket? The PR definitely fixes this reproducer (and a legit bug). I think the first step would be to try to get a fiber dump while your consumer is hung.

Successfully merging this pull request may close these issues: CPU starvation causes problems with scheduling fairness (compared to CE2)