
Demonstrate and try to fix a possible bug with a "lost" fiber #3444

Merged
merged 8 commits into typelevel:series/3.4.x on Feb 21, 2023

Conversation

@durban (Contributor) commented Feb 20, 2023

This might be related to #3392, or may be something else entirely (I couldn't reproduce that one, so I can't check).

The problem:

The first 2 commits (f204467 and b39cdd0) demonstrate the possible problem with a quick-and-dirty program; run it with coreJVM/Test/run (it's recommended to set core.jvm/Test/fork := true beforehand, because the process needs to be killed). This will print a number every second. In some runs the number is always increasing (this is correct):

[info] running (fork) cats.effect.unsafe.WstpSpec 
[info] Hello!
[info] 9682305
[info] 21700806
[info] 33978091
...

But sometimes it will always print the same number (this is incorrect):

[info] running (fork) cats.effect.unsafe.WstpSpec 
[info] Hello!
[info] 120
[info] 120
[info] 120
[info] 120
[info] 120
[info] 120
...

There is a fiber fib1 which should increment the number forever. So this probably means that, for some reason, this fiber is not scheduled any more. A heap dump shows references to this fiber from the main thread (the fib1 local), from a _join (probably the joining fiber in line 33), and a cycle through (probably) its own _cancel.
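
Roughly, the shape of the reproducer is something like the following hypothetical sketch (the real program is WstpSpec in the commits above; the object name, the AtomicLong counter and the 100 ms blocking loop here are illustrative assumptions, not the actual test code):

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import java.util.concurrent.atomic.AtomicLong

// Rough sketch only: one fiber increments a counter forever (so it will
// eventually autocede), blocking tasks keep forcing worker-thread hand-offs,
// and the main thread prints the counter once per second.
object LostFiberDemo {
  def main(args: Array[String]): Unit = {
    val counter = new AtomicLong(0L)

    // fib1: should keep incrementing forever; if it gets "lost",
    // the printed number stops changing.
    IO(counter.incrementAndGet()).foreverM.unsafeRunAndForget()

    // Keep submitting blocking tasks so that worker threads regularly
    // hand off their internal structures to replacement threads.
    IO.blocking(Thread.sleep(100L)).foreverM.unsafeRunAndForget()

    println("Hello!")
    while (true) {
      Thread.sleep(1000L)
      println(counter.get())
    }
  }
}
```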

What I think happens:

After the forever-incrementing fiber autocedes (filling cedeBypass), the WorkerThread can go to state 0 and dequeue a blocking task from the external queue. It runs this blocking task, and passes its internal structures to a new/cached thread to take its place. However, its cedeBypass is "logically" part of the local queue, yet it is not passed to the replacement thread. After finishing the blocking task, the thread becomes cached, and (in the test, after 3 seconds) it shuts itself down. Thus, its cedeBypass is "lost" forever.
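
To make the scenario concrete, here is a purely illustrative Scala sketch of that hand-off; WorkerThreadSketch, runBlocking and spawnReplacement are made-up names for illustration only and do not correspond to the real WorkerThread internals:

```scala
import scala.collection.mutable.ArrayDeque

// Illustration only: a simplified model of the hand-off described above.
final class WorkerThreadSketch {
  // A fiber stashed by an autocede, to be run on a later iteration.
  var cedeBypass: Runnable = null
  // The thread's local queue of fibers.
  val localQueue: ArrayDeque[Runnable] = ArrayDeque.empty[Runnable]

  // Called when this thread picks up a blocking task: it hands its local
  // queue to a replacement thread and then runs the blocking task itself.
  def runBlocking(task: Runnable, spawnReplacement: ArrayDeque[Runnable] => Unit): Unit = {
    spawnReplacement(localQueue) // note: cedeBypass is NOT handed over
    task.run()
    // Afterwards this thread is cached and may eventually shut down,
    // taking the fiber still referenced by cedeBypass with it.
  }
}
```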

A possible/incomplete fix:

The 3rd and 4th commits (b9dabc5 and 9e106d9) try to fix this problem. Before passing its internal structures to the replacement thread, the WorkerThread checks whether there is a cedeBypass; if there is, it enqueues it into the local queue, thus safely(?) passing it on to the replacement thread.

This seems to fix the problem in the simple demonstration, but I did not do a lot of testing otherwise.

One possible problem with the fix is that it doesn't give the cedeBypass higher priority than the external queue; I don't know whether that's needed.
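
In terms of the illustrative WorkerThreadSketch above, the idea of the fix is roughly the following (again, these names are hypothetical and this is not the actual patch):

```scala
// Illustration of the fix idea: flush cedeBypass into the local queue
// before handing the queue over, so the stashed fiber cannot be dropped.
def runBlocking(task: Runnable, spawnReplacement: ArrayDeque[Runnable] => Unit): Unit = {
  if (cedeBypass ne null) {
    localQueue.append(cedeBypass) // it is "logically" part of the local queue anyway
    cedeBypass = null
  }
  spawnReplacement(localQueue)
  task.run()
}
```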

@vasilmkd (Member)

I'm inclined to say this is the root of the bug, yes. I'm trying to think where to fix it exactly.

The other alternative is to modify WorkerThread#reschedule to also take into consideration the state & ExternalQueueTicksMask value and only mutate the cedeBypass if safe to do so (if we're sure we're not about to inspect the other queues).

@vasilmkd (Member)

Great work @durban. Thank you so much.

@vasilmkd (Member)

> The other alternative is to modify WorkerThread#reschedule to also take into consideration the state & ExternalQueueTicksMask value and only mutate the cedeBypass if safe to do so (if we're sure we're not about to inspect the other queues).

This most likely won't work because state is a local variable in run.

djspiewak previously approved these changes Feb 20, 2023

@djspiewak (Member) left a comment

Outstanding work. Genuinely. I'm publishing a snapshot right now to test with the OG reproduction.

Review comment on core/jvm/src/test/scala/cats/effect/unsafe/WstpSpec.scala (outdated, resolved)
@vasilmkd (Member)

The other realistic alternative solution is to keep reschedule as it is, and explicitly move cedeBypass to the local queue in case 0. All other cases should be covered, because they originate either from a failed dequeue in case _ or from being woken up from sleep.

@djspiewak (Member)

I think the performance impact of that in the happy path would be slightly higher than this approach, which only penalizes blockOn.

@djspiewak (Member) commented Feb 20, 2023

Published this branch as 3.4.7-34-defc3a4

@vasilmkd (Member) commented Feb 20, 2023

This fix is great and we most likely should go through with it, pending some more execution time on an EC2 instance, just in case.

However, we should probably reexamine whether cedeBypass is actually worth the maintenance headaches compared to the small benchmark win it provides in return. I'm personally not comfortable with the fact that it knowingly breaks (my personal) expectations of when it is set and unset.

Furthermore, the worker thread run loop has gained a few new features, which could make the benchmark win with cedeBypass irrelevant.

vasilmkd previously approved these changes Feb 20, 2023
@durban dismissed stale reviews from vasilmkd and djspiewak via 7c6e69a on February 20, 2023, 23:55
@djspiewak (Member)

#3445

@durban (Contributor, Author) commented Feb 21, 2023

The CI failure seems unrelated (IOAppSpec).

@djspiewak (Member) commented Feb 21, 2023

> The CI failure seems unrelated (IOAppSpec).

Yeah we have some sneaky-flaky tests right now. Need to fix.

@djspiewak (Member)

Okay so thanks to @vasilmkd's creative experimentation, we have really good evidence that this bug is indeed happening in #3392. I have a couple EC2 instances churning on reproductions with various patches, but we're pretty certain this is the problem. Outstanding work, @durban!

djspiewak previously approved these changes Feb 21, 2023
@djspiewak (Member)

I swapped the main method for an automated test. It actually wasn't that hard to reproduce in this fashion, though notably, if you formulate the whole thing as an IO, it doesn't work.

@djspiewak marked this pull request as ready for review on February 21, 2023, 02:29
@djspiewak merged commit 5fada62 into typelevel:series/3.4.x on Feb 21, 2023
@CremboC commented Mar 17, 2023

We're pretty sure this bug caused many of our production services to constantly freeze up. The services were fs2-kafka consumers, but notably they would only freeze up after we implemented groupWithin and commitBatch logic. The symptom was a sudden drop to 0% CPU, with no error raised whatsoever. Even Kafka itself reported the consumer as "healthy" (implying heartbeats were still being sent).

We've deployed this CE version to multiple consumers and will report back.

@djspiewak (Member)

@CremboC That definitely sounds like a smoking gun to me! Please do report back with your experiences, positive or (even more importantly!) negative.

@CremboC commented Mar 20, 2023

@djspiewak Unfortunately, it seems like it happened again, even with 3.4.8. It just stopped consuming from Kafka and didn't recover (although I have implemented an automated restart for when it gets stuck, so maybe it would have recovered..?)

@djspiewak (Member)

Concerning. There may be another issue here. Can you open a new ticket? The PR definitely fixes this reproducer (and a legit bug). I think the first step would be to try to get a fiber dump while your consumer is hung.

Successfully merging this pull request may close these issues: CPU starvation causes problems with scheduling fairness (compared to CE2)