Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The MPSC channel has a deadlock #6

Closed
mratsim opened this issue Nov 17, 2019 · 5 comments
Closed

The MPSC channel has a deadlock #6

mratsim opened this issue Nov 17, 2019 · 5 comments

Comments

@mratsim
Copy link
Owner

mratsim commented Nov 17, 2019

The MPSC channel for StealRequest has a deadlock

Reproducible with 5 workers on Fibonacci.

Also it might be worth it to move to a lock-free design even though the lock-based design is not a bottleneck.

@mratsim
Copy link
Owner Author

mratsim commented Nov 24, 2019

Not sure what caused the deadlock as it's just a streamlined implementation of the legacy channels, could it have been false sharing?
Anyway, unused after #21 and e41c927

@mratsim mratsim closed this as completed Nov 24, 2019
@aprell
Copy link

aprell commented Nov 24, 2019

Is this deadlock reproducible with the C version?

@mratsim
Copy link
Owner Author

mratsim commented Nov 24, 2019

No it's only my own channel reimplementation.

When I switched to the 1-to-1 reimplementation of your Nim channels (04c355d) it disappeared.

@mratsim
Copy link
Owner Author

mratsim commented Nov 26, 2019

So I manage to get it again while changing the implementation of the MPSC queue to yet another one lock-free queue (suitable for batching for my memory manager)

The main difference is that it uses compare-and-swap which are more heavy on cache lines so maybe behave worse under contention. Steal backoff might help here

It's probably a livelock and not a deadlock, it may be unrelated but it's very annoying nonetheless.

It happens reliably even on just a hello world

weave/weave/async.nim

Lines 138 to 151 in e3a1f29

proc display_int(x: int) =
stdout.write(x)
stdout.write(" - SUCCESS\n")
proc main() =
init(Runtime)
spawn display_int(123456)
spawn display_int(654321)
sync(Runtime)
exit(Runtime)
# main()

The main thread always get stucks in this loop in the runtime barrier

weave/weave/runtime.nim

Lines 124 to 138 in 9f4abc4

# 2. Run out-of-task, become a thief and help other threads
# to reach the barrier faster
debug: log("Worker %d: globalsync 2 - becoming a thief\n", myID())
trySteal(isOutOfTasks = true)
ascertain: myThefts().outstanding > 0
var task: Task
profile(idle):
while not recv(task, isOutOfTasks = true):
ascertain: myWorker().deque.isEmpty()
ascertain: myThefts().outstanding > 0
declineAll()
if localCtx.runtimeIsQuiescent:
# Goto breaks profiling, but the runtime is still idle
break EmptyLocalQueue

adding some printing in declining or the steal request relay (or uncommenting the debugTermination log) will unstuck the runtime for a 20% perf loss :P.

proc declineAll*() =
var req: StealRequest
profile_stop(idle)
if recv(req):
if req.thiefID == myID() and req.state == Working:
req.state = Stealing
decline(req)
profile_start(idle)

weave/weave/thieves.nim

Lines 96 to 108 in 9f4abc4

proc findVictimAndRelaySteal*(req: sink StealRequest) =
# Note:
# Nim manual guarantees left-to-right function evaluation.
# Hence in the following:
# `req.findVictim().relaySteal(req)`
# findVictim request update should be done before relaySteal
#
# but C and C++ do not provides that guarantee
# and debugging that in a multithreading runtime
# would probably be very painful.
let target = findVictim(req)
debugTermination: log("Worker %d: relay steal request to %d from %d\n", myID(), target, req.thiefID)
target.relaySteal(req)

@mratsim
Copy link
Owner Author

mratsim commented Nov 26, 2019

False alarm, it's my queue that has an issue, I can reproduce the bug in an isolated bench.

I think I have an ABA problem with the producers' side using exchange and the consumer side using compare-exchange.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants