PR #32599 introduced deadlocks #35441
Comments
You could also try building Julia with a thread sanitizer. That way, if it's a data race, you might be more likely to see it.
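For context, a thread sanitizer targets races like the following. This is only an illustrative Julia snippet of an unsynchronized update, not the bug suspected in #32599:

```julia
# Illustration only (not the bug in #32599): an unsynchronized read-modify-write
# on shared state from multiple threads -- the kind of data race a thread
# sanitizer is designed to flag.
function racy_sum(n)
    total = Ref(0)
    Threads.@threads for i in 1:n
        total[] += i   # no lock or atomic here: this is a data race
    end
    return total[]
end

# With JULIA_NUM_THREADS > 1 and large n, this usually differs from sum(1:n).
racy_sum(1_000_000)
```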
Do you mean #27167? I thought it's still WIP.
Or I can just put
Paging @vchuravy for the latest work on tsan.
I just tried this anyway. But
failed with
I got it to the point where one could sanitize user code, but then got stuck on making it user-friendly. Rebasing that PR shouldn't be that bad, especially since the ASAN fixes have since landed.
We went through it in person; it's just very hard to imagine all the possible cases and interactions. Thanks for your help in debugging this.
Sorry, I should've realized that reviews and discussions can happen behind the scenes, even when they're not visible on GitHub. I can also imagine that it's very hard to catch multithreading bugs in a review. I hope something like TSAN can help us here.
Fixed by #36785.
I think it is very likely that PR #32599 introduced a bug that causes deadlocks. I have two MWEs below.
Since these MWEs require running for a very long time and there is a non-negligible chance that they trigger the segfault fixed by #35387, I back-ported #35387 to the versions before and after #32599:
- pre32599-pr35387 (tkf@247e385): #35387 ([LateGCLowering] Fix skipped Select lifting) merged onto the commit just before #32599 (threads: further optimize scheduler based on #32551) was merged
- post32599-pr35387 (tkf@eff0664): #35387 ([LateGCLowering] Fix skipped Select lifting) merged onto the commit where #32599 (threads: further optimize scheduler based on #32551) was merged
The MWEs below cause a deadlock in post32599-pr35387 and master, but not in pre32599-pr35387 (where the MWEs run until the end).
(I used JULIA_NUM_THREADS=13 when running the following MWEs. The specific number of threads needed to invoke the deadlock seems to differ between machines; @chriselrod seems to need only JULIA_NUM_THREADS=8.)
MWE 1
This snippet is based on the one in #35387 (comment) by @chriselrod. On my machine, a single @benchmark tproduct_fm($x) didn't cause the deadlock, so I put it in a loop. In the first run, it caused a deadlock after printing i = 84.
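The original snippet is not reproduced in this capture. Roughly, the stress loop has the following shape; this is only a sketch, where tproduct_fm is a placeholder standing in for the threaded function from @chriselrod's snippet, and the array size and iteration count are illustrative:

```julia
# Sketch only: tproduct_fm is a placeholder for the threaded function from the
# #35387 snippet, not its original definition.
using BenchmarkTools

function tproduct_fm(x)
    # split the work into one chunk per thread and sum partial results
    n = Threads.nthreads()
    chunks = Iterators.partition(eachindex(x), cld(length(x), n))
    tasks = [Threads.@spawn(sum(view(x, c))) for c in chunks]
    return sum(fetch, tasks)
end

x = rand(10_000)
for i in 1:200
    @show i
    @benchmark tproduct_fm($x)   # repeated benchmarking is what eventually hangs
end
```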
Example backtrace after Ctrl-C
MWE 2
This is the MWE I reported in #35341. In the first run, it caused a deadlock after printing seed = 287.
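The actual MWE lives in #35341 and is not reproduced here; the following is only a hedged sketch of its general shape (a seeded, threaded workload run in a loop), with placeholder names and sizes:

```julia
# Sketch only: the real MWE is in #35341; this just shows the shape of the
# stress loop (a seeded, threaded workload repeated until it hangs).
using Random

function threaded_work(seed)
    rng = MersenneTwister(seed)
    data = [rand(rng, 1_000) for _ in 1:Threads.nthreads()]  # per-task inputs
    @sync for d in data
        Threads.@spawn sum(d)   # placeholder workload, one task per chunk
    end
end

for seed in 1:1_000
    @show seed
    threaded_work(seed)
end
```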
Example backtrace after Ctrl-C
How to solve this?
The next step for me is to try rr, as it helped a lot in fixing one of the bugs I reported in #35341. Meanwhile, it'd be nice if core devs tried reproducing the deadlock on their machines. I know @Keno tried doing this, but I don't know if he tried different values of JULIA_NUM_THREADS. I'll also try to find some relationship between JULIA_NUM_THREADS and the average time/probability of the deadlock (e.g., does it have to be within a "hotspot", or does it just have to be large enough?); a sketch of such a sweep is below.
Also, it'd be nice if people familiar with the threading internals could review #32599 again. I think it's fair to say that this is not the most thoroughly reviewed PR. (Needless to say, this is not a complaint; I understand there are only very few people who can review it, and I imagine they are very busy.)
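For the JULIA_NUM_THREADS sweep mentioned above, something like the following driver could work. This is only a sketch: mwe1.jl, the thread counts, and the 600-second timeout are placeholders, not from the issue.

```julia
# Hypothetical driver (not from the issue): run an MWE script under different
# JULIA_NUM_THREADS values and treat anything exceeding a timeout as a hang.
for nthreads in (4, 8, 13, 16)
    env = Dict(ENV)
    env["JULIA_NUM_THREADS"] = string(nthreads)
    proc = run(setenv(`julia mwe1.jl`, env); wait = false)
    t0 = time()
    if timedwait(() -> process_exited(proc), 600.0) === :timed_out
        kill(proc)
        println("nthreads = $nthreads: hung (killed after 600 s)")
    else
        println("nthreads = $nthreads: finished in $(round(time() - t0, digits = 1)) s")
    end
end
```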
Finally, I think it makes sense to take this bug into account in the release strategy for 1.5. Maybe revert #32599 if it cannot be fixed before the feature freeze? Or, in the 1.5-RC announcement, ask people with multi-threaded code bases to stress-test with the RC?