GC crashes on 1.10 with multithreaded code #52256
A lot of these seem to be coming from the code in `gc-debug.c`. We should probably audit which of those functions can actually be used with the multi-threaded GC.
When enabling threads it makes sense to have multiple GC threads too, and that is currently buggy. But does it make sense to have a single compute thread combined with many GC threads? If that configuration makes sense and is not buggy, could GC threads be disabled otherwise, so as not to block the release of 1.10? The fix for the multithreaded case could then be backported, or not...
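For reference, 1.10 exposes the number of GC threads as a runtime flag, so the configurations being discussed can be tried directly. A sketch assuming the `--gcthreads` flag; the script name is illustrative:

```sh
# Many compute threads but a single GC thread:
$ julia -t 4 --gcthreads=1 script.jl

# The inverse case raised above: one compute thread, several GC threads.
$ julia -t 1 --gcthreads=4 script.jl
```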
The code in gc-debug.c that I mentioned is not enabled unless you build Julia with debug flags (which the user above did). This code is not very thoroughly tested, since most people don't build with these flags, so it has probably just rotted, but fixing it should be in scope for a backported bugfix IMO.
Thank you for looking into this! So, as far as I understand, these crashes stem from a part of julia that is not usually built unless specifically required, so they should not affect people at large. The issue may be removed from the 1.10 milestone then.

However, I should stress that what I have been trying to tackle in the last few weeks is a crash that comes from a race condition and usually surfaces as a segfault in the GC, in a normal build of julia (on the backports-release-1.10 branch or master). It could be a true race condition coming from my code, so I am reluctant to open an issue as long as I haven't minimized it to the point where I am confident it is a julia bug; but minimizing it is tricky, since it only occurs every now and then, and the codebase is large. This is why I have been trying to use a debug build, with all assertions turned on, in the hope that whatever is causing the crash would surface more reliably (I also tried to build with TSAN, that's #51774). Apparently, this debug build is causing other unrelated errors. I think it would still be valuable to fix them for the release of 1.10, because trying to identify bugs possibly stemming from the multi-threaded GC is particularly difficult at the moment.

If you need me to give you more stacktraces or rr traces, or whatever else I can do to help, please let me know.
Julia can't protect against such bugs in general for non-GC-related code (nor can Rust, actually; it only protects against data races, not race conditions), but I think (almost) whatever you do regarding allocations should be fine, right? The GC is not in your hands, nor should you be able to corrupt it in any way, unless you write out of bounds. I think a heap corruption would be the only way. So do you still have the problem if you run with: julia --check-bounds=yes
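A sketch of that suggested run, borrowing the reporter's invocation (`reproducer.jl`, `-t 4`, `10000`) from the report below; the flag placement is the only addition:

```sh
# Check all array bounds, even inside @inbounds blocks, to rule out
# out-of-bounds writes as the source of heap corruption.
$ julia --check-bounds=yes -t 4 reproducer.jl 10000
```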
Perhaps, but in general if I am causing a race condition, the rule of thumb is that anything goes (it's literally undefined behavior), and I believe a GC corruption, or at least a crash stemming from the GC, is not unthinkable. In any case, I am indeed still seeing that crash with `--check-bounds=yes`.
Yes, it's a rule of thumb. UB means the so-called "nasal demons" are released, but no, not literally anything can happen... and I think this case is an exception:

Whether you allocate directly or indirectly, I don't think it should matter. You don't strictly allocate directly (with a dedicated syntax, e.g. malloc as in C), though you might indirectly. But even if your user code has a race condition, I think calling some Julia APIs should be safe. Likely not all of them (I think I can see race-condition issues with some), but [I'm not sure "reentrant" is the right term here, but at least] the allocation APIs of Julia are presumably thread-safe, and not just because the underlying libc implementation, which is sometimes (not always!) called, is: https://stackoverflow.com/questions/855763/is-malloc-thread-safe

I didn't look up whether free is thread-safe too, I suppose so, but it's not called from your threads anyway: it's called from the one GC thread (or, as of recently, from the many GC threads, so potentially all of them; that should still be safe, and if not it's a new bug, maybe what you are hitting?). [I'm not doubting you have a "double free or corruption (!prev)"; I mean I can see that potential bug in Julia. I'm saying it would be a bug in Julia that could and should(?) be fixed, and your code should be able to run without GC issues. You might still have other issues related to race conditions, so you might want to fix those anyway.]
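To make that claim concrete, here is a minimal allocation-only sketch (illustrative, not taken from this issue) that should be safe even in racy programs, because no user-visible state is shared between iterations:

```julia
using Base.Threads

# Each iteration allocates and touches only its own buffer; the only
# shared machinery is Julia's allocator and GC.
function alloc_only(n)
    @threads for i in 1:n
        buf = Vector{Float64}(undef, 1024)  # thread-local allocation
        fill!(buf, Float64(i))
        @assert sum(buf) == 1024.0 * i
    end
end

alloc_only(1_000)
```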
Are you on Windows? This might be OK elsewhere, could someone check? I don't know what Julia does in this case.
I did try your code in Julia (also 1.9.4 and 1.9.3) with 4 or more threads and 10000, or with: $ julia +1.10.0-rc1 -t 16 repr.jl 100000, but I saw no issue, which is good, though I didn't use a debug build; is that the only one failing for you? Does this change make sense to you, or does it get rid of the problem:
Do you have an idea of where the problem is? I thought it was related to Channel, and likely put! on it (or take!?).
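For illustration, a minimal sketch of the Channel pattern under suspicion; this is a hypothetical shape, not the actual reproducer.jl from this issue:

```julia
using Base.Threads

# Hypothetical producer/consumer shape: tasks exchanging work through
# a shared Channel via put!/take!.
function channel_roundtrip(n)
    ch = Channel{Int}(32)
    @sync begin
        @spawn for i in 1:n
            put!(ch, i)              # producer
        end
        @spawn for _ in 1:n
            take!(ch)                # consumer
        end
    end
end

channel_roundtrip(10_000)
```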
I'm pretty sure the allocation APIs of Julia are not thread-safe in the presence of race conditions, unfortunately, since the documentation states:

in addition to the entire https://docs.julialang.org/en/v1/manual/multi-threading/#Caveats paragraph. I would love for that guarantee to exist, but it's very complicated to implement in a systematic fashion, and anyway I think the performance implications are much too great to warrant that trade-off in the common case.
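To make the caveat concrete, a minimal sketch (not taken from this issue's reproducer) of the kind of user-level data race the linked paragraph warns about, next to a lock-protected variant:

```julia
using Base.Threads

# Unsafe: concurrent push! on a shared Vector is a data race; it can
# corrupt the array's internal state and crash later, e.g. inside the GC.
function racy(n)
    v = Int[]
    @threads for i in 1:n
        push!(v, i)          # data race on v
    end
    return length(v)
end

# Safe: the shared structure is guarded by a lock, as the manual advises.
function locked(n)
    v = Int[]
    lk = ReentrantLock()
    @threads for i in 1:n
        lock(lk) do
            push!(v, i)      # mutation serialized by the lock
        end
    end
    return length(v)
end
```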
No, this is on linux.
Indeed, the particular minimal code above only fails with the debug build, which is not too surprising since the errors mostly stem from `gc-debug.c`, which is only compiled in with debug flags. But let's not sidetrack this issue too much from the initial report, which focuses on the debug build. I'll post the stacktrace separately in a new issue once I have a small-ish reproducer!
It does not really make sense to me: adding an allocation (by making the struct mutable) should not make a difference here.
Unfortunately, it's the good old segfault with no stacktrace. I had managed to get some kind of stacktrace by looking at it in gdb, and I remember it started from some place in the GC (which, again, doesn't prove that the GC is buggy), but I don't have it at the moment. Anyway, let's keep that problem separate from the issue here, since it could well be my particular code that is at fault in that case, and I will open an issue in due time once I can be sure it is not.
Disabled by #48600
Setup
For the past few weeks I have been investigating a crash that appears in my multithreaded code starting from v1.10. My setup is the one reported in #52184, i.e. on commit 1ddd6da of the backports-release-1.10 branch, compiling with a Make.user made of:
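(The exact contents are not shown above; purely as an assumption, a typical assertion-enabled configuration would be:)

```make
# Hypothetical Make.user enabling C and LLVM assertions; the exact flags
# used in this report may differ.
FORCE_ASSERTIONS := 1
LLVM_ASSERTIONS := 1
```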
Compilation may fail at first (that's #52184) but usually it ends up compiling fine if you retry it a few times.
Once I have this "bug-aware" julia, I run this minimized example:

reproducer.jl (click to expand)

I use `-t 4` and give it `10000` as `ARGS`.

Failures
Executing the file multiple times yields different results: sometimes it just works, and sometimes it crashes. Unfortunately, it needs `--num-cores` set to something above 1 to be visible in `rr`, and I didn't manage to pass that argument through BugReporting, so I did not use the integrated `--bug-report=rr` flag of julia. Instead I simply recorded the execution with `rr` directly (a possible invocation is sketched after the list below).

So far I have seen the following kinds of crash (click to view the full output and the link to the `rr` trace when available):

- Assertion `!freedall` failed:
  https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-20-51-Liozou.tar.zst
- segfault in `gc_scrub_task`: I don't have an `rr` trace at the moment, I'll add a link when (if) I manage to obtain one.
- segfault in `realloc`, from `gc_scrub_record_task`: I don't have an `rr` trace at the moment, I'll add a link when (if) I manage to obtain one.
- double free or corruption:
  https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-12-25-Liozou.tar.zst
- corrupted size vs. prev_size: I don't have an `rr` trace at the moment, I'll add a link when (if) I manage to obtain one.
- corrupted size vs. prev_size while consolidating:
  https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-17-20-Liozou.tar.zst
- ...as well as this thing (is it a crash of `rr` itself?):
  https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-24-07-Liozou.tar.zst
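A sketch of the direct `rr` invocation referenced above, assuming `rr`'s `--num-cores` record option and the reproducer invocation from this report (the exact command used is not preserved here):

```sh
# Record a 4-thread run; per the report, --num-cores must be above 1
# for the crash to be visible under rr.
$ rr record --num-cores=4 ./julia -t 4 reproducer.jl 10000

# Replay the most recent recording for inspection.
$ rr replay
```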
I'm opening this as a separate issue instead of adding it to #52184 because I don't know whether the causes are the same, and it's a different context that does not affect the build of julia itself.