Reproducible Memory Corruption in atexit_hooks (on MacOS) leading to GC corruption and/or segfault
#49746
Not sure it's helpful or even related, but I've also seen the following on rare occasions, at termination:
Okay, here's something else weird, and it matches up with what you saw, @kpamnany! I ran the tests for about 100 iterations with GC disabled, and then printed the atexit_hooks:
julia> Base.atexit_hooks
374-element Vector{Union{Function, Type}}:
#3 (generic function with 1 method)
#3 (generic function with 1 method)
#3 (generic function with 1 method)
#3 (generic function with 1 method)
#3 (generic function with 1 method)
#3 (generic function with 1 method)
⋮
#undef
#undef
#undef
#undef
#undef
Somehow the last 5 elements are #undef:
julia> Base.atexit_hooks[end-6:end]
7-element Vector{Union{Function, Type}}:
#3 (generic function with 1 method)
#3 (generic function with 1 method)
#undef
#undef
#undef
#undef
#undef
Which shouldn't be possible, because the only code in the whole Julia codebase that modifies atexit_hooks is this line:
atexit(f::Function) = (pushfirst!(atexit_hooks, f); nothing)
So it should only be calling
And also, somehow now when I run GC, it is no longer marking the original
And indeed, when I finally exited the process, I see the same thing you saw:
atexit hook threw an error: UndefRefError()
getindex at ./essentials.jl:13 [inlined]
popfirst! at ./array.jl:1520 [inlined]
_atexit at ./initdefs.jl:381
jfptr__atexit_55947 at /Users/nathandaly/src/julia/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:3069
jl_apply at /Users/nathandaly/src/julia/src/./julia.h:1958 [inlined]
ijl_atexit_hook at /Users/nathandaly/src/julia/src/init.c:280
jl_repl_entrypoint at /Users/nathandaly/src/julia/src/jlapi.c:735
OH DUHHHHHHHH!!!!!!!!!!!!!!!!!!! It's not threadsafe, my friends!!! We're calling atexit from multiple threads, which is pushing into the vector without a lock 💡 duh! Phew. I can't believe how long that took us to figure out. PR incoming
Should we have a global lock around pushing to the vector?
Yeah, exactly Dilum. 👍 That's what I did in #49774. 👍
- atexit(f) mutates global shared state.
- atexit(f) can be called anytime by any thread.
- Accesses & mutations to global shared state must be locked if they can be accessed from multiple threads.
Fixes #49746
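As a rough illustration of the locking idea discussed above, here is a minimal sketch. It is not the code from #49774: `HOOKS`, `HOOKS_LOCK`, `register_hook`, `run_hooks`, and `register_concurrently` are hypothetical names standing in for `Base.atexit_hooks` and its surrounding functions, so the sketch can run without touching the real hook list.

```julia
# Hypothetical stand-ins for Base.atexit_hooks and a lock guarding it.
# These names are illustrative only and are not taken from the actual patch.
const HOOKS = Function[]
const HOOKS_LOCK = ReentrantLock()

# Register a hook. The lock serializes concurrent pushfirst! calls, so two
# threads can never resize the vector at the same time.
function register_hook(f::Function)
    lock(HOOKS_LOCK) do
        pushfirst!(HOOKS, f)
    end
    return nothing
end

# Run hooks in last-registered-first order. Each hook is removed while the
# lock is held but called after the lock is released, so a slow hook does
# not block other threads from registering.
function run_hooks()
    while true
        f = lock(HOOKS_LOCK) do
            isempty(HOOKS) ? nothing : popfirst!(HOOKS)
        end
        f === nothing && break
        try
            f()
        catch err
            @error "hook threw an error" exception = err
        end
    end
end

# Register from many tasks at once and report how many hooks were stored.
function register_concurrently(n::Integer)
    @sync for i in 1:n
        Threads.@spawn register_hook(() -> i)
    end
    return length(HOOKS)
end
```

Running this under several threads (for example `julia -t 4`) and calling `register_concurrently(200)` should leave exactly 200 assigned entries in `HOOKS`. Without the lock, concurrent `pushfirst!` calls on the same `Vector` are a data race, which is consistent with the `#undef` slots seen in the dump above.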
Quite possibly related to #43567.
Here is a small reproducer: https://github.com/NHDaly/ThreadingUtils.jl
Uploaded here: ThreadingUtils.zip
When running the tests for this small repro on macOS, I can reliably get a crash around 1/20 times. (This repro was distilled down from our real codebase, where the crash happens more like 1/5 times.)
We would love to get an rr trace, but we sadly haven't been able to repro on Linux yet 😢
We've seen crashes on both our internal build of 1.8.2, as well as on the latest master, built as of yesterday, e204e20.
To repro:
I eventually see crashes, such as:
or
or
(Sorry, the line numbers will be wrong, because I've been adding printlns for debugging. This is the line in gc_mark_outrefs):
julia/src/gc.c
Line 2569 in 5521212
We added some println debugging (in #49741), and we got some more information, which is what led us to the fact that the corruption is happening in the atexit_hooks object:
and it matches up here:
Note: 0x000000011f632d40 and parent 0x11f632d40.
So, somehow, one of the functions in the atexit_hooks vector is getting trashed. Here are some of the times it's managed to dump the vector via jl_:
or also:
In the original code, the atexit() do hooks were using a WeakRef to ensure that the hook itself isn't unnecessarily keeping the object alive. We've removed that from the version of the reproducer provided above, since it reproduces without the WeakRefs, so the WeakRefs themselves are probably a red herring.
That said, it seemed to reproduce slightly more frequently when WeakRefs were involved as well. The diff to put them back is here:
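For readers unfamiliar with the pattern, a WeakRef-based atexit hook might look roughly like the sketch below. This is not the reproducer's actual diff: the `Connection` type and the `make_cleanup_hook` and `register_cleanup` functions are hypothetical and only illustrate how a hook can avoid keeping its object alive.

```julia
# Hypothetical resource type, used only for illustration.
mutable struct Connection
    open::Bool
end

# Build the cleanup closure. It captures a WeakRef rather than the object
# itself, so the registered hook does not keep the object alive.
function make_cleanup_hook(conn::Connection)
    ref = WeakRef(conn)
    return function ()
        obj = ref.value   # `nothing` once the object has been collected
        if obj !== nothing
            obj.open = false
        end
        return nothing
    end
end

# Register the hook with the real Base.atexit.
register_cleanup(conn::Connection) = atexit(make_cleanup_hook(conn))
```

Each call to `register_cleanup` goes through `atexit`, so calling it from several threads exercises the same shared hook vector that the reproducer does.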
Phew! I'm not sure how actionable this is, but it's a pretty surprisingly small reproducer and it can surface a crash, so this should be looked into! I'm not sure who would be best positioned to investigate further. I'm happy to help if anyone wants to pair.
Thanks!