Occasional segfaults when running with @threads 3 #45196

Closed · dpinol opened this issue May 5, 2022 · 7 comments

dpinol commented May 5, 2022

This is a follow-up from #44460 (comment).

My real application often crashes when executed normally. Once I got this error (without valgrind):

signal (11): Segmentation fault
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
jl_gc_alloc at /home/dani/dev/julia/julia/usr/bin/../lib/libjulia-internal.so.1 (unknown line)
unknown function (ip: 0x7fab2a7537d4)
ijl_process_events at /home/dani/dev/julia/julia/src/jl_uv.c:210
ijl_task_get_next at /home/dani/dev/julia/julia/src/partr.c:420
jl_array_grow_at_end at /home/dani/dev/julia/julia/src/array.c:893
_growend! at ./array.jl:1006 [inlined]
push! at ./array.jl:1053 [inlined]
.....
#1261#threadsfor_fun#248 at ./threadingconstructs.jl:84
#1261#threadsfor_fun at ./threadingconstructs.jl:52 [inlined]
#1 at ./threadingconstructs.jl:30
unknown function (ip: 0x7faa6da9bacf)
jl_mutex_unlock at /home/dani/dev/julia/julia/src/julia_locks.h:129 [inlined]
ijl_process_events at /home/dani/dev/julia/julia/src/jl_uv.c:217
ijl_task_get_next at /home/dani/dev/julia/julia/src/partr.c:420
ijl_task_get_next at /home/dani/dev/julia/julia/src/partr.c:420
jl_apply at /home/dani/dev/julia/julia/src/julia.h:1838 [inlined]
start_task at /home/dani/dev/julia/julia/src/task.c:931
Allocations: 46370570 (Pool: 46340026; Big: 30544); GC: 64

The MWE below (which just creates arrays from threads) never crashes when run normally or under rr.

println("Run $VERSION with $(Threads.nthreads()) threads")
for i in 1:100_000_0
    i % 1000 == 0 && @info "i" i
    Threads.@threads for t in 1:30
        a = Float32[]
        for i in 1:rand(1:2000)
            push!(a, 0.3f0)
        end
    end
end

However, it always crashes quickly under valgrind. I could not run it with Julia 1.9 due to this, so I used Julia 1.8.3. I use a sysimage built with cpu_target=generic.

valgrind --max-stackframe=115947807968  --smc-check=all-non-file --suppressions=$HOME/dev/julia/julia1.8/contrib/valgrind-julia.supp julia --sysimage=$HOME/Desktop/crash-gym/valgrind-sysimg-183.so  --threads=6 -O3 test-profile/threads-push-crash.jl

I get different error reports from run to run, but it usually ends with:

==538524== Thread 7:
==538524== Invalid read of size 8
==538524==    at 0x55A0C9A: jl_gc_state_set (julia_threads.h:340)
==538524==    by 0x55A0C9A: ijl_task_get_next (partr.c:593)
==538524==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Once I got these more informative errors about mutexes and the GC:

==538524== Thread 1:
==538524== Conditional jump or move depends on uninitialised value(s)
==538524==    at 0x558FA24: jl_mutex_trylock_nogc (julia_locks.h:93)
==538524==    by 0x558FA24: jl_mutex_trylock (julia_locks.h:106)
==538524==    by 0x558FA24: ijl_process_events (jl_uv.c:207)
==538524==    by 0x11BA768C: process_events; (libuv.jl:104)
==538524==    by 0x11BA768C: julia_wait_42555 (task.jl:930)
==538524==    by 0x11BA6933: julia_wait_42528 (condition.jl:124)
==538524==    by 0x11BA6A75: jfptr_wait_42529 (in /home/dani/Desktop/crash-gym/valgrind-sysimg-183.so)
==538524==    by 0x554CB69: _jl_invoke (gf.c:2358)
==538524==    by 0x554CB69: ijl_apply_generic (gf.c:2540)
==538524==    by 0x119B195B: julia__wait_27387 (task.jl:304)
==538524==    by 0x558F6A0C: ???
==538524==    by 0x558F6BB4: ???
==538524==    by 0x558F623F: ???
==538524==    by 0x558DB48: jl_toplevel_eval_flex (toplevel.c:897)
==538524==    by 0x558DA3D: jl_toplevel_eval_flex (toplevel.c:850)
==538524==    by 0x558EBA9: ijl_toplevel_eval_in (toplevel.c:965)
==538524== 
==538524== Conditional jump or move depends on uninitialised value(s)
==538524==    at 0x558FA2D: jl_mutex_trylock_nogc (julia_locks.h:97)
==538524==    by 0x558FA2D: jl_mutex_trylock (julia_locks.h:106)
==538524==    by 0x558FA2D: ijl_process_events (jl_uv.c:207)
==538524==    by 0x11BA768C: process_events; (libuv.jl:104)
==538524==    by 0x11BA768C: julia_wait_42555 (task.jl:930)
==538524==    by 0x11BA6933: julia_wait_42528 (condition.jl:124)
==538524==    by 0x11BA6A75: jfptr_wait_42529 (in /home/dani/Desktop/crash-gym/valgrind-sysimg-183.so)
==538524==    by 0x554CB69: _jl_invoke (gf.c:2358)
==538524==    by 0x554CB69: ijl_apply_generic (gf.c:2540)
==538524==    by 0x119B195B: julia__wait_27387 (task.jl:304)
==538524==    by 0x558F6A0C: ???
==538524==    by 0x558F6BB4: ???
==538524==    by 0x558F623F: ???
==538524==    by 0x558DB48: jl_toplevel_eval_flex (toplevel.c:897)
==538524==    by 0x558DA3D: jl_toplevel_eval_flex (toplevel.c:850)
==538524==    by 0x558EBA9: ijl_toplevel_eval_in (toplevel.c:965)
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x55A8939: gc_read_stack (gc.c:1682)
==538524==    by 0x55A8939: gc_mark_loop (gc.c:2711)
==538524==    by 0x55AB4C7: _jl_gc_collect (gc.c:3078)
==538524==    by 0x55ACD7B: ijl_gc_collect (gc.c:3307)
==538524==    by 0x55ADCC2: maybe_collect (gc.c:884)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_inner (gc.c:949)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_noinline (gc.c:988)
==538524==    by 0x55ADCC2: jl_gc_alloc_ (julia_internal.h:373)
==538524==    by 0x55ADCC2: jl_gc_alloc (gc.c:3352)
==538524==    by 0x5572979: jl_gc_alloc_buf (julia_internal.h:403)
==538524==    by 0x5572979: array_resize_buffer (array.c:698)
==538524==    by 0x55745F6: jl_array_grow_at_end (array.c:893)
==538524==    by 0x55745F6: ijl_array_grow_end (array.c:955)
==538524==    by 0x558F6F3A: ???
==538524==    by 0x558F7128: ???
==538524==    by 0x556F9B0: jl_apply (julia.h:1831)
==538524==    by 0x556F9B0: start_task (task.c:931)
==538524==  Address 0xa8253a50 is on thread 1's stack
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x55A6981: gc_read_stack (gc.c:1682)
==538524==    by 0x55A6981: gc_mark_loop (gc.c:2330)
==538524==    by 0x55AB4C7: _jl_gc_collect (gc.c:3078)
==538524==    by 0x55ACD7B: ijl_gc_collect (gc.c:3307)
==538524==    by 0x55ADCC2: maybe_collect (gc.c:884)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_inner (gc.c:949)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_noinline (gc.c:988)
==538524==    by 0x55ADCC2: jl_gc_alloc_ (julia_internal.h:373)
==538524==    by 0x55ADCC2: jl_gc_alloc (gc.c:3352)
==538524==    by 0x5572979: jl_gc_alloc_buf (julia_internal.h:403)
==538524==    by 0x5572979: array_resize_buffer (array.c:698)
==538524==    by 0x55745F6: jl_array_grow_at_end (array.c:893)
==538524==    by 0x55745F6: ijl_array_grow_end (array.c:955)
==538524==    by 0x558F6F3A: ???
==538524==    by 0x558F7128: ???
==538524==    by 0x556F9B0: jl_apply (julia.h:1831)
==538524==    by 0x556F9B0: start_task (task.c:931)
==538524==  Address 0xa8253a60 is on thread 1's stack
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x55A87FB: gc_read_stack (gc.c:1682)
==538524==    by 0x55A87FB: gc_mark_loop (gc.c:2345)
==538524==    by 0x55AB4C7: _jl_gc_collect (gc.c:3078)
==538524==    by 0x55ACD7B: ijl_gc_collect (gc.c:3307)
==538524==    by 0x55ADCC2: maybe_collect (gc.c:884)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_inner (gc.c:949)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_noinline (gc.c:988)
==538524==    by 0x55ADCC2: jl_gc_alloc_ (julia_internal.h:373)
==538524==    by 0x55ADCC2: jl_gc_alloc (gc.c:3352)
==538524==    by 0x5572979: jl_gc_alloc_buf (julia_internal.h:403)
==538524==    by 0x5572979: array_resize_buffer (array.c:698)
==538524==    by 0x55745F6: jl_array_grow_at_end (array.c:893)
==538524==    by 0x55745F6: ijl_array_grow_end (array.c:955)
==538524==    by 0x558F6F3A: ???
==538524==    by 0x558F7128: ???
==538524==    by 0x556F9B0: jl_apply (julia.h:1831)
==538524==    by 0x556F9B0: start_task (task.c:931)
==538524==  Address 0xa8253a58 is on thread 1's stack
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x55A8826: gc_mark_loop (gc.c:2350)
==538524==    by 0x55AB4C7: _jl_gc_collect (gc.c:3078)
==538524==    by 0x55ACD7B: ijl_gc_collect (gc.c:3307)
==538524==    by 0x55ADCC2: maybe_collect (gc.c:884)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_inner (gc.c:949)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_noinline (gc.c:988)
==538524==    by 0x55ADCC2: jl_gc_alloc_ (julia_internal.h:373)
==538524==    by 0x55ADCC2: jl_gc_alloc (gc.c:3352)
==538524==    by 0x5572979: jl_gc_alloc_buf (julia_internal.h:403)
==538524==    by 0x5572979: array_resize_buffer (array.c:698)
==538524==    by 0x55745F6: jl_array_grow_at_end (array.c:893)
==538524==    by 0x55745F6: ijl_array_grow_end (array.c:955)
==538524==    by 0x558F6F3A: ???
==538524==    by 0x558F7128: ???
==538524==    by 0x556F9B0: jl_apply (julia.h:1831)
==538524==    by 0x556F9B0: start_task (task.c:931)
==538524==  Address 0xa8253a90 is on thread 1's stack
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x55A69C8: gc_read_stack (gc.c:1682)
==538524==    by 0x55A69C8: gc_mark_loop (gc.c:2355)
==538524==    by 0x55AB4C7: _jl_gc_collect (gc.c:3078)
==538524==    by 0x55ACD7B: ijl_gc_collect (gc.c:3307)
==538524==    by 0x55ADCC2: maybe_collect (gc.c:884)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_inner (gc.c:949)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_noinline (gc.c:988)
==538524==    by 0x55ADCC2: jl_gc_alloc_ (julia_internal.h:373)
==538524==    by 0x55ADCC2: jl_gc_alloc (gc.c:3352)
==538524==    by 0x5572979: jl_gc_alloc_buf (julia_internal.h:403)
==538524==    by 0x5572979: array_resize_buffer (array.c:698)
==538524==    by 0x55745F6: jl_array_grow_at_end (array.c:893)
==538524==    by 0x55745F6: ijl_array_grow_end (array.c:955)
==538524==    by 0x558F6F3A: ???
==538524==    by 0x558F7128: ???
==538524==    by 0x556F9B0: jl_apply (julia.h:1831)
==538524==    by 0x556F9B0: start_task (task.c:931)
==538524==  Address 0xa8253a98 is on thread 1's stack
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x55A69F2: gc_mark_loop (gc.c:2361)
==538524==    by 0x55AB4C7: _jl_gc_collect (gc.c:3078)
==538524==    by 0x55ACD7B: ijl_gc_collect (gc.c:3307)
==538524==    by 0x55ADCC2: maybe_collect (gc.c:884)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_inner (gc.c:949)
==538524==    by 0x55ADCC2: jl_gc_big_alloc_noinline (gc.c:988)
==538524==    by 0x55ADCC2: jl_gc_alloc_ (julia_internal.h:373)
==538524==    by 0x55ADCC2: jl_gc_alloc (gc.c:3352)
==538524==    by 0x5572979: jl_gc_alloc_buf (julia_internal.h:403)
==538524==    by 0x5572979: array_resize_buffer (array.c:698)
==538524==    by 0x55745F6: jl_array_grow_at_end (array.c:893)
==538524==    by 0x55745F6: ijl_array_grow_end (array.c:955)
==538524==    by 0x558F6F3A: ???
==538524==    by 0x558F7128: ???
==538524==    by 0x556F9B0: jl_apply (julia.h:1831)
==538524==    by 0x556F9B0: start_task (task.c:931)
==538524==  Address 0xa8253b10 is on thread 1's stack
==538524== 
==538524== Thread 3:
==538524== Use of uninitialised value of size 8
==538524==    at 0x490A190: futex_fatal_error (futex-internal.h:87)
==538524==    by 0x490A190: futex_wait (futex-internal.h:162)
==538524==    by 0x490A190: __lll_lock_wait (lowlevellock.c:50)
==538524== 
==538524== Use of uninitialised value of size 8
==538524==    at 0x4910138: __pthread_mutex_cond_lock (pthread_mutex_lock.c:89)
==538524== 
==538524== Use of uninitialised value of size 8
==538524==    at 0x491007E: __pthread_mutex_cond_lock (pthread_mutex_lock.c:165)
==538524== 
==538524== Use of uninitialised value of size 8
==538524==    at 0x4910089: __pthread_mutex_cond_lock (pthread_mutex_lock.c:173)
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x490A190: futex_fatal_error (futex-internal.h:87)
==538524==    by 0x490A190: futex_wait (futex-internal.h:162)
==538524==    by 0x490A190: __lll_lock_wait (lowlevellock.c:50)
==538524==  Address 0xa7e538d8 is in a rw- anonymous segment
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x4910133: __pthread_mutex_cond_lock (pthread_mutex_lock.c:89)
==538524==  Address 0xa7e538e8 is in a rw- anonymous segment
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x4910088: __pthread_mutex_cond_lock (pthread_mutex_lock.c:173)
==538524==  Address 0xa7e538f0 is in a rw- anonymous segment
==538524== 
==538524== Invalid read of size 8
==538524==    at 0x4910089: __pthread_mutex_cond_lock (pthread_mutex_lock.c:173)
==538524==  Address 0xa7e538f8 is in a rw- anonymous segment
==538524== 
==538524== Thread 7:
==538524== Invalid read of size 8
==538524==    at 0x55A0C9A: jl_gc_state_set (julia_threads.h:340)
==538524==    by 0x55A0C9A: ijl_task_get_next (partr.c:593)
==538524==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==538524== 

signal (11): Segmentation fault
in expression starting at test-profile/threads-push-crash.jl:13
==538524== 
==538524== Process terminating with default action of signal 11 (SIGSEGV)
==538524==  Access not within mapped region at address 0x0
==538524==    at 0x55A0C9A: jl_gc_state_set (julia_threads.h:340)
==538524==    by 0x55A0C9A: ijl_task_get_next (partr.c:593)
==538524==  If you believe this happened as a result of a stack
==538524==  overflow in your program's main thread (unlikely but
==538524==  possible), you can try to increase the size of the
==538524==  main thread stack using the --main-stacksize= flag.
==538524==  The main thread stack size used in this run was 8388608.
dpinol (Author) commented May 6, 2022

For the record, with this code I can reproduce neither the crash nor the valgrind errors:

    NUM_THREADS = 30
    c = Channel()
    function worker()
        while true
            take!(c)
            a = Float32[]
            for i in 1:rand(1:2000)
                push!(a, 0.3f0)
            end
        end
    end
    tasks = [schedule(Task(worker)) for i in 1:NUM_THREADS]
    for i in 1:100_000_0
        i % 1000 == 0 && @info "i" i
        for t in 1:NUM_THREADS
            put!(c, nothing)
        end
    end

EDIT: The code above does not reproduce the crash because all the workers run on the same thread. With the code below I can also reproduce the issue. So the problem is not related to repeatedly creating threads, but to allocating memory on a spawned thread.

    function worker()
        while true
            a = Float64[]
            push!(a, 0.42)
        end
    end
    t = Threads.@spawn worker()
    wait(t)
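
(For anyone verifying the claim in the EDIT above: a minimal sketch, not from the original report, showing why the Channel version stayed on one thread. A task created with schedule(Task(f)) is sticky and runs on the thread that scheduled it, whereas Threads.@spawn lets the scheduler place it on any thread in the default pool. The report helper below is hypothetical.)

    # Sketch: compare where sticky vs. spawned tasks run.
    # Assumes Julia was started with multiple threads, e.g. --threads=6.
    report(label) = println(label, " ran on thread ", Threads.threadid())

    t1 = schedule(Task(() -> report("scheduled Task")))  # sticky: stays on the scheduling thread
    t2 = Threads.@spawn report("@spawn task")            # may run on any thread in the pool
    wait(t1); wait(t2)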

JeffBezanson (Member) commented

> My real application often crashes when executed normally. Once I got this error (without valgrind):

Have you tried it under rr? If you can catch the crash there and it still appears to be a bug in the Julia runtime, you may have to send me the trace privately.

dpinol (Author) commented May 6, 2022

@JeffBezanson The problem is that:

  • I only get the crash when running the code above from Python with pycall. I tried passing --bug-report=rr-local through pycall, but it doesn't work even if I patch pycall to accept this argument.
  • If I run the code above with plain Julia, it only crashes under valgrind. Should I send you the rr results even if it doesn't crash?

Thanks

dpinol (Author) commented May 9, 2022

Hi,
I found out that Julia only crashes if pygame is imported after importing julia. The MWE that causes the crash is a Python application that simply loads pygame after pyjulia (I get the same crash using PythonCall) and then allocates memory from a Julia thread. Swapping the two import lines avoids the issue:

from julia import Main
import pygame

Main.eval("""
    function worker()
        while true
            a = Float64[]
            push!(a, 0.42)
        end
    end
    t= Threads.@spawn worker()
    wait(t)
""")

JeffBezanson (Member) commented

Wow, interesting. Thanks for digging into that.

mkitti (Contributor) commented Feb 7, 2023

Does pygame do anything with signals?

vtjnash (Member) commented Feb 10, 2024

It is likely a PythonCall issue, since I think that package currently disables Julia's signal handling (signal handling is mandatory when using threads).
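
(If disabled signal handling is indeed the cause, a possible workaround is sketched below. It assumes the PYTHON_JULIACALL_HANDLE_SIGNALS option described in the PythonCall documentation is available in the installed version; check the docs for your release.)

import os
# Must be set before the first `import juliacall`, since juliacall reads
# its configuration at import time (per the PythonCall docs).
os.environ["PYTHON_JULIACALL_HANDLE_SIGNALS"] = "yes"
import juliacall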

vtjnash closed this as completed Feb 10, 2024