GC crashes on 1.10 with multithreaded code #52256

Closed
Liozou opened this issue Nov 21, 2023 · 11 comments
Labels
bug, GC, rr trace included

Comments

@Liozou
Member

Liozou commented Nov 21, 2023

Setup

For a few weeks I have been investigating a crash that appears in my multithreaded code starting from v1.10. My setup is the one reported in #52184, i.e. commit 1ddd6da of the backports-release-1.10 branch, compiled with a Make.user containing

FORCE_ASSERTIONS=1
LLVM_ASSERTIONS=1
override WITH_GC_VERIFY=1
override WITH_GC_DEBUG_ENV=1

Compilation may fail at first (that's #52184) but usually it ends up compiling fine if you retry it a few times.

Once I have this "bug-aware" julia, I run this minimized example:

reproducer.jl:
using Base.Threads

struct LoadBalancer{T}
    channel::Channel{T}
    tasks::Vector{Task}
    event::Event
end

function LoadBalancer{T}(f, n::Int) where T
    event = Event(true)
    channel = Channel{T}(Inf)
    tasks = [errormonitor(@spawn while true
        x = take!($channel)
        $f(x)
        notify($event)
    end) for _ in 1:n]
    LoadBalancer{T}(channel, tasks, event)
end

Base.put!(lb::LoadBalancer, x) = put!(lb.channel, x)

function Base.wait(lb::LoadBalancer)
    while !isempty(lb.channel)
        wait(lb.event)
    end
end


function run(setup::Vector{Float64}, lb::LoadBalancer)
    for _ in 1:10
        pos = copy(setup)
        put!(lb, pos)
        yield()
    end
    wait(lb)
end


function main(ARGS)
    for _ in 1:parse(Int, ARGS[1])
        setup = rand(300)
        lb = LoadBalancer{Vector{Float64}}(_ -> Float64[rand()], 9)
        run(setup, lb)
    end
end

main(ARGS)

I use -t 4 and give it 10000 as ARGS.
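
Since the crash is intermittent, it can help to run the reproducer in a loop and count how many runs exit abnormally. The following is only a minimal sketch: `count_failures` is a hypothetical helper, not part of Julia or BugReporting, and the command in the example comment uses placeholder paths.

```julia
# Hypothetical helper: run `cmd` `n` times and count the runs that
# exit with a nonzero status or are killed by a signal.
function count_failures(cmd::Cmd, n::Integer)
    failures = 0
    for _ in 1:n
        # `ignorestatus` keeps `run` from throwing when the process fails;
        # the output is discarded because only the exit status matters here.
        proc = run(pipeline(ignorestatus(cmd); stdout=devnull, stderr=devnull))
        success(proc) || (failures += 1)
    end
    return failures
end

# Example (paths are placeholders):
# count_failures(`julia --startup-file=no -t4 reproducer.jl 10000`, 20)
```

A count above zero only says that some runs failed. The output of a failing run is still needed to tell the different kinds of crash apart.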

Failures

Running the file several times gives different results: sometimes it just works, and sometimes it crashes. Unfortunately, the crash only shows up in rr when --num-cores is set above 1, and I did not manage to pass that argument through BugReporting, so I did not use julia's integrated --bug-report=rr flag. Instead, I recorded the execution directly with

rr record --chaos --num-cores=4 /home/liozou/julia-1.10-bugtrack/usr/bin/julia-debug --startup-file=no -t4 ~/Desktop/reproducer.jl 10000

So far I have seen the following kinds of crash (each with its full output and, when available, a link to the rr trace):

Assertion `!freedall` failed

https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-20-51-Liozou.tar.zst

Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
julia-debug: /home/liozou/julia-1.10-bugtrack/src/gc.c:1442: gc_sweep_page: Assertion `!freedall' failed.

[7670] signal (6.-6): Aborted
in expression starting at /home/liozou/Desktop/reproducer.jl:47
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fdf9ffa571a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gc_sweep_page at /home/liozou/julia-1.10-bugtrack/src/gc.c:1442
gc_sweep_pool_page at /home/liozou/julia-1.10-bugtrack/src/gc.c:1499
gc_sweep_pool at /home/liozou/julia-1.10-bugtrack/src/gc.c:1585
_jl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3309
ijl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3451
maybe_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:935
jl_gc_big_alloc_inner at /home/liozou/julia-1.10-bugtrack/src/gc.c:1006
jl_gc_big_alloc_noinline at /home/liozou/julia-1.10-bugtrack/src/gc.c:1043
jl_gc_alloc_ at /home/liozou/julia-1.10-bugtrack/src/julia_internal.h:480
jl_gc_alloc at /home/liozou/julia-1.10-bugtrack/src/gc.c:3503
_new_array_ at /home/liozou/julia-1.10-bugtrack/src/array.c:134
ijl_array_copy at /home/liozou/julia-1.10-bugtrack/src/array.c:1181
copy at ./array.jl:411 [inlined]
run at /home/liozou/Desktop/reproducer.jl:31
main at /home/liozou/Desktop/reproducer.jl:43
unknown function (ip: 0x7fdf9fb448e5)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2892
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
do_call at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:125
eval_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:222
eval_stmt_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:173
eval_body at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:616
jl_interpret_toplevel_thunk at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:774
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:934
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:877
ijl_toplevel_eval at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:943
ijl_toplevel_eval_in at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:985
eval at ./boot.jl:383 [inlined]
include_string at ./loading.jl:2070
jl_fptr_args at /home/liozou/julia-1.10-bugtrack/src/gf.c:2534
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46681 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_83011 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
true_main at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:582
jl_repl_entrypoint at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:731
jl_load_repl at /home/liozou/julia-1.10-bugtrack/cli/loader_lib.c:568
main at /home/liozou/julia-1.10-bugtrack/cli/loader_exe.c:58
unknown function (ip: 0x7fdf9ffa6d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/liozou/julia-1.10-bugtrack/usr/bin/julia-debug (unknown line)
Allocations: 760384 (Pool: 0; Other: 760384); GC: 2
Allocations: 760384 (Pool: 0; Other: 760384); GC: 2
Aborted

segfault in gc_scrub_task

I don't have an rr trace at the moment, I'll change this to a link when (if) I manage to obtain one.

Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC

[11696] signal (11.1): Segmentation fault
in expression starting at /home/liozou/Desktop/reproducer.jl:47
gc_scrub_task at /home/liozou/julia-1.10-bugtrack/src/gc-debug.c:561
gc_scrub at /home/liozou/julia-1.10-bugtrack/src/gc-debug.c:590
_jl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3307
ijl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3451
maybe_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:935
jl_gc_pool_alloc_inner at /home/liozou/julia-1.10-bugtrack/src/gc.c:1291
ijl_gc_pool_alloc at /home/liozou/julia-1.10-bugtrack/src/gc.c:1339
IntrusiveLinkedList at ./linked_list.jl:7 [inlined]
GenericCondition at ./condition.jl:67 [inlined]
Task at ./task.jl:5 [inlined]
Task at ./task.jl:5 [inlined]
#1 at ./threadingconstructs.jl:439 [inlined]
#1 at ./none:0
iterate at ./generator.jl:47 [inlined]
collect_to! at ./array.jl:892 [inlined]
collect_to_with_first! at ./array.jl:870 [inlined]
collect at ./array.jl:844
LoadBalancer at /home/liozou/Desktop/reproducer.jl:12 [inlined]
main at /home/liozou/Desktop/reproducer.jl:42
unknown function (ip: 0x7fed0b85c8d5)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2892
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
do_call at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:125
eval_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:222
eval_stmt_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:173
eval_body at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:616
jl_interpret_toplevel_thunk at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:774
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:934
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:877
ijl_toplevel_eval at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:943
ijl_toplevel_eval_in at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:985
eval at ./boot.jl:383 [inlined]
include_string at ./loading.jl:2070
jl_fptr_args at /home/liozou/julia-1.10-bugtrack/src/gf.c:2534
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46681 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_83011 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
true_main at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:582
jl_repl_entrypoint at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:731
jl_load_repl at /home/liozou/julia-1.10-bugtrack/cli/loader_lib.c:568
main at /home/liozou/julia-1.10-bugtrack/cli/loader_exe.c:58
unknown function (ip: 0x7fed23259d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/liozou/julia-1.10-bugtrack/usr/bin/julia-debug (unknown line)
Allocations: 1522107 (Pool: 0; Other: 1522107); GC: 7
Allocations: 1522107 (Pool: 0; Other: 1522107); GC: 7
Segmentation fault (core dumped)

segfault in realloc, from gc_scrub_record_task

I don't have an rr trace at the moment, I'll change this to a link when (if) I manage to obtain one.

Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC

[15187] signal (11.2): Segmentation fault
in expression starting at /home/liozou/Desktop/reproducer.jl:47
unknown function (ip: 0x7f8dcd0cbb72)
realloc at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
arraylist_grow at /home/liozou/julia-1.10-bugtrack/src/support/arraylist.c:58
arraylist_push at /home/liozou/julia-1.10-bugtrack/src/support/arraylist.c:69
gc_scrub_record_task at /home/liozou/julia-1.10-bugtrack/src/gc-debug.c:529
gc_mark_outrefs at /home/liozou/julia-1.10-bugtrack/src/gc.c:2442 [inlined]
gc_mark_and_steal at /home/liozou/julia-1.10-bugtrack/src/gc.c:2743
gc_mark_loop_parallel at /home/liozou/julia-1.10-bugtrack/src/gc.c:2812
gc_mark_loop at /home/liozou/julia-1.10-bugtrack/src/gc.c:2833
_jl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3154
ijl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3451
maybe_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:935
jl_gc_big_alloc_inner at /home/liozou/julia-1.10-bugtrack/src/gc.c:1006
jl_gc_big_alloc_noinline at /home/liozou/julia-1.10-bugtrack/src/gc.c:1043
jl_gc_alloc_ at /home/liozou/julia-1.10-bugtrack/src/julia_internal.h:480
jl_gc_alloc at /home/liozou/julia-1.10-bugtrack/src/gc.c:3503
_new_array_ at /home/liozou/julia-1.10-bugtrack/src/array.c:134
ijl_array_copy at /home/liozou/julia-1.10-bugtrack/src/array.c:1181
copy at ./array.jl:411 [inlined]
run at /home/liozou/Desktop/reproducer.jl:31
main at /home/liozou/Desktop/reproducer.jl:43
unknown function (ip: 0x7f8db565c8d5)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2892
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
do_call at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:125
eval_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:222
eval_stmt_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:173
eval_body at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:616
jl_interpret_toplevel_thunk at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:774
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:934
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:877
ijl_toplevel_eval at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:943
ijl_toplevel_eval_in at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:985
eval at ./boot.jl:383 [inlined]
include_string at ./loading.jl:2070
jl_fptr_args at /home/liozou/julia-1.10-bugtrack/src/gf.c:2534
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46681 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_83011 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
true_main at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:582
jl_repl_entrypoint at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:731
jl_load_repl at /home/liozou/julia-1.10-bugtrack/cli/loader_lib.c:568
main at /home/liozou/julia-1.10-bugtrack/cli/loader_exe.c:58
unknown function (ip: 0x7f8dcd050d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/liozou/julia-1.10-bugtrack/usr/bin/julia-debug (unknown line)
Allocations: 767199 (Pool: 0; Other: 767199); GC: 2
Allocations: 767199 (Pool: 0; Other: 767199); GC: 2
Segmentation fault (core dumped)

double free or corruption

https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-12-25-Liozou.tar.zst

double free or corruption (!prev)

[4440] signal (6.-6): Aborted
in expression starting at /home/liozou/Desktop/reproducer.jl:47
Allocations: 458133 (Pool: 0; Other: 458133); GC: 0
Allocations: 458133 (Pool: 0; Other: 458133); GC: 0
Aborted

corrupted size vs. prev_size

I don't have an rr trace at the moment, I'll change this to a link when (if) I manage to obtain one.

Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
corrupted size vs. prev_size

[11552] signal (6.-6): Aborted
in expression starting at /home/liozou/Desktop/reproducer.jl:47
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f513ba1e675)
unknown function (ip: 0x7f513ba35cfb)
unknown function (ip: 0x7f513ba367e1)
unknown function (ip: 0x7f513ba39c2b)
realloc at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
arraylist_grow at /home/liozou/julia-1.10-bugtrack/src/support/arraylist.c:58
arraylist_push at /home/liozou/julia-1.10-bugtrack/src/support/arraylist.c:69
gc_scrub_record_task at /home/liozou/julia-1.10-bugtrack/src/gc-debug.c:529
gc_mark_outrefs at /home/liozou/julia-1.10-bugtrack/src/gc.c:2442 [inlined]
gc_mark_and_steal at /home/liozou/julia-1.10-bugtrack/src/gc.c:2743
gc_mark_loop_parallel at /home/liozou/julia-1.10-bugtrack/src/gc.c:2812
gc_mark_loop at /home/liozou/julia-1.10-bugtrack/src/gc.c:2833
_jl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3154
ijl_gc_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:3451
maybe_collect at /home/liozou/julia-1.10-bugtrack/src/gc.c:935
jl_gc_big_alloc_inner at /home/liozou/julia-1.10-bugtrack/src/gc.c:1006
jl_gc_big_alloc_noinline at /home/liozou/julia-1.10-bugtrack/src/gc.c:1043
jl_gc_alloc_ at /home/liozou/julia-1.10-bugtrack/src/julia_internal.h:480
jl_gc_alloc at /home/liozou/julia-1.10-bugtrack/src/gc.c:3503
_new_array_ at /home/liozou/julia-1.10-bugtrack/src/array.c:134
ijl_array_copy at /home/liozou/julia-1.10-bugtrack/src/array.c:1181
copy at ./array.jl:411 [inlined]
run at /home/liozou/Desktop/reproducer.jl:31
main at /home/liozou/Desktop/reproducer.jl:43
unknown function (ip: 0x7f512405c8d5)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2892
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
do_call at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:125
eval_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:222
eval_stmt_value at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:173
eval_body at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:616
jl_interpret_toplevel_thunk at /home/liozou/julia-1.10-bugtrack/src/interpreter.c:774
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:934
jl_toplevel_eval_flex at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:877
ijl_toplevel_eval at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:943
ijl_toplevel_eval_in at /home/liozou/julia-1.10-bugtrack/src/toplevel.c:985
eval at ./boot.jl:383 [inlined]
include_string at ./loading.jl:2070
jl_fptr_args at /home/liozou/julia-1.10-bugtrack/src/gf.c:2534
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46681 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_83011 at /home/liozou/julia-1.10-bugtrack/usr/lib/julia/sys-debug.so (unknown line)
_jl_invoke at /home/liozou/julia-1.10-bugtrack/src/gf.c:2873
ijl_apply_generic at /home/liozou/julia-1.10-bugtrack/src/gf.c:3074
jl_apply at /home/liozou/julia-1.10-bugtrack/src/julia.h:1976
true_main at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:582
jl_repl_entrypoint at /home/liozou/julia-1.10-bugtrack/src/jlapi.c:731
jl_load_repl at /home/liozou/julia-1.10-bugtrack/cli/loader_lib.c:568
main at /home/liozou/julia-1.10-bugtrack/cli/loader_exe.c:58
unknown function (ip: 0x7f513b9bed8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/liozou/julia-1.10-bugtrack/usr/bin/julia-debug (unknown line)
Allocations: 767132 (Pool: 0; Other: 767132); GC: 2
Allocations: 767132 (Pool: 0; Other: 767132); GC: 2
Aborted (core dumped)

corrupted size vs. prev_size while consolidating

https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-17-20-Liozou.tar.zst

Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
corrupted size vs. prev_size while consolidating

[6952] signal (6.-6): Aborted
in expression starting at /home/liozou/Desktop/reproducer.jl:47
Allocations: 1052429 (Pool: 0; Other: 1052429); GC: 4
Allocations: 1052429 (Pool: 0; Other: 1052429); GC: 4
Aborted

...as well as that thing (is it a crash of `rr` itself?)

https://julialang-dumps.s3.amazonaws.com/reports/2023-11-21T15-24-07-Liozou.tar.zst

Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
Warn. GC verify disabled in multi-threaded GC
[FATAL ./src/RecordSession.cc:1840:process_syscall_entry()] 
 (task 7941 (rec:7941) at time 90889)
 -> Assertion `t->desched_rec() || is_rrcall_notify_syscall_hook_exit_syscall( t->regs().original_syscallno(), t->arch()) || t->ip() == t->vm() ->privileged_traced_syscall_ip() .increment_by_syscall_insn_length(t->arch())' failed to hold. Stashed signal pending on syscall entry when it shouldn't be: {signo:SIGSTKFLT,errno:SUCCESS,code:sicode(1)}; IP=0x7f5b7ddff117
Tail of trace dump:
{
  real_time:1768.347155 global_time:90869, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7942, ticks:2037366
rax:0xffffffffffffffda rbx:0x0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x189 rdi:0x6e59680049c8 rbp:0x6e59680049a0 rsp:0x20c107582540 r8:0x0 r9:0xffffffff r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x2c r15:0x6e59680049c8 rip:0x7f5b7ddff117 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7c6fd640 gs_base:0x0
}
{
  real_time:1768.347192 global_time:90870, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7937, ticks:15469507831
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b840049c8 rbp:0x0 rsp:0x681ffd90 r8:0x7f5b840049a0 r9:0x7f5b840049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b840049c8 r15:0x7f5b840049a0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347210 global_time:90871, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7937, ticks:15469507831
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b840049c8 rbp:0x0 rsp:0x681ffd90 r8:0x7f5b840049a0 r9:0x7f5b840049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b840049c8 r15:0x7f5b840049a0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347234 global_time:90872, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7937, ticks:15469507877
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b880049cc rbp:0x1 rsp:0x681ffd90 r8:0x7f5b880049a0 r9:0x7f5b880049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b880049cc r15:0x7f5b880049a4 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347252 global_time:90873, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7937, ticks:15469507877
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b880049cc rbp:0x1 rsp:0x681ffd90 r8:0x7f5b880049a0 r9:0x7f5b880049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b880049cc r15:0x7f5b880049a4 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347258 global_time:90874, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7941, ticks:310086
rax:0x0 rbx:0x0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x189 rdi:0x7f5b880049cc rbp:0x7f5b880049a0 rsp:0x49494da4b080 r8:0x0 r9:0xffffffff r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x23 r15:0x7f5b880049cc rip:0x7f5b7ddff117 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x3ef133665640 gs_base:0x0
}
{
  real_time:1768.347278 global_time:90875, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7941, ticks:310113
rax:0xffffffffffffffda rbx:0x7f5b7cafefa0 rcx:0xffffffffffffffff rdx:0x2 rsi:0x80 rdi:0x7f5b88004978 rbp:0x7f5b880049a0 rsp:0x7f5b7cafed90 r8:0x0 r9:0x1 r10:0x0 r11:0x246 r12:0x80 r13:0x2 r14:0x7f5b88004978 r15:0x7f5b880049cc rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x3ef133665640 gs_base:0x0
}
{
  real_time:1768.347297 global_time:90876, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7940, ticks:3997551
rax:0x0 rbx:0x0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x189 rdi:0x7f5b840049c8 rbp:0x7f5b840049a0 rsp:0x4c6551f41080 r8:0x0 r9:0xffffffff r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x28 r15:0x7f5b840049c8 rip:0x7f5b7ddff117 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7ffd2d1a4640 gs_base:0x0
}
{
  real_time:1768.347322 global_time:90877, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7940, ticks:3997587
rax:0xffffffffffffffda rbx:0x7f5b7d3fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b84004978 rbp:0x4c6551f411b0 rsp:0x7f5b7d3ffd90 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b84004978 r15:0x4c6551f41798 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7ffd2d1a4640 gs_base:0x0
}
{
  real_time:1768.347338 global_time:90878, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7940, ticks:3997587
rax:0x0 rbx:0x7f5b7d3fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b84004978 rbp:0x4c6551f411b0 rsp:0x7f5b7d3ffd90 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b84004978 r15:0x4c6551f41798 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7ffd2d1a4640 gs_base:0x0
}
{
  real_time:1768.347506 global_time:90879, event:`SCHED' tid:7940, ticks:4008220
rax:0xff rbx:0x7f5b7e3456a0 rcx:0xbad57accbad67aff rdx:0x0 rsi:0x0 rdi:0x7f5b81bd3b40 rbp:0x3f8053966370 rsp:0x3f8053966330 r8:0x3f8053966020 r9:0x3f8053966040 r10:0x0 r11:0xca r12:0x1 r13:0x7f5b81b39fc0 r14:0x71114f764855 r15:0x7f5b7e32a6d8 rip:0x71114ec35498 eflags:0x202 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7ffd2d1a4640 gs_base:0x0
}
{
  real_time:1768.347528 global_time:90880, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7937, ticks:15469507896
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b88004978 rbp:0x7ffd2c91b870 rsp:0x681ffd90 r8:0x7f5b880049a0 r9:0x7f5b880049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b88004978 r15:0x1 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347546 global_time:90881, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7937, ticks:15469507896
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b88004978 rbp:0x7ffd2c91b870 rsp:0x681ffd90 r8:0x7f5b880049a0 r9:0x7f5b880049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b88004978 r15:0x1 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347568 global_time:90882, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7937, ticks:15469507935
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x6e59680049c8 rbp:0x0 rsp:0x681ffd90 r8:0x6e59680049a0 r9:0x6e59680049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x6e59680049c8 r15:0x6e59680049a0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347586 global_time:90883, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7937, ticks:15469507935
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x6e59680049c8 rbp:0x0 rsp:0x681ffd90 r8:0x6e59680049a0 r9:0x6e59680049c0 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x6e59680049c8 r15:0x6e59680049a0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.347613 global_time:90884, event:`SCHED' tid:7937, ticks:15469507944
rax:0x0 rbx:0xffffffffffffffff rcx:0xffffffffffffffff rdx:0x1 rsi:0x0 rdi:0x6e5968004978 rbp:0x7ffd2c91b870 rsp:0x7ffd2c91b840 r8:0x6e59680049a0 r9:0x6e59680049c0 r10:0x0 r11:0xca r12:0x13250c19c008 r13:0x7f5b7e704080 r14:0x7f5b7e32a6d8 r15:0x1 rip:0x7f5b7de07aa6 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f5b7df96b80 gs_base:0x0
}
{
  real_time:1768.350280 global_time:90885, event:`SCHED' tid:7940, ticks:4278344
rax:0xff rbx:0x7f5b7e3456a0 rcx:0xbad57accbad67aff rdx:0x0 rsi:0x0 rdi:0x7f5b81bd3b40 rbp:0x3f8053966370 rsp:0x3f8053966330 r8:0x3f8053966020 r9:0x3f8053966040 r10:0x0 r11:0xca r12:0x1 r13:0x7f5b81b39fc0 r14:0x71114f764855 r15:0x7f5b7e32a6d8 rip:0x71114ec35498 eflags:0x202 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7ffd2d1a4640 gs_base:0x0
}
{
  real_time:1768.350285 global_time:90886, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7941, ticks:310113
rax:0x0 rbx:0x7f5b7cafefa0 rcx:0xffffffffffffffff rdx:0x2 rsi:0x80 rdi:0x7f5b88004978 rbp:0x7f5b880049a0 rsp:0x7f5b7cafed90 r8:0x0 r9:0x1 r10:0x0 r11:0x246 r12:0x80 r13:0x2 r14:0x7f5b88004978 r15:0x7f5b880049cc rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x3ef133665640 gs_base:0x0
}
{
  real_time:1768.350309 global_time:90887, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:7941, ticks:310137
rax:0xffffffffffffffda rbx:0x7f5b7cafefa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b88004978 rbp:0x49494da4b1b0 rsp:0x7f5b7cafed90 r8:0x0 r9:0x1 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b88004978 r15:0x49494da4b798 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x3ef133665640 gs_base:0x0
}
{
  real_time:1768.350328 global_time:90888, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:7941, ticks:310137
rax:0x0 rbx:0x7f5b7cafefa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x81 rdi:0x7f5b88004978 rbp:0x49494da4b1b0 rsp:0x7f5b7cafed90 r8:0x0 r9:0x1 r10:0x0 r11:0x246 r12:0x81 r13:0x1 r14:0x7f5b88004978 r15:0x49494da4b798 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x3ef133665640 gs_base:0x0
}
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x5a)[0x55c9597f908a]
rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x4b5)[0x55c9596f4e15]
rr(+0xa4833)[0x55c9596fe833]
rr(+0xa5d8f)[0x55c9596ffd8f]
rr(_ZN2rr13RecordSession21process_syscall_entryEPNS_10RecordTaskEPNS0_9StepStateEPNS0_12RecordResultENS_13SupportedArchE+0x3d1)[0x55c959724e51]
rr(_ZN2rr13RecordSession29handle_seccomp_traced_syscallEPNS_10RecordTaskEPNS0_9StepStateEPNS0_12RecordResultEPb+0x616)[0x55c959726036]
rr(_ZN2rr13RecordSession19handle_ptrace_eventEPPNS_10RecordTaskEPNS0_9StepStateEPNS0_12RecordResultEPb+0x5d1)[0x55c959726a31]
rr(_ZN2rr13RecordSession11record_stepEv+0x2f7)[0x55c95972bb17]
rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0xd35)[0x55c959719895]
rr(main+0x138)[0x55c959698f88]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f0c26d47d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f0c26d47e40]
rr(_start+0x25)[0x55c95969bb15]
=== End rr backtrace
Launch gdb with
  gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:7941' /home/liozou/julia-1.10-bugtrack/usr/bin/julia-debug

I'm opening this as another issue instead of pushing it on #52184 because I don't know if the causes are the same, and it's a different context that does not affect the build of julia itself.

@Liozou Liozou added bug Indicates an unexpected problem or unintended behavior GC Garbage collector labels Nov 21, 2023
@vchuravy vchuravy added this to the 1.10 milestone Nov 21, 2023
@d-netto
Member

d-netto commented Nov 21, 2023

A lot of these seem to be coming from gc-debug.c functions which don't seem to be well-suited to be used with multiple GC threads.

For instance, gc_scrub_task uses the thread's current task, but GC threads have no tasks associated with them.

gc_scrub_record_task pushes into an array-list, which is not a thread-safe operation.

We should probably audit which of them can actually be used with the multi-threaded GC.

@PallHaraldsson
Contributor

PallHaraldsson commented Nov 21, 2023

When enabling threads it makes sense to have many GC threads too, and that is currently buggy. But does it make sense to run a single compute thread while still using many GC threads?

If that makes sense and is not buggy, could GC threads be disabled in the other (multithreaded) case so as not to block the release of 1.10? The fix for the multithreaded case could then be backported, or not...

@d-netto
Member

d-netto commented Nov 21, 2023

The code in gc-debug.c that I mentioned is not enabled unless you build Julia with debug flags (which the user above did).

This code is not very thoroughly tested since most people don't build with these flags, so it has probably just rotted, but fixing it should be in-scope for a backported bugfix IMO.

@Liozou
Member Author

Liozou commented Nov 22, 2023

Thank you for looking into this! So, as far as I understand, these crashes stem from a part of julia that is not usually built unless specifically required, so they should not affect people at large. The issue may be removed from the 1.10 milestone then.

However, I should stress that what I have been trying to tackle in the last few weeks is a crash that comes from a race condition that usually surfaces as a segfault in the gc, in a normal build of julia (on the backports-release-1.10 branch or master). It could be a true race condition coming from my code so I am reluctant to open an issue as long as I haven't minimized it to a point where I am confident it is a julia bug; but minimizing it is tricky, since it only occurs every now and then, and the codebase is large. This is why I have been trying to use a debug build, with all assertions turned on, in the hope that whatever may be causing the crash would surface more reliably (I also tried to build with TSAN, that's #51774).

Apparently, this debug build is causing other unrelated errors. I think it would still be valuable to fix them for the release of 1.10, because trying to identify bugs possibly stemming from the multi-threaded gc is particularly difficult at the moment. If you need me to give you more stacktraces or rr traces or whatever I can do to help, please let me know.

@PallHaraldsson
Contributor

PallHaraldsson commented Nov 22, 2023

It could be a true race condition coming from my code

Julia can't protect against race conditions in general (nor can Rust, which only protects against data races, not race conditions) outside of the GC, but I think (almost) whatever you do regarding allocations should be fine, right? GC is not in your hands, nor should you be able to screw it up in any way, unless if you write out of bounds. I think heap corruption would be the only way. So do you still have the problem if you run with: julia --check-bounds=yes =no

@Liozou
Member Author

Liozou commented Nov 22, 2023

GC is not in your hands, nor should you be able to screw it up in any way, unless if you write out of bounds.

Perhaps, but in general if I am causing a race condition, the rule of thumb is that anything goes (it's literally undefined behavior), and I believe a GC corruption, or at least a crash stemming from GC, is not unthinkable.
For instance, if I have two threads trying to resize! the same array and it requires allocating new memory, I would assume both threads could try to free the initial memory after the resize, but that could cause a double-free if the resize! is racy. And if it's the GC that's in charge of actually freeing that memory, then the crash will appear to come from GC, although the race condition is completely unrelated to actual GC.
Of course this is just an example and I don't know if array resizing works like that, but I would assume stuff like this could happen.
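As a rough illustration of the scenario above (a sketch only, not taken from the code in this issue; `locked_grow` is a hypothetical helper), several tasks can grow one shared vector safely if every `resize!` and the write that follows it happen under the same lock. Removing the lock would turn this into the kind of data race described above.

```julia
using Base.Threads

# Hypothetical helper: `ntasks` tasks each append `n` values to one shared vector.
function locked_grow(ntasks::Int, n::Int)
    shared = Int[]
    lk = ReentrantLock()
    @sync for _ in 1:ntasks
        @spawn for i in 1:n
            # Only one task at a time may resize and write to the vector.
            lock(lk) do
                resize!(shared, length(shared) + 1)
                shared[end] = i
            end
        end
    end
    return shared
end

v = locked_grow(4, 1000)
```

With the lock in place, the final length is the total number of appended values, whatever the interleaving of tasks.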

In any case, I am indeed still seeing that crash with --check-bounds=yes (I assume you mean yes instead of no?).

@PallHaraldsson
Contributor

PallHaraldsson commented Nov 22, 2023

if I am causing a race condition, the rule of thumb is that anything goes (it's literally undefined behavior), and I believe a GC corruption, or at least a crash stemming from GC, is not unthinkable.

Yes, as a rule of thumb UB means the so-called "nasal demons" may be released (though no, not literally anything can happen...), but I think this case is an exception:

For instance, if I have two threads trying to resize! the same array and it requires allocating new memory, I would assume both threads could try to free the initial memory after the resize, but that could cause a double-free

Whether you allocate directly or indirectly, I don't think it should matter. In Julia you don't strictly allocate directly (with a single syntax, e.g. malloc as in C), though you come close with e.g. similar. resize! also allocates indirectly, and people might not realize that it sometimes, though not always, needs to.

But even if your user code has a race condition, I think calling some Julia APIs should be safe. Yes, likely not all of them, but at least resize! (and every function meant just to do allocation for you, so similar too)? If not, its docs should have a warning about that?! And I only see this warning (without saying it is...) "the new elements are not guaranteed to be initialized."

[I think I can see race-condition issues with resize!d arrays, about the contents of the array, true, but I mean I think the allocation itself should be safe, also deallocation, no double-free.]

[I'm not sure reentrant applies, but at least] the allocation APIs of Julia are thread-safe, presumably, and not just because the underlying libc implementation, which is sometimes (not always!) called, is:

https://stackoverflow.com/questions/855763/is-malloc-thread-safe

Question: "is malloc reentrant"?
Answer: no, it is not. Here is one definition of what makes a routine reentrant.

None of the common versions of malloc allow you to re-enter it (e.g. from a signal handler). Note that a reentrant routine may not use locks, and almost all malloc versions in existence do use locks (which makes them thread-safe), or global/static variables (which makes them thread-unsafe and non-reentrant).

All the answers so far answer "is malloc thread-safe?", which is an entirely different question. To that question the answer is it depends on your runtime library, and possibly on the compiler flags you use. On any modern UNIX, you'll get a thread-safe malloc by default. On Windows, use /MT, /MTd, /MD or /MDd flags to get thread-safe runtime library.

I didn't look up whether free is thread-safe too. I suppose so, but it's not called from your threads anyway, only from the one GC thread (or, since there are now many GC threads, potentially from all of them; that should still be safe, otherwise it's a new bug, what you are hitting?).

[I'm not doubting you have a "double free or corruption (!prev)", I mean I can see that potential bug in Julia, I'm saying it would be a bug in Julia that could and should(?) be fixed, and your code should be able to run without GC issues. You might still have other issues related to race conditions, so you might want to fix that anyway.]

@PallHaraldsson
Contributor

PallHaraldsson commented Nov 22, 2023

Are you on Windows? This might be ok elsewhere, could someone check? I don't know what Julia does, i.e. this:

On Windows, use /MT, /MTd, /MD or /MDd flags to get thread-safe runtime library.

I did try your code in Julia (also 1.9.4 and .3) with 4 or more threads, 10000 or:

$ julia +1.10.0-rc1 -t 16 repr.jl 100000

but no issue, which is good, though I didn't use a debug build, the only one failing for you?

Does this make sense to you or get rid of the problem:

mutable struct LoadBalancer{T}
    @atomic channel::Channel{T}

Do you have an idea what the problem is related to? I thought it was related to Channel, and likely to put! on it (or take!?).

@Liozou
Member Author

Liozou commented Nov 22, 2023

I'm pretty sure the allocation APIs of Julia are not thread-safe in the presence of a race condition, unfortunately, since the documentation states:

Additionally, Julia is not memory safe in the presence of a data race. Be very careful about reading any data if another thread might write to it!

in addition to the entire https://docs.julialang.org/en/v1/manual/multi-threading/#Caveats paragraph. I would love for that guarantee to exist, but it's very complicated to implement in a systematic fashion, and anyway I think the performance implications are much too great to warrant that trade-off in the common case.
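For contrast with a data race, here is a minimal sketch of a shared counter that is safe to update from several tasks because every access goes through an atomic operation (`atomic_count` is a hypothetical helper, not part of the reproducer):

```julia
using Base.Threads

# Hypothetical helper: `ntasks` tasks each increment one shared counter `n` times.
function atomic_count(ntasks::Int, n::Int)
    counter = Atomic{Int}(0)
    @sync for _ in 1:ntasks
        @spawn for _ in 1:n
            # atomic_add! performs the read-modify-write as a single atomic step.
            atomic_add!(counter, 1)
        end
    end
    return counter[]
end

total = atomic_count(4, 10_000)
```

A plain `Ref{Int}` updated with `counter[] += 1` in the same loop would be a data race of the kind the documentation warns about.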

Are you on Windows?

No, this is on Linux.

I didn't use a debug build, the only one failing for you?

Indeed, the particular minimal code above only fails with the debug build, which is not too surprising since the errors mostly stem from gc-debug.c (except the "Assertion !freedall failed" one).

But let's not sidetrack this issue too much from the initial report, which focuses on the debug build. I'll post the stacktrace separately in a new issue once I have a small-ish reproducer!

@Liozou
Member Author

Liozou commented Nov 22, 2023

Does this make sense to you or get rid of the problem:

mutable struct LoadBalancer{T}
    @atomic channel::Channel{T}

It does not really make sense to me: adding an allocation (by making the struct mutable) is likely to only add problems, and the access to the channel has no reason to be @atomic if the binding of the channel field is constant (which it is, since my LoadBalancer is not mutable). But in any case, I tried and still observe the same failures, with and without the @atomic and on both my unreported issue with the normal build and on the issue at hand here with the debug build.
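To illustrate the point about immutability, a minimal sketch with hypothetical types (`FrozenHolder` and `AtomicHolder`, not the LoadBalancer from this issue): a field of an immutable struct can never be rebound, so there is nothing for `@atomic` to protect, whereas `@atomic` fields are only accepted in a mutable struct, where the binding can actually change.

```julia
# Hypothetical types, for illustration only.
struct FrozenHolder
    channel::Channel{Int}
end

mutable struct AtomicHolder
    @atomic channel::Channel{Int}
end

h = FrozenHolder(Channel{Int}(Inf))
# Rebinding a field of an immutable struct is rejected at run time.
rebind_failed = try
    h.channel = Channel{Int}(Inf)
    false
catch err
    err isa ErrorException
end

a = AtomicHolder(Channel{Int}(Inf))
old = @atomic a.channel                  # atomic load of the field
@atomic a.channel = Channel{Int}(Inf)    # atomic store of a new binding
```

Note that `@atomic` here only concerns the field binding; operations on the Channel itself, such as put! and take!, do their own synchronization.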

what you are hitting?

Unfortunately, good old

[142932] signal (11.1): Segmentation fault
in expression starting at none:1
Allocations: 18974010 (Pool: 18953473; Big: 20537); GC: 30
Segmentation fault (core dumped)

with no stacktrace. I had managed to get some kind of stacktrace from looking at it in gdb and I remember it started from some place in the gc (which again, doesn't prove that the gc is buggy), but I don't have it at the moment. Anyway, again, let's keep that problem separate from the issue here, since it could well be my particular code which is at fault in that case, and I will open an issue in due time once I can be sure it is not the case.

@vtjnash
Member

vtjnash commented Aug 15, 2024

Disabled by #48600

@vtjnash vtjnash closed this as completed Aug 15, 2024