Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't segfault when running atexit before jl_threads_init #45765

Merged
merged 3 commits into from
Jun 29, 2022

Conversation

vchuravy
Copy link
Member

Co-debugged with @staticfloat on ci-dev call

@vchuravy vchuravy added backport 1.6 Change should be backported to release-1.6 backport 1.7 backport 1.8 Change should be backported to release-1.8 labels Jun 20, 2022
@vchuravy vchuravy requested a review from vtjnash June 20, 2022 18:47
@giordano giordano linked an issue Jun 20, 2022 that may be closed by this pull request
test/cmdlineargs.jl Outdated Show resolved Hide resolved
@vtjnash vtjnash removed backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Jun 21, 2022
@vtjnash
Copy link
Member

vtjnash commented Jun 21, 2022

CI apparently does not like this. It also might be good to stop running unsupported configurations on buildkite. Though we should start to see threads being supported in releases soon.

cmdlineargs                               (8) |         failed at 2022-06-20T19:59:01.658
Test Failed at /cache/build/default-amdci4-0/julialang/julia-master/julia-72ed6e5bcd/share/julia/test/cmdlineargs.jl:680
  Expression: errors_not_signals(p)
Test Failed at /cache/build/default-amdci4-0/julialang/julia-master/julia-72ed6e5bcd/share/julia/test/cmdlineargs.jl:682
  Expression: p.exitcode == 1
   Evaluated: -9223372036854775808 == 1
Test Failed at /cache/build/default-amdci4-0/julialang/julia-master/julia-72ed6e5bcd/share/julia/test/cmdlineargs.jl:680
  Expression: errors_not_signals(p)
Test Failed at /cache/build/default-amdci4-0/julialang/julia-master/julia-72ed6e5bcd/share/julia/test/cmdlineargs.jl:682
  Expression: p.exitcode == 1
   Evaluated: -9223372036854775808 == 1
Test Failed at /cache/build/default-amdci4-0/julialang/julia-master/julia-72ed6e5bcd/share/julia/test/cmdlineargs.jl:691
  Expression: errors_not_signals(p)
Test Failed at /cache/build/default-amdci4-0/julialang/julia-master/julia-72ed6e5bcd/share/julia/test/cmdlineargs.jl:693
  Expression: p.exitcode == 1
   Evaluated: -9223372036854775808 == 1
      From worker 37:	ERROR: option `-e/--eval` is missing an argument
      From worker 37:	ERROR: option `-e/--eval` is missing an argument
      From worker 37:	ERROR: option `-E/--print` is missing an argument
      From worker 37:	ERROR: option `-E/--print` is missing an argument
      From worker 37:	ERROR: option `-L/--load` is missing an argument
      From worker 37:	ERROR: option `-L/--load` is missing an argument
      From worker 37:	fatal: error thrown and no exception handler available.
      From worker 37:	ErrorException("Invalid CPU name "invalidtarget".")
      From worker 37:
      From worker 37:	signal (11): Segmentation fault
      From worker 37:	in expression starting at none:0
      From worker 37:
      From worker 37:	signal (6): Aborted
      From worker 37:	in expression starting at none:0
      From worker 37:	fatal: error thrown and no exception handler available.
      From worker 37:	ErrorException("Invalid CPU name "invalidtarget".")
      From worker 37:
      From worker 37:	signal (11): Segmentation fault
      From worker 37:	in expression starting at none:0
      From worker 37:
      From worker 37:	signal (6): Aborted
      From worker 37:	in expression starting at none:0

@vchuravy
Copy link
Member Author

How is setting JULIA_NUM_THREADS an unsupported configuration?

CI apparently does not like this.

So yeah this isn't complete :)

@vtjnash
Copy link
Member

vtjnash commented Jun 21, 2022

How is setting JULIA_NUM_THREADS an unsupported configuration?

CI apparently does not like this.

Because CI apparently does not like this

@staticfloat staticfloat force-pushed the vc/safe_atexit branch 2 times, most recently from 8aa99fe to c54bc7f Compare June 22, 2022 19:40
test/cmdlineargs.jl Outdated Show resolved Hide resolved
@staticfloat staticfloat force-pushed the vc/safe_atexit branch 3 times, most recently from 04ee022 to 178d6f4 Compare June 24, 2022 17:21
@staticfloat
Copy link
Member

Alright, that latest fix has solved the majority of problems, but i686-linux-gnu still segfaults on julia -C invalid.

Digging in with a debugger, we fail at a different point on i686-linux-gnu, because we don't hit the CX16 failure branch, but instead hit the "Invalid CPU Name" error. This is much later in the bringup process, but still not early enough that we can do things like call jl_DI_for_fptr, which is what we try to do when we print backtraces due to calling jl_errorf().

I've now pushed a commit that avoids printing backtraces when jl_errorf() is called, in a similar manner to how jl_error() avoids backtraces, but I had to extend the guard condition to use other data structures, as it turns out that the existing guard condition was insufficient. I don't know if this is the proper guard condition (maybe we should create a jl_can_collect_backtraces sentinel global value?) but CI will tell us if this works or not.

@staticfloat staticfloat force-pushed the vc/safe_atexit branch 2 times, most recently from 03527d4 to e6d4289 Compare June 27, 2022 19:15
@staticfloat
Copy link
Member

Alright, this is now fixing the identified problems with running atexit hooks, however I'm quite unsatisfied with my solution of using jl_error_sym as a sentinel for the fact that LLVM has been initialized or not. Is there a more direct method I can use, or should I add a jl_llvm_initialized sentinel value or something?

@JeffBezanson
Copy link
Member

Maybe we should have jl_init_llvm handle filling in the libjulia-codegen function pointers, so the system just behaves as if codegen is not available before it's initialized?

@staticfloat
Copy link
Member

Following Jeff's advice on Slack, I have removed the potentially more controversial parts of this PR, manually inlining the "print and exit" logic from jl_error* in the cases where we know it is too early to backtrace. Looking on CI, I got a natural green flush, and have verified that there are no mysterious core files uploaded, so I'm calling this a victory.

We accidentally ignored some test failures because we only tested for `!success(p)`, which passes even if `p` segfaulted.
When calling `jl_error()` or `jl_errorf()`, we must check to see if we
are so early in the bringup process that it is dangerous to attempt to
construct a backtrace because the data structures used to provide line
information are not properly setup.

This can be easily triggered by running:

```
julia -C invalid
```

On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name"
branch in `jitlayers.cpp`, which calls `jl_errorf()`.  This in turn
calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part
of the backtrace printing process, which fails as the object maps are
not fully initialized.  See the below `gdb` stacktrace for details:

```
$ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid
...
fatal: error thrown and no exception handler available.
ErrorException("Invalid CPU name "invalid".")

Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277
1277    /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory.
 #0  0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277
 #1  std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258
 #2  jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181
 #3  0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210
 #4  0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636
 #5  0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657
 #6  jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090
 #7  0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605
 #8  0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638
 #9  0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654
 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77
 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823
 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044
 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585
 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648
 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131
 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184
 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424
 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157
 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741
 #20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728
 #21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705
 #22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59
```

To prevent this, we simply avoid calling `jl_errorf` this early in the
process, punting the problem to a later PR that can update guard
conditions within `jl_error*`.
@staticfloat staticfloat merged commit b11ccae into master Jun 29, 2022
@staticfloat staticfloat deleted the vc/safe_atexit branch June 29, 2022 03:11
@staticfloat
Copy link
Member

Further discussion about making jl_error* safe can be found in this issue: #45847

@vchuravy
Copy link
Member Author

Thanks Elliot for getting this over the line!

@KristofferC KristofferC removed the backport 1.8 Change should be backported to release-1.8 label Aug 17, 2022
staticfloat added a commit that referenced this pull request Apr 24, 2023
This is the same as #45765 where
we use `jl_error()` too early to get backtraces, but too late to fail
the supposed guard `if` statement that should prevent us from trying to
take a backtrace.  X-ref: #45847
staticfloat added a commit that referenced this pull request Apr 24, 2023
This is the same as #45765 where
we use `jl_error()` too early to get backtraces, but too late to fail
the supposed guard `if` statement that should prevent us from trying to
take a backtrace.  X-ref: #45847
staticfloat added a commit that referenced this pull request Apr 24, 2023
This is the same as #45765 where
we use `jl_error()` too early to get backtraces, but too late to fail
the supposed guard `if` statement that should prevent us from trying to
take a backtrace.  X-ref: #45847
staticfloat added a commit that referenced this pull request Apr 24, 2023
This is the same as #45765 where
we use `jl_error()` too early to get backtraces, but too late to fail
the supposed guard `if` statement that should prevent us from trying to
take a backtrace.  X-ref: #45847

(cherry picked from commit fa21589)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Your CPU does not support the CX16 instruction" error segfaults?
6 participants