3.3.0-dev: Segfault in rb_profile_frames with N:M threads enabled #221

Open
casperisfine opened this issue Nov 20, 2023 · 3 comments · Fixed by ruby/ruby#9311


@casperisfine
Contributor

We run our CI against ruby-head, and a portion of it with N:M threads enabled on the main Ractor. rb_profile_frames often crashes when N:M threads are enabled.

I suspect the assumption that rb_profile_frames is async signal safe no longer holds there?

[BUG] Segmentation fault at 0x0000000000000010
ruby 3.3.0dev (2023-11-19T03:01:05Z shopify 9aee12cc28) +MN [x86_64-linux]

-- C level backtrace information -------------------------------------------
/usr/local/ruby/bin/ruby(rb_print_backtrace+0x14) [0x55c8f16333e1] /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/vm_dump.c:812
/usr/local/ruby/bin/ruby(rb_vm_bugreport) /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/vm_dump.c:1143
/usr/local/ruby/bin/ruby(rb_bug_for_fatal_signal+0xfc) [0x55c8f17e559c] /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/error.c:1065
/usr/local/ruby/bin/ruby(sigsegv+0x4d) [0x55c8f158091d] /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/signal.c:920
/lib/x86_64-linux-gnu/libc.so.6(0x7ff3db333520) [0x7ff3db333520]
/usr/local/ruby/bin/ruby(thread_profile_frames+0x10) [0x55c8f162efd0] /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/vm_backtrace.c:1587
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_buffer_sample+0x28) [0x7ff3b7d2435c] /tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/ext/stackprof/stackprof.c:622
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_buffer_sample) /tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/ext/stackprof/stackprof.c:604
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_buffer_sample) (null):0
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_signal_handler+0x5) [0x7ff3b7d24545] /tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/ext/stackprof/stackprof.c:767
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_signal_handler) /tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/ext/stackprof/stackprof.c:722
/lib/x86_64-linux-gnu/libc.so.6(0x7ff3db333520) [0x7ff3db333520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x24a) [0x7ff3db384a7a]
/usr/local/ruby/bin/ruby(rb_native_cond_wait+0xb) [0x55c8f15c8cbb] /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/thread_pthread.c:214
/usr/local/ruby/bin/ruby(ractor_sched_deq) /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/thread_pthread.c:1230
/usr/local/ruby/bin/ruby(nt_start) /tmp/ruby-build/ruby-3.3.0-9aee12cc28cbca40306784e54e38558688caa9f7/thread_pthread.c:2209

cc @ko1 @tenderlove @jhawthorn

@casperisfine
Contributor Author

I tried to use postponed jobs, but it actually crashes even more:

[BUG] Segmentation fault at 0x0000000000000020
ruby 3.3.0dev (2023-11-22T17:01:13Z shopify c1fc1a00ea) +MN [x86_64-linux]

-- Machine register context ------------------------------------------------
 RIP: 0x000055df5fe38489 RBP: 0x00007f517bc59000 RSP: 0x00007f5123f3c6d0
 RAX: 0x0000000000000000 RBX: 0x00007f51596554e0 RCX: 0x00007f517bdb6b40
 RDX: 0x0000000000000001 RDI: 0x00007f517bc59000 RSI: 0x0000000000000000
  R8: 0x0000000000000000  R9: 0x00000000ffffffff R10: 0x0000000000000000
 R11: 0x0000000000000246 R12: 0x0000000000000000 R13: 0x0000000000000000
 R14: 0x0000000000001ea5 R15: 0x00007f517bc59104 EFL: 0x0000000000010202

-- C level backtrace information -------------------------------------------
/usr/local/ruby/bin/ruby(rb_print_backtrace+0x14) [0x55df5fe328e1] /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/vm_dump.c:812
/usr/local/ruby/bin/ruby(rb_vm_bugreport) /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/vm_dump.c:1143
/usr/local/ruby/bin/ruby(rb_bug_for_fatal_signal+0xfc) [0x55df5ffe509c] /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/error.c:1065
/usr/local/ruby/bin/ruby(sigsegv+0x4d) [0x55df5fd7f19d] /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/signal.c:920
/lib/x86_64-linux-gnu/libc.so.6(0x7f517c277520) [0x7f517c277520]
/usr/local/ruby/bin/ruby(rbimpl_atomic_or+0x0) [0x55df5fe38489] /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/vm_trace.c:1691
/usr/local/ruby/bin/ruby(postponed_job_register) /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/vm_trace.c:1693
/usr/local/ruby/bin/ruby(postponed_job_register) /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/vm_trace.c:1675
/usr/local/ruby/bin/ruby(rb_postponed_job_register_one) /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/vm_trace.c:1746
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_signal_handler+0x2d) [0x7f5159655434] /tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/ext/stackprof/stackprof.c:763
/tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/lib/stackprof/stackprof.so(stackprof_signal_handler) /tmp/bundle/ruby/3.3.0+0/gems/stackprof-0.2.25/ext/stackprof/stackprof.c:722
/lib/x86_64-linux-gnu/libc.so.6(0x7f517c277520) [0x7f517c277520]
/lib/x86_64-linux-gnu/libc.so.6(0x7f517c2c6117) [0x7f517c2c6117]
/lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x211) [0x7f517c2c8a41]
/usr/local/ruby/bin/ruby(rb_native_cond_wait+0xb) [0x55df5fdc75fb] /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/thread_pthread.c:214
/usr/local/ruby/bin/ruby(ractor_sched_deq) /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/thread_pthread.c:1230
/usr/local/ruby/bin/ruby(nt_start) /tmp/ruby-build/ruby-3.3.0-c1fc1a00ea9633961153451d0e927db49c1b268d/thread_pthread.c:2209
/lib/x86_64-linux-gnu/libc.so.6(0x7f517c2c9ac3) [0x7f517c2c9ac3]

@jhawthorn
Contributor

jhawthorn commented Nov 23, 2023

Yes, I think postponed jobs are likely to have the same problem. From what I can see M:N threads sets the EC to NULL at least part of the time (though I'd thought postponed jobs had a check for that happening and would delegate to the main thread/ec). I'm looking for solutions in Vernier, though likely a little help is needed upstream.

We could fairly easily make rb_profile_frames safe to call when EC is NULL, but I worry that would "work" but give inaccurate results (I'm checking what that looks like).
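To make the trade-off concrete, here is a rough sketch of what such a guard could look like. It is not Ruby's actual code: `exec_ctx` and `sample_frames` are hypothetical stand-ins for the real EC and rb_profile_frames, and returning -1 for "no context" is an assumption made here so a profiler could count a skipped sample instead of recording an empty, misleading stack.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VM execution context.
 * This is NOT Ruby's rb_execution_context_t. */
typedef struct {
    const void **frames; /* innermost frame first */
    size_t depth;        /* number of valid entries in frames */
} exec_ctx;

/* Copy up to `limit` frames from `ctx` into `out`.
 *
 * Returns the number of frames copied, or -1 when there is no
 * execution context (for example, a native thread that currently has
 * no Ruby thread scheduled on it). The -1 lets the caller distinguish
 * "sample skipped" from "stack was empty". */
int sample_frames(const exec_ctx *ctx, const void **out, int limit)
{
    if (ctx == NULL) {
        return -1; /* assumed convention: nothing to sample here */
    }
    if (out == NULL || limit <= 0) {
        return 0;
    }

    size_t n = ctx->depth < (size_t)limit ? ctx->depth : (size_t)limit;
    for (size_t i = 0; i < n; i++) {
        out[i] = ctx->frames[i];
    }
    return (int)n;
}
```

A caller would treat a negative return as a dropped sample and keep a separate counter for it, which at least makes the inaccuracy visible rather than silently skewing the profile.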

Out of curiosity, is this in CPU mode? In :wall mode, I'd have somewhat expected my hack, which always samples the thread StackProf was started from, to avoid this. In what way is M:N enabled, using the ENV var or via a ractor?

@casperisfine
Contributor Author

> Out of curiosity, is this in CPU mode?

Not 100% sure, but I don't think so.

> In what way is M:N enabled, using the ENV var or via a ractor?

Via ENV, so that it's active on the main Ractor.

I filed https://bugs.ruby-lang.org/issues/20016 and https://bugs.ruby-lang.org/issues/20017, because if even postponed jobs can't work, it's definitely a Ruby bug and something to fix upstream.
