s390x regression: failing io::tests::try_oom_error #133806

Open
uweigand opened this issue Dec 3, 2024 · 6 comments
Labels
A-ABI Area: Concerning the application binary interface (ABI) C-bug Category: This is a bug. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-miscompile Issue: Correct Rust code lowers to incorrect machine code O-SystemZ Target: SystemZ processors (s390x) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@uweigand
Contributor

uweigand commented Dec 3, 2024

As of this merge commit:

commit d53f0b1d8e261f2f3535f1cd165c714fc0b0b298
Merge: a2545fd6fc6 4a216a25d14
Author: bors <[email protected]>
Date:   Thu Nov 28 21:44:34 2024 +0000

    Auto merge of #123244 - Mark-Simulacrum:share-inline-never-generics, r=saethlin

I'm seeing the following test case failure. Note that the test passes in both parents (a2545fd and 4a216a2) of the merge commit.

thread 'io::tests::try_oom_error' panicked at std/src/io/tests.rs:822:62:
called `Result::unwrap_err()` on an `Ok` value: ()
stack backtrace:
   0:      0x3fff7dd6702 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h0eec3d9053c23c0f
   1:      0x3fff7e37506 - core::fmt::write::h66866b531685abe5
   2:      0x3fff7dc575e - std::io::Write::write_fmt::h89ced3ac9904279e
   3:      0x3fff7dd6570 - std::sys::backtrace::BacktraceLock::print::h363d5b9cad1f5c19
   4:      0x3fff7df62e4 - std::panicking::default_hook::{{closure}}::ha4b8eaf1f6a37f57
   5:      0x3fff7df60da - std::panicking::default_hook::hda41cc1e1c3b4efa
   6:      0x2aa00430d78 - test::test_main::{{closure}}::h4d9e2859f981c511
   7:      0x3fff7df6aa0 - std::panicking::rust_panic_with_hook::heff88192ef2a89fb
   8:      0x3fff7dd6d52 - std::panicking::begin_panic_handler::{{closure}}::hff5589d5c45993a6
   9:      0x3fff7dd69b4 - std::sys::backtrace::__rust_end_short_backtrace::h165daf71d9abcca8
  10:      0x3fff7df63ca - rust_begin_unwind
  11:      0x3fff7d4aa6a - core::panicking::panic_fmt::hec8c29ccd1751d1e
  12:      0x3fff7d4b948 - core::result::unwrap_failed::h47cf11019e236d96
  13:      0x2aa001c5a9a - core::ops::function::FnOnce::call_once::h5453841f675c42ec
  14:      0x2aa00436d74 - test::__rust_begin_short_backtrace::h31f93d45aa944e21
  15:      0x2aa00436f62 - test::run_test_in_process::h617ed5302028c350
  16:      0x2aa0042a67e - std::sys::backtrace::__rust_begin_short_backtrace::hbc434a15ea7a090f
  17:      0x2aa00425e14 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h2f86d2c09a8a35d2
  18:      0x3fff7df33a8 - std::sys::pal::unix::thread::Thread::new::thread_start::hce74d4c3b42eec78
  19:      0x3fff7bac3fa - start_thread
                               at /usr/src/debug/glibc-2.39-17.1.ibm.fc40.s390x/nptl/pthread_create.c:447:8
  20:      0x3fff7c2bde0 - thread_start
                               at /usr/src/debug/glibc-2.39-17.1.ibm.fc40.s390x/misc/../sysdeps/unix/sysv/linux/s390/s390-64/clone3.S:71
  21:                0x0 - <unknown>

I've tried debugging the test, but if I'm reading this correctly, the test function was already completely optimized out and replaced by a failed assertion at compile time:

Dump of assembler code for function _ZN4core3ops8function6FnOnce9call_once17h5453841f675c42ecE:
   0x000002aa001c5a60 <+0>:     stmg    %r6,%r15,48(%r15)
   0x000002aa001c5a66 <+6>:     aghi    %r15,-168
   0x000002aa001c5a6a <+10>:    lgr     %r11,%r15
   0x000002aa001c5a6e <+14>:    lgrl    %r1,0x2aa00568f28
   0x000002aa001c5a74 <+20>:    lb      %r0,0(%r1)
   0x000002aa001c5a7a <+26>:    la      %r4,167(%r11)
   0x000002aa001c5a7e <+30>:    larl    %r2,0x2aa00481e7c <anon.6846cc147164699b42462cc8b979de03.18.llvm.3644326088524771271>
   0x000002aa001c5a84 <+36>:    lghi    %r3,46
   0x000002aa001c5a88 <+40>:    larl    %r5,0x2aa00545d08 <anon.6846cc147164699b42462cc8b979de03.17.llvm.3644326088524771271>
   0x000002aa001c5a8e <+46>:    larl    %r6,0x2aa00546f78 <anon.6846cc147164699b42462cc8b979de03.473.llvm.3644326088524771271>
   0x000002aa001c5a94 <+52>:    brasl   %r14,0x2aa0005c0e0 <_ZN4core6result13unwrap_failed17h47cf11019e236d96E@plt>

Note the unconditional call to unwrap_failed.
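For context, here is a minimal sketch of the kind of check the failing test appears to perform, judging only from the panic message (`unwrap_err()` on an `Ok(())`). The function name `reserve_huge` and the exact size requested are assumptions for illustration, not copied from std/src/io/tests.rs. A request this large cannot be satisfied, so a correctly compiled build should always take the `Err` path; an unconditional call to unwrap_failed means the compiler concluded the result is always `Ok`.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: ask a `Vec<u8>` to reserve close to `isize::MAX`
/// bytes. The size is an assumption chosen so that the request is
/// expected to fail rather than succeed.
#[inline(never)]
fn reserve_huge() -> Result<(), TryReserveError> {
    let mut v: Vec<u8> = Vec::new();
    v.try_reserve(isize::MAX as usize - 1)
}

fn main() {
    // A correct build is expected to report the error; the miscompiled
    // test behaved as if the `Ok` arm were the only possibility.
    match reserve_huge() {
        Ok(()) => println!("unexpected Ok"),
        Err(e) => println!("got expected error: {e}"),
    }
}
```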

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 3, 2024
@bjorn3 bjorn3 added O-SystemZ Target: SystemZ processors (s390x) I-miscompile Issue: Correct Rust code lowers to incorrect machine code labels Dec 3, 2024
@saethlin saethlin added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Dec 3, 2024
@uweigand
Contributor Author

uweigand commented Dec 3, 2024

As requested by @saethlin, I tried the previous commit a2545fd using

RUSTFLAGS_NOT_BOOTSTRAP=-Zshare-generics ./x.py test

Interestingly enough, the io::tests::try_oom_error test still succeeds. However, another test is now failing:

[uweigand@a35lp68 rust]$ LD_LIBRARY_PATH=./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib \
    ./build/s390x-unknown-linux-gnu/stage1-std/s390x-unknown-linux-gnu/release/deps/alloctests-b4087bb360d3d1cf \
    sort::tests::stable::panic_retain_orig_set_cell_i32_random_d2

running 2 tests
memory allocation of 13192931584848 bytes failed
memory allocation of 13193334237648 bytes failed
Aborted (core dumped)

Backtrace shows:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x000003fff7aae406 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#2  0x000003fff7a54460 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x000003fff7a3449c in __GI_abort () at abort.c:79
#4  0x000003fff7d97b94 in std::sys::pal::unix::abort_internal ()
   from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#5  0x000003fff7d72fb4 in std::process::abort () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#6  0x000003fff7d97c0c in std::alloc::rust_oom () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#7  0x000003fff7d97c30 in __rg_oom () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#8  0x000003fff7d752f4 in alloc::alloc::handle_alloc_error () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#9  0x000003fff7d752d4 in alloc::raw_vec::handle_error () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#10 0x000002aa0009b882 in alloctests::sort::tests::panic_retain_orig_set_cell_i32_random_d2_impl ()
#11 0x000002aa0018e238 in core::ops::function::FnOnce::call_once ()
#12 0x000002aa00266e04 in test::__rust_begin_short_backtrace ()
#13 0x000002aa00267024 in test::run_test_in_process ()
#14 0x000002aa00290e8e in std::sys::backtrace::__rust_begin_short_backtrace ()
#15 0x000002aa00268ee4 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#16 0x000003fff7de4e88 in std::sys::pal::unix::thread::Thread::new::thread_start ()
   from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#17 0x000003fff7aac3fa in start_thread (arg=0x3fff79008c0) at pthread_create.c:447
#18 0x000003fff7b2bde0 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone3.S:71

Not sure if this is a related problem (at least it's also somewhere around OOM handling ...).

@saethlin
Member

saethlin commented Dec 3, 2024

So it sounds to me like the increased use of "share-generics codegen" has exposed a pre-existing miscompile. -Zshare-generics is of course an unstable flag, but it is on by default in unoptimized builds so it is not really a niche option.

What is in your config.toml and exactly what command are you running to hit these crashes? I just want to make extra sure that this can or can't be reproduced on x86_64.

@jieyouxu jieyouxu added C-bug Category: This is a bug. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 3, 2024
@uweigand
Contributor Author

uweigand commented Dec 3, 2024

> What is in your config.toml and exactly what command are you running to hit these crashes?

Nothing special as far as I can see ... config.toml is:

profile = "user"
[build]
extended = true
sanitizers = true
profiler = true
[rust]
lld = true

and then I'm running the following to trigger the failure:

./x.py build 
RUSTFLAGS_NOT_BOOTSTRAP=-Zshare-generics ./x.py test

All this is running natively on an s390x Linux system (Fedora 40, if it matters).

@saethlin
Member

saethlin commented Dec 4, 2024

Confirming that I ran exactly that on my x86_64 Arch Linux dev machine and none of the library tests fail. (A ui test and an assembly test fail, but what those tests are doing is just incompatible with adding the flag.)

So I think either the bad code or bad IR is gated behind a set of cfgs that are only toggled by s390x, or this is an LLVM backend issue.

In either case, minimizing a reproducer would be ideal. I don't know if the standard library build is going to be relevant here; you should be able to extract a simple test case that misbehaves on nightlies before that PR with RUSTFLAGS=-Zshare-generics cargo run -Zbuild-std --release. If you get different behavior with and without build-std, then the standard library is relevant.

The thorny part about share-generics is that it changes how codegen works, depending on how your dependencies were compiled. And the standard library is always a dependency. So isolating this could be difficult.
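To make the dependency on upstream compilation concrete, here is a rough sketch. The helper `grow` is hypothetical and not part of the standard library. The idea, as I understand it, is that with share-generics an instantiation such as `grow::<u8>` that was already compiled into a dependency can be linked against instead of being instantiated again in the current crate, so which machine code actually runs depends on how that dependency was built.

```rust
use std::collections::TryReserveError;

// Hypothetical generic helper, for illustration only. Being generic and
// `#[inline(never)]`, it is the kind of function whose instantiations
// are candidates for sharing between crates rather than being inlined.
#[inline(never)]
pub fn grow<T>(v: &mut Vec<T>, additional: usize) -> Result<(), TryReserveError> {
    v.try_reserve(additional)
}

fn main() {
    let mut bytes: Vec<u8> = Vec::new();
    // This call needs the instantiation `grow::<u8>`. In a single-crate
    // program like this one it is always generated locally; the sharing
    // only comes into play when the generic lives in another crate.
    let small = grow(&mut bytes, 16);
    assert!(small.is_ok());
    assert!(bytes.capacity() >= 16);
}
```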

@uweigand
Contributor Author

uweigand commented Dec 4, 2024

There have been some interesting developments: as of today, current mainline no longer shows the test case failure. I've been able to track the change down to this PR: #133701, and specifically the single changed line in library/std/src/sys/pal/unix/process/process_common.rs in that diff. Why this change should fix the problem is quite unclear. I'll try to track down differences in compiled code between the two source trees differing only in that one line.

@jieyouxu jieyouxu added the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Dec 7, 2024
@uweigand
Contributor Author

> I've been able to track the change down to this PR: #133701, and specifically the single changed line in library/std/src/sys/pal/unix/process/process_common.rs in that diff. Why this change should fix the problem is quite unclear.

This was mostly a red herring. It turns out that whether or not the bug is seen depends on the partitioning of code between different codegen units, which can be affected in various ways by random source code changes. Most of these random effects go away when forcing -Ccodegen-units=1. This also explains why I had been unable to create assembler or IR files showing the problem: --emit=asm or --emit=llvm-ir implicitly enforces a single codegen unit, which often changes the behavior significantly.

Using both -Zshare-generics and -Ccodegen-units=1, I was able to bisect to the actual commit that introduces those bugs: #131586. While I still don't fully understand why this introduced the problem, at least it makes sense, as it is an actual codegen change for s390x. I'll investigate further.

@saethlin saethlin added the A-ABI Area: Concerning the application binary interface (ABI) label Dec 11, 2024