Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: tests/codegen/issues/issue-101082.rs fails with -Ctarget-cpu=x86-64-v3 #131563

Open
cuviper opened this issue Oct 11, 2024 · 9 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another.

Comments

@cuviper
Copy link
Member

cuviper commented Oct 11, 2024

Code

I tried this test with -Ctarget-cpu=x86-64-v3 (which we have on by default in the upcoming RHEL 10):

tests/codegen/issues/issue-101082.rs

//@ compile-flags: -O

#![crate_type = "lib"]

#[no_mangle]
pub fn test() -> usize {
    // CHECK-LABEL: @test(
    // CHECK: ret {{i64|i32}} 165
    let values = [23, 16, 54, 3, 60, 9];
    let mut acc = 0;
    for item in values {
        acc += item;
    }
    acc
}

I expected to see this happen: FileCheck pass

Instead, this happened:

issue-101082.rs:8:12: error: CHECK: expected string not found in input

As of rustc 1.83.0-nightly (52fd99839 2024-10-10), that LLVM IR is:

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(inaccessiblemem: write) uwtable
define noundef i64 @test() unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %iter = alloca [64 x i8], align 8
  call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %iter)
  store i64 23, ptr %iter, align 8
  %_3.sroa.0.sroa.4.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 8
  store i64 16, ptr %_3.sroa.0.sroa.4.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.5.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 16
  store i64 54, ptr %_3.sroa.0.sroa.5.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.6.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 24
  store i64 3, ptr %_3.sroa.0.sroa.6.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.7.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 32
  store i64 60, ptr %_3.sroa.0.sroa.7.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.8.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 40
  store i64 9, ptr %_3.sroa.0.sroa.8.0.iter.sroa_idx, align 8
  %unmaskedload = load <4 x i64>, ptr %iter, align 8, !alias.scope !3
  %0 = getelementptr inbounds i8, ptr %iter, i64 32
  %unmaskedload10 = load <4 x i64>, ptr %0, align 8, !alias.scope !3
  %1 = add <4 x i64> %unmaskedload10, %unmaskedload
  %2 = shufflevector <4 x i64> %1, <4 x i64> %unmaskedload, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %3 = tail call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %2)
  call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %iter)
  ret i64 %3
}

Reducing to x86-64-v2 does get the expected output:

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(inaccessiblemem: write) uwtable
define noundef i64 @test() unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  ret i64 165
}

Version it worked on

It most recently worked on: Rust 1.79.0

Version with regression

rustc --version --verbose:

rustc 1.80.0 (051478957 2024-07-21)
binary: rustc
commit-hash: 051478957371ee0084a7c0913941d2a8c4757bb9
commit-date: 2024-07-21
host: x86_64-unknown-linux-gnu
release: 1.80.0
LLVM version: 18.1.7

Note that the original issue #101082 was fixed by an LLVM upgrade. That version didn't change between 1.79.0 and 1.80.0, but there were some additional cherry-picks: rust-lang/llvm-project@rustc-1.79.0...rustc-1.80.0

However, cargo-bisect-rustc narrowed down to something else.

Bisection

searched nightlies: from nightly-2024-04-28 to nightly-2024-10-11
regressed nightly: nightly-2024-05-26
searched commit range: 36153f1...1ba35e9
regressed commit: 48f0011 (#121571)

bisected with cargo-bisect-rustc v0.6.9

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --script test.sh --start 1.79.0
#!/bin/sh
rustc --emit=llvm-ir -O -Ctarget-cpu=x86-64-v3 -o- issue-101082.rs | FileCheck issue-101082.rs

@rustbot modify labels: +A-codegen +regression-from-stable-to-stable -regression-untriaged

@cuviper cuviper added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Oct 11, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. A-codegen Area: Code generation regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Oct 11, 2024
@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Oct 11, 2024
@cuviper
Copy link
Member Author

cuviper commented Oct 11, 2024

Part of #121571 added UB checks on indexing, and AIUI these survive to LLVM IR due to core's #![rustc_preserve_ub_checks], though we expect it to still be optimized away. I'm guessing that x86-64-v3 enables some transformation that disrupts the overall optimization, which may be buggy or just unfortunate.

Some of that was undone in #126299, but the changes in index_range.rs remain. In a quick test reverting those that are definitely superfluous, I do get back to full optimization here. I'll prepare a PR for that, but there's still probably an LLVM improvement to be had too.

cuviper added a commit to cuviper/rust that referenced this issue Oct 11, 2024
`IndexRange::len` is justified as an overall invariant, and
`take_prefix` and `take_suffix` are justified by local branch
conditions. A few more UB-checked calls remain in cases that are only
supported locally by `debug_assert!`, which won't do anything in
distributed builds, so those UB checks may still be useful.

We generally expect core's `#![rustc_preserve_ub_checks]` to optimize
away in user's release builds, but the mere presence of that extra code
can sometimes inhibit optimization, as seen in rust-lang#131563.
@saethlin
Copy link
Member

saethlin commented Oct 11, 2024

and AIUI these survive to LLVM IR due to core's #![rustc_preserve_ub_checks]

They shouldn't survive. That attribute is supposed to get them past MIR optimizations, but when we lower to LLVM IR with debug assertions disabled there should be at most a zombie br bbX.

I added the A-LLVM label because I suspect that the desirable sequence of LLVM behavior here was relying on MIR inlining. It wouldn't be the first time, and impeding the MIR inliner is the primary thing that these assertions do when disabled.

This produces that bad IR:

RUSTFLAGS="-Ctarget-cpu=x86-64-v3 --emit=llvm-ir" cargo +nightly b --release -Zbuild-std --target=x86_64-unknown-linux-gnu

And this produces the good IR (the normal inline-mir-hint-threshold is 100):

RUSTFLAGS="-Zinline-mir-hint-threshold=1000 -Ctarget-cpu=x86-64-v3 --emit=llvm-ir" cargo +nightly b --release -Zbuild-std --target=x86_64-unknown-linux-gnu

cuviper added a commit to cuviper/rust that referenced this issue Oct 11, 2024
`IndexRange::len` is justified as an overall invariant, and
`take_prefix` and `take_suffix` are justified by local branch
conditions. A few more UB-checked calls remain in cases that are only
supported locally by `debug_assert!`, which won't do anything in
distributed builds, so those UB checks may still be useful.

We generally expect core's `#![rustc_preserve_ub_checks]` to optimize
away in user's release builds, but the mere presence of that extra code
can sometimes inhibit optimization, as seen in rust-lang#131563.
@cuviper
Copy link
Member Author

cuviper commented Oct 11, 2024

They shouldn't survive. That attribute is supposed to get them past MIR optimizations, but when we lower to LLVM IR with debug assertions disabled there should be at most a zombie br bbX.

Well, it looks like a lot more here with -Cno-prepopulate-passes: https://rust.godbolt.org/z/n8reE6nzG

But yes, I would still hope that LLVM could chew through it, since it does with other CPUs. AFAICS our IR does not change from the target-cpu, apart from the expected function attributes.

@saethlin
Copy link
Member

The IR size difference for <core::ops::index_range::IndexRange as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut is due to the use of the unchecked math functions in both library/core/src/slice/index.rs and library/core/src/ops/index_range.rs.

The IR for that link looks a lot better in nightly, I wonder if that's #129283. #126299 is also helping even if you just go to 1.81 I think.

Also the fact that there's a UB check in that IR at all is a bug. I'm looking into it.

@saethlin
Copy link
Member

Filed #131578

@saethlin
Copy link
Member

I've locally added enough post-mono MIR optimizations to grind down the -Cno-prepopulate-passes LLVM IR for the IndexRange as SliceIndex>::get_unchecked to:

; <core::ops::index_range::IndexRange as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut
; Function Attrs: inlinehint nonlazybind uwtable
define internal { ptr, i64 } @"_ZN104_$LT$core..ops..index_range..IndexRange$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$17get_unchecked_mut17hfdf5440029c89a21E"(i64 noundef %0, i64 noundef %1, ptr noundef %slice.0, i64 noundef %slice.1) unnamed_addr #0 {
start:
  %self = alloca [16 x i8], align 8
  store i64 %0, ptr %self, align 8
  %2 = getelementptr inbounds i8, ptr %self, i64 8
  store i64 %1, ptr %2, align 8
  %offset = load i64, ptr %self, align 8, !noundef !3
  %3 = getelementptr inbounds i8, ptr %self, i64 8
  %self1 = load i64, ptr %3, align 8, !noundef !3
  %len = sub nuw i64 %self1, %offset
  %ptr = getelementptr inbounds i64, ptr %slice.0, i64 %offset
  %4 = insertvalue { ptr, i64 } poison, ptr %ptr, 0
  %5 = insertvalue { ptr, i64 } %4, i64 %len, 1
  ret { ptr, i64 } %5
}

As far as I can tell, this is at least as good input to LLVM as we provided on 1.79, but we still have a missed optimization with -Ctarget-cpu=x86-64-v3.

I've also tried your branch. I cannot even find this function in the LLVM IR, because in your branch it always gets inlined in MIR, which is basically what I was speculating originally about LLVM relying on MIR inlining.

So I think there's a deeper problem here with LLVM on this target-cpu, and this change to the standard library is papering over the reproducer we have for it.

@nikic
Copy link
Contributor

nikic commented Oct 13, 2024

Yeah, this looks like a phase ordering problem to me. The code is vectorized, then unrolled, and then SROA doesn't fold away the alloca because of the masked loads, which are only converted to plain loads by a later InstCombine run.

Easiest way to fix it is probably to support masked loads in SROA.

Ideally we would fully unroll the loop though, but it looks like we can't determine the trip count at that point: https://llvm.godbolt.org/z/q1jzfsYeT

The issue is that IndexRange::next() generates a conditional IV increment.

@apiraino
Copy link
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 15, 2024
@cuviper
Copy link
Member Author

cuviper commented Oct 15, 2024

@nikic is there enough info for an LLVM issue on this?

bors added a commit to rust-lang-ci/rust that referenced this issue Oct 17, 2024
Avoid superfluous UB checks in `IndexRange`

`IndexRange::len` is justified as an overall invariant, and
`take_prefix` and `take_suffix` are justified by local branch
conditions. A few more UB-checked calls remain in cases that are only
supported locally by `debug_assert!`, which won't do anything in
distributed builds, so those UB checks may still be useful.

We generally expect core's `#![rustc_preserve_ub_checks]` to optimize
away in user's release builds, but the mere presence of that extra code
can sometimes inhibit optimization, as seen in rust-lang#131563.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
Projects
None yet
Development

No branches or pull requests

5 participants