Regression: `tests/codegen/issues/issue-101082.rs` fails with `-Ctarget-cpu=x86-64-v3` #131563

cuviper · 2024-10-11T18:44:31Z

Code

I tried this test with -Ctarget-cpu=x86-64-v3 (which we have on by default in the upcoming RHEL 10):

tests/codegen/issues/issue-101082.rs

//@ compile-flags: -O

#![crate_type = "lib"]

#[no_mangle]
pub fn test() -> usize {
    // CHECK-LABEL: @test(
    // CHECK: ret {{i64|i32}} 165
    let values = [23, 16, 54, 3, 60, 9];
    let mut acc = 0;
    for item in values {
        acc += item;
    }
    acc
}

I expected to see this happen: FileCheck pass

Instead, this happened:

issue-101082.rs:8:12: error: CHECK: expected string not found in input

As of rustc 1.83.0-nightly (52fd99839 2024-10-10), that LLVM IR is:

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(inaccessiblemem: write) uwtable
define noundef i64 @test() unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %iter = alloca [64 x i8], align 8
  call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %iter)
  store i64 23, ptr %iter, align 8
  %_3.sroa.0.sroa.4.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 8
  store i64 16, ptr %_3.sroa.0.sroa.4.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.5.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 16
  store i64 54, ptr %_3.sroa.0.sroa.5.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.6.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 24
  store i64 3, ptr %_3.sroa.0.sroa.6.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.7.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 32
  store i64 60, ptr %_3.sroa.0.sroa.7.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.8.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 40
  store i64 9, ptr %_3.sroa.0.sroa.8.0.iter.sroa_idx, align 8
  %unmaskedload = load <4 x i64>, ptr %iter, align 8, !alias.scope !3
  %0 = getelementptr inbounds i8, ptr %iter, i64 32
  %unmaskedload10 = load <4 x i64>, ptr %0, align 8, !alias.scope !3
  %1 = add <4 x i64> %unmaskedload10, %unmaskedload
  %2 = shufflevector <4 x i64> %1, <4 x i64> %unmaskedload, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %3 = tail call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %2)
  call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %iter)
  ret i64 %3
}

Reducing to x86-64-v2 does get the expected output:

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(inaccessiblemem: write) uwtable
define noundef i64 @test() unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  ret i64 165
}

Version it worked on

It most recently worked on: Rust 1.79.0

Version with regression

rustc --version --verbose:

rustc 1.80.0 (051478957 2024-07-21)
binary: rustc
commit-hash: 051478957371ee0084a7c0913941d2a8c4757bb9
commit-date: 2024-07-21
host: x86_64-unknown-linux-gnu
release: 1.80.0
LLVM version: 18.1.7

Note that the original issue #101082 was fixed by an LLVM upgrade. That version didn't change between 1.79.0 and 1.80.0, but there were some additional cherry-picks: rust-lang/llvm-project@rustc-1.79.0...rustc-1.80.0

However, cargo-bisect-rustc narrowed down to something else.

Bisection

searched nightlies: from nightly-2024-04-28 to nightly-2024-10-11
regressed nightly: nightly-2024-05-26
searched commit range: 36153f1...1ba35e9
regressed commit: 48f0011 (#121571)

bisected with cargo-bisect-rustc v0.6.9

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --script test.sh --start 1.79.0

#!/bin/sh
rustc --emit=llvm-ir -O -Ctarget-cpu=x86-64-v3 -o- issue-101082.rs | FileCheck issue-101082.rs

@rustbot modify labels: +A-codegen +regression-from-stable-to-stable -regression-untriaged

The text was updated successfully, but these errors were encountered:

cuviper · 2024-10-11T21:22:03Z

Part of #121571 added UB checks on indexing, and AIUI these survive to LLVM IR due to core's #![rustc_preserve_ub_checks], though we expect it to still be optimized away. I'm guessing that x86-64-v3 enables some transformation that disrupts the overall optimization, which may be buggy or just unfortunate.

Some of that was undone in #126299, but the changes in index_range.rs remain. In a quick test reverting those that are definitely superfluous, I do get back to full optimization here. I'll prepare a PR for that, but there's still probably an LLVM improvement to be had too.

`IndexRange::len` is justified as an overall invariant, and `take_prefix` and `take_suffix` are justified by local branch conditions. A few more UB-checked calls remain in cases that are only supported locally by `debug_assert!`, which won't do anything in distributed builds, so those UB checks may still be useful. We generally expect core's `#![rustc_preserve_ub_checks]` to optimize away in user's release builds, but the mere presence of that extra code can sometimes inhibit optimization, as seen in rust-lang#131563.

saethlin · 2024-10-11T23:18:50Z

and AIUI these survive to LLVM IR due to core's #![rustc_preserve_ub_checks]

They shouldn't survive. That attribute is supposed to get them past MIR optimizations, but when we lower to LLVM IR with debug assertions disabled there should be at most a zombie br bbX.

I added the A-LLVM label because I suspect that the desirable sequence of LLVM behavior here was relying on MIR inlining. It wouldn't be the first time, and impeding the MIR inliner is the primary thing that these assertions do when disabled.

This produces that bad IR:

RUSTFLAGS="-Ctarget-cpu=x86-64-v3 --emit=llvm-ir" cargo +nightly b --release -Zbuild-std --target=x86_64-unknown-linux-gnu

And this produces the good IR (the normal inline-mir-hint-threshold is 100):

RUSTFLAGS="-Zinline-mir-hint-threshold=1000 -Ctarget-cpu=x86-64-v3 --emit=llvm-ir" cargo +nightly b --release -Zbuild-std --target=x86_64-unknown-linux-gnu

`IndexRange::len` is justified as an overall invariant, and `take_prefix` and `take_suffix` are justified by local branch conditions. A few more UB-checked calls remain in cases that are only supported locally by `debug_assert!`, which won't do anything in distributed builds, so those UB checks may still be useful. We generally expect core's `#![rustc_preserve_ub_checks]` to optimize away in user's release builds, but the mere presence of that extra code can sometimes inhibit optimization, as seen in rust-lang#131563.

cuviper · 2024-10-11T23:43:19Z

They shouldn't survive. That attribute is supposed to get them past MIR optimizations, but when we lower to LLVM IR with debug assertions disabled there should be at most a zombie br bbX.

Well, it looks like a lot more here with -Cno-prepopulate-passes: https://rust.godbolt.org/z/n8reE6nzG

But yes, I would still hope that LLVM could chew through it, since it does with other CPUs. AFAICS our IR does not change from the target-cpu, apart from the expected function attributes.

saethlin · 2024-10-12T00:53:16Z

The IR size difference for <core::ops::index_range::IndexRange as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut is due to the use of the unchecked math functions in both library/core/src/slice/index.rs and library/core/src/ops/index_range.rs.

The IR for that link looks a lot better in nightly, I wonder if that's #129283. #126299 is also helping even if you just go to 1.81 I think.

Also the fact that there's a UB check in that IR at all is a bug. I'm looking into it.

saethlin · 2024-10-12T01:08:28Z

Filed #131578

saethlin · 2024-10-12T20:38:24Z

I've locally added enough post-mono MIR optimizations to grind down the -Cno-prepopulate-passes LLVM IR for the IndexRange as SliceIndex>::get_unchecked to:

; <core::ops::index_range::IndexRange as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut
; Function Attrs: inlinehint nonlazybind uwtable
define internal { ptr, i64 } @"_ZN104_$LT$core..ops..index_range..IndexRange$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$17get_unchecked_mut17hfdf5440029c89a21E"(i64 noundef %0, i64 noundef %1, ptr noundef %slice.0, i64 noundef %slice.1) unnamed_addr #0 {
start:
  %self = alloca [16 x i8], align 8
  store i64 %0, ptr %self, align 8
  %2 = getelementptr inbounds i8, ptr %self, i64 8
  store i64 %1, ptr %2, align 8
  %offset = load i64, ptr %self, align 8, !noundef !3
  %3 = getelementptr inbounds i8, ptr %self, i64 8
  %self1 = load i64, ptr %3, align 8, !noundef !3
  %len = sub nuw i64 %self1, %offset
  %ptr = getelementptr inbounds i64, ptr %slice.0, i64 %offset
  %4 = insertvalue { ptr, i64 } poison, ptr %ptr, 0
  %5 = insertvalue { ptr, i64 } %4, i64 %len, 1
  ret { ptr, i64 } %5
}

As far as I can tell, this is at least as good input to LLVM as we provided on 1.79, but we still have a missed optimization with -Ctarget-cpu=x86-64-v3.

I've also tried your branch. I cannot even find this function in the LLVM IR, because in your branch it always gets inlined in MIR, which is basically what I was speculating originally about LLVM relying on MIR inlining.

So I think there's a deeper problem here with LLVM on this target-cpu, and this change to the standard library is papering over the reproducer we have for it.

nikic · 2024-10-13T14:45:50Z

Yeah, this looks like a phase ordering problem to me. The code is vectorized, then unrolled, and then SROA doesn't fold away the alloca because of the masked loads, which are only converted to plain loads by a later InstCombine run.

Easiest way to fix it is probably to support masked loads in SROA.

Ideally we would fully unroll the loop though, but it looks like we can't determine the trip count at that point: https://llvm.godbolt.org/z/q1jzfsYeT

The issue is that IndexRange::next() generates a conditional IV increment.

apiraino · 2024-10-15T09:43:46Z

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

cuviper · 2024-10-15T18:06:28Z

@nikic is there enough info for an LLVM issue on this?

Avoid superfluous UB checks in `IndexRange` `IndexRange::len` is justified as an overall invariant, and `take_prefix` and `take_suffix` are justified by local branch conditions. A few more UB-checked calls remain in cases that are only supported locally by `debug_assert!`, which won't do anything in distributed builds, so those UB checks may still be useful. We generally expect core's `#![rustc_preserve_ub_checks]` to optimize away in user's release builds, but the mere presence of that extra code can sometimes inhibit optimization, as seen in rust-lang#131563.

cuviper added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Oct 11, 2024

saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Oct 11, 2024

cuviper mentioned this issue Oct 11, 2024

Avoid superfluous UB checks in IndexRange #131572

Merged

rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression: `tests/codegen/issues/issue-101082.rs` fails with `-Ctarget-cpu=x86-64-v3` #131563

Regression: `tests/codegen/issues/issue-101082.rs` fails with `-Ctarget-cpu=x86-64-v3` #131563

cuviper commented Oct 11, 2024

cuviper commented Oct 11, 2024

saethlin commented Oct 11, 2024 •

edited

Loading

cuviper commented Oct 11, 2024

saethlin commented Oct 12, 2024

saethlin commented Oct 12, 2024

saethlin commented Oct 12, 2024

nikic commented Oct 13, 2024

apiraino commented Oct 15, 2024

cuviper commented Oct 15, 2024

Regression: tests/codegen/issues/issue-101082.rs fails with -Ctarget-cpu=x86-64-v3 #131563

Regression: tests/codegen/issues/issue-101082.rs fails with -Ctarget-cpu=x86-64-v3 #131563

Comments

cuviper commented Oct 11, 2024

Code

Version it worked on

Version with regression

Bisection

cuviper commented Oct 11, 2024

saethlin commented Oct 11, 2024 • edited Loading

cuviper commented Oct 11, 2024

saethlin commented Oct 12, 2024

saethlin commented Oct 12, 2024

saethlin commented Oct 12, 2024

nikic commented Oct 13, 2024

apiraino commented Oct 15, 2024

cuviper commented Oct 15, 2024

Regression: `tests/codegen/issues/issue-101082.rs` fails with `-Ctarget-cpu=x86-64-v3` #131563

Regression: `tests/codegen/issues/issue-101082.rs` fails with `-Ctarget-cpu=x86-64-v3` #131563

saethlin commented Oct 11, 2024 •

edited

Loading