Performance regression in 1.84 for inline-able function in closure #137273

findepi · 2025-02-19T12:36:10Z

Code

I tried this code:

#[macro_use]
extern crate criterion;

use criterion::Criterion;
use rand::Rng;

// #[inline(never)]
pub fn sum_all(v: &[i32]) -> i32 {
    v.iter().fold(0i32, |a, b| curried_add(a)(*b))
}
fn curried_add(a: i32) -> Box<dyn Fn(i32) -> i32> {
    Box::new(move |b| a + b)
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("bg");
    let vals = generate_array(100_000);
    group.bench_function("sum", |b| {
        b.iter(|| {
            let sum = sum_all(criterion::black_box(&vals));
            criterion::black_box(sum);
        })
    });
}

fn generate_array(len: usize) -> Vec<i32> {
    let mut rng = rand::rng();
    (0..len).map(|_| rng.random_range(0..100)).collect()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

I expected to see this happen: ~26 µs reported by cargo bench

Instead, this happened: ~1.4 ms reported by cargo bench

Version it worked on

It most recently worked on: 1.83.0

$ cargo clean; caffeinate -sd cargo +1.84.1 bench -q -- --save-baseline 1.84.1 && caffeinate -sd cargo +1.83.0 bench -q -- --save-baseline 1.83.0
     Removed 1043 files, 228.6MiB total
Benchmarking bg/sum: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60.
bg/sum                  time:   [1.3761 ms 1.4096 ms 1.4473 ms]
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low severe
  16 (16.00%) high severe

bg/sum                  time:   [26.609 µs 26.640 µs 26.679 µs]
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

Version with regression

rustc --version --verbose:

rustc 1.84.1 (e71f9a9a9 2025-01-27)
binary: rustc
commit-hash: e71f9a9a98b0faf423844bf0ba7438f29dc27d58
commit-date: 2025-01-27
host: aarch64-apple-darwin
release: 1.84.1
LLVM version: 19.1.5

Backtrace

no compiler crash

`#inline`

If I uncomment #[inline(never)] line, the performance is back to normal under 1.84.

bisecting output

searched nightlies: from nightly-2024-10-13 to nightly-2024-11-22
regressed nightly: nightly-2024-11-04
searched commit range: b3f75cc...b8c8287
regressed commit: e3a918e

bisected with cargo-bisect-rustc v0.6.9

Host triple: aarch64-apple-darwin
Reproduce with:

cargo bisect-rustc --start 1.83.0 --end 1.84.1 --prompt -- bench -q

e3a918e comes from #132542

The text was updated successfully, but these errors were encountered:

apiraino · 2025-02-19T15:56:11Z

cc @RalfJung
(but I'm not sure about the bisection - I wonder how #132542 can impact inlining)

IIUC the code that regresses performance is:

// #[inline(never)]
pub fn sum_all(v: &[i32]) -> i32 {
    v.iter().fold(0i32, |a, b| curried_add(a)(*b))
}
fn curried_add(a: i32) -> Box<dyn Fn(i32) -> i32> {
    Box::new(move |b| a + b)
}

RalfJung · 2025-02-19T16:08:05Z

I guess it could pass some threshold in some heuristic somewhere? That can happen with any tiny change though. It is very odd though that adding inline(never) improves performance, or did I misunderstand the report?

Note that the inlining handling for const_panic! changed a bit after that PR so the diff you see there does not reflect the latest rustc any more.

Cc @saethlin @scottmcm

RalfJung · 2025-02-19T16:11:06Z

That PR only changes very few functions that are not already inline(never): clamp, encode_utf16_raw, encode_utf8_raw. I doubt any of them is used in the example above. So I have no explanation how that PR could possibly affect this code.

saethlin · 2025-02-19T16:33:18Z

It is very odd though that adding inline(never) improves performance, or did I misunderstand the report?

Yes, inline(never) is faster here. With asterisks. All inline attributes perform the same with RUSTFLAGS=-Ccodegen-units=1 or

[profile.bench]
codegen-units = 1

I don't think a bisection is informative here, all that PR did was probably perturb CGU partitioning to cause the bad partitioning.

This kind of behavior smells like IPSCCP https://llvm.org/docs/Passes.html#ipsccp-interprocedural-sparse-conditional-constant-propagation

saethlin · 2025-02-19T16:36:54Z

Also, it looks like adding #[inline] to both curried_add and sum_all seems to produce the fast benchmark result. No idea if that's stable wrt other build config perturbations, though I'd expect it is.

apiraino · 2025-02-24T15:18:45Z

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

DianQK · 2025-02-26T02:59:00Z

It seems that criterion::bencher::Bencher<M>::iter is not inlining my_benchmark::curried_add::{{closure}}. Could this be related to the CGU partitioning? But both sum_all and curried_add are inlining my_benchmark::curried_add::{{closure}}.

saethlin · 2025-02-26T03:59:42Z

Closures should always get InstantiationMode::LocalCopy, so even in a debug build they should get internal and a copy should be placed in every CGU that calls them directly.

So normally CGU partitioning doesn't block inlining of a closure. Perhaps the Box + dyn makes that not true?

findepi added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Feb 19, 2025

rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Feb 19, 2025

jieyouxu added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 22, 2025

rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Feb 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression in 1.84 for inline-able function in closure #137273

Performance regression in 1.84 for inline-able function in closure #137273

findepi commented Feb 19, 2025 •

edited

Loading

apiraino commented Feb 19, 2025 •

edited

Loading

RalfJung commented Feb 19, 2025 •

edited

Loading

RalfJung commented Feb 19, 2025

saethlin commented Feb 19, 2025

saethlin commented Feb 19, 2025

apiraino commented Feb 24, 2025

DianQK commented Feb 26, 2025

saethlin commented Feb 26, 2025 •

edited

Loading

Performance regression in 1.84 for inline-able function in closure #137273

Performance regression in 1.84 for inline-able function in closure #137273

Comments

findepi commented Feb 19, 2025 • edited Loading

Code

Version it worked on

Version with regression

Backtrace

#inline

bisecting output

apiraino commented Feb 19, 2025 • edited Loading

RalfJung commented Feb 19, 2025 • edited Loading

RalfJung commented Feb 19, 2025

saethlin commented Feb 19, 2025

saethlin commented Feb 19, 2025

apiraino commented Feb 24, 2025

DianQK commented Feb 26, 2025

saethlin commented Feb 26, 2025 • edited Loading

findepi commented Feb 19, 2025 •

edited

Loading

`#inline`

apiraino commented Feb 19, 2025 •

edited

Loading

RalfJung commented Feb 19, 2025 •

edited

Loading

saethlin commented Feb 26, 2025 •

edited

Loading