Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in 1.84 for inline-able function in closure #137273

Open
findepi opened this issue Feb 19, 2025 · 8 comments
Open

Performance regression in 1.84 for inline-able function in closure #137273

findepi opened this issue Feb 19, 2025 · 8 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@findepi
Copy link

findepi commented Feb 19, 2025

Code

I tried this code:

#[macro_use]
extern crate criterion;

use criterion::Criterion;
use rand::Rng;

// #[inline(never)]
pub fn sum_all(v: &[i32]) -> i32 {
    v.iter().fold(0i32, |a, b| curried_add(a)(*b))
}
fn curried_add(a: i32) -> Box<dyn Fn(i32) -> i32> {
    Box::new(move |b| a + b)
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("bg");
    let vals = generate_array(100_000);
    group.bench_function("sum", |b| {
        b.iter(|| {
            let sum = sum_all(criterion::black_box(&vals));
            criterion::black_box(sum);
        })
    });
}

fn generate_array(len: usize) -> Vec<i32> {
    let mut rng = rand::rng();
    (0..len).map(|_| rng.random_range(0..100)).collect()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

I expected to see this happen: ~26 µs reported by cargo bench

Instead, this happened: ~1.4 ms reported by cargo bench

Version it worked on

It most recently worked on: 1.83.0

$ cargo clean; caffeinate -sd cargo +1.84.1 bench -q -- --save-baseline 1.84.1 && caffeinate -sd cargo +1.83.0 bench -q -- --save-baseline 1.83.0
     Removed 1043 files, 228.6MiB total
Benchmarking bg/sum: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60.
bg/sum                  time:   [1.3761 ms 1.4096 ms 1.4473 ms]
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low severe
  16 (16.00%) high severe

bg/sum                  time:   [26.609 µs 26.640 µs 26.679 µs]
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

Version with regression

rustc --version --verbose:

rustc 1.84.1 (e71f9a9a9 2025-01-27)
binary: rustc
commit-hash: e71f9a9a98b0faf423844bf0ba7438f29dc27d58
commit-date: 2025-01-27
host: aarch64-apple-darwin
release: 1.84.1
LLVM version: 19.1.5

Backtrace

no compiler crash

#inline

If I uncomment #[inline(never)] line, the performance is back to normal under 1.84.

bisecting output

searched nightlies: from nightly-2024-10-13 to nightly-2024-11-22
regressed nightly: nightly-2024-11-04
searched commit range: b3f75cc...b8c8287
regressed commit: e3a918e

bisected with cargo-bisect-rustc v0.6.9

Host triple: aarch64-apple-darwin
Reproduce with:

cargo bisect-rustc --start 1.83.0 --end 1.84.1 --prompt -- bench -q

e3a918e comes from #132542

@findepi findepi added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Feb 19, 2025
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Feb 19, 2025
@apiraino
Copy link
Contributor

apiraino commented Feb 19, 2025

cc @RalfJung
(but I'm not sure about the bisection - I wonder how #132542 can impact inlining)

IIUC the code that regresses performance is:

// #[inline(never)]
pub fn sum_all(v: &[i32]) -> i32 {
    v.iter().fold(0i32, |a, b| curried_add(a)(*b))
}
fn curried_add(a: i32) -> Box<dyn Fn(i32) -> i32> {
    Box::new(move |b| a + b)
}

@RalfJung
Copy link
Member

RalfJung commented Feb 19, 2025

I guess it could pass some threshold in some heuristic somewhere? That can happen with any tiny change though. It is very odd though that adding inline(never) improves performance, or did I misunderstand the report?

Note that the inlining handling for const_panic! changed a bit after that PR so the diff you see there does not reflect the latest rustc any more.

Cc @saethlin @scottmcm

@RalfJung
Copy link
Member

That PR only changes very few functions that are not already inline(never): clamp, encode_utf16_raw, encode_utf8_raw. I doubt any of them is used in the example above. So I have no explanation how that PR could possibly affect this code.

@saethlin
Copy link
Member

It is very odd though that adding inline(never) improves performance, or did I misunderstand the report?

Yes, inline(never) is faster here. With asterisks. All inline attributes perform the same with RUSTFLAGS=-Ccodegen-units=1 or

[profile.bench]
codegen-units = 1

I don't think a bisection is informative here, all that PR did was probably perturb CGU partitioning to cause the bad partitioning.

This kind of behavior smells like IPSCCP https://llvm.org/docs/Passes.html#ipsccp-interprocedural-sparse-conditional-constant-propagation

@saethlin
Copy link
Member

Also, it looks like adding #[inline] to both curried_add and sum_all seems to produce the fast benchmark result. No idea if that's stable wrt other build config perturbations, though I'd expect it is.

@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Feb 19, 2025
@jieyouxu jieyouxu added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 22, 2025
@apiraino
Copy link
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Feb 24, 2025
@DianQK
Copy link
Member

DianQK commented Feb 26, 2025

It seems that criterion::bencher::Bencher<M>::iter is not inlining my_benchmark::curried_add::{{closure}}. Could this be related to the CGU partitioning? But both sum_all and curried_add are inlining my_benchmark::curried_add::{{closure}}.

@saethlin
Copy link
Member

saethlin commented Feb 26, 2025

Closures should always get InstantiationMode::LocalCopy, so even in a debug build they should get internal and a copy should be placed in every CGU that calls them directly.

So normally CGU partitioning doesn't block inlining of a closure. Perhaps the Box + dyn makes that not true?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants