
[perf experiment] Ignore inline(always) in unoptimized builds #121417

Closed
saethlin wants to merge 1 commit into rust-lang:master from saethlin:no-opt-no-inline

Conversation

saethlin
Member

Yes I know we have a codegen test for this. But based on this perf run I'm concerned this is having unexpected perf implications so I want to measure what they are: #121369 (comment)

r? @ghost
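
The diff itself is not visible in this conversation view. As a rough illustration of the kind of change being measured, here is a self-contained sketch (hypothetical names, not the actual patch): downgrade `inline(always)` to an ordinary inline hint whenever codegen runs at `-Copt-level=0`.

```rust
// Illustrative sketch only -- not the PR's real code. The idea under test:
// at -Copt-level=0, #[inline(always)] no longer maps to LLVM `alwaysinline`.
#[derive(Clone, Copy, PartialEq)]
enum OptLevel { No, Optimized }

#[derive(Clone, Copy)]
enum InlineAttr { None, Hint, Always, Never }

#[derive(Clone, Copy, PartialEq)]
enum LlvmInlineAttr { InlineHint, AlwaysInline, NoInline }

fn llvm_inline_attr(opt: OptLevel, inline: InlineAttr) -> Option<LlvmInlineAttr> {
    match inline {
        // The experiment: in unoptimized builds, inline(always) becomes a hint.
        InlineAttr::Always if opt == OptLevel::No => Some(LlvmInlineAttr::InlineHint),
        InlineAttr::Always => Some(LlvmInlineAttr::AlwaysInline),
        InlineAttr::Hint => Some(LlvmInlineAttr::InlineHint),
        InlineAttr::Never => Some(LlvmInlineAttr::NoInline),
        InlineAttr::None => None,
    }
}
```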

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels on Feb 21, 2024
@saethlin
Member Author

@bors try @rust-timer queue

@rustbot added the S-waiting-on-perf (Status: Waiting on a perf run to be completed.) label on Feb 21, 2024
@saethlin removed the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label on Feb 21, 2024
@bors
Contributor

bors commented Feb 21, 2024

⌛ Trying commit 7a29818 with merge 393ef12...

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 21, 2024
[perf experiment] Ignore inline(always) in unoptimized builds

Yes I know we have a codegen test for this. But based on this perf run I'm concerned this is having unexpected perf implications so I want to measure what they are: rust-lang#121369 (comment)

r? `@ghost`
@rust-log-analyzer
Collaborator

The job x86_64-gnu-llvm-16 failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
GITHUB_ENV=/home/runner/work/_temp/_runner_file_commands/set_env_f7399dcc-57d6-406b-bdc5-3a3508673bf1
GITHUB_EVENT_NAME=pull_request
GITHUB_EVENT_PATH=/home/runner/work/_temp/_github_workflow/event.json
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
GITHUB_HEAD_REF=no-opt-no-inline
GITHUB_JOB=pr
GITHUB_PATH=/home/runner/work/_temp/_runner_file_commands/add_path_f7399dcc-57d6-406b-bdc5-3a3508673bf1
GITHUB_REF=refs/pull/121417/merge
GITHUB_REF_NAME=121417/merge
GITHUB_REF_PROTECTED=false
---
#12 writing image sha256:943f04602bd649d1b31cb5133d3394653cdbb758d5a05d5021c19524730c733e done
#12 naming to docker.io/library/rust-ci done
#12 DONE 9.8s
##[endgroup]
Setting extra environment values for docker:  --env ENABLE_GCC_CODEGEN=1 --env GCC_EXEC_PREFIX=/usr/lib/gcc/
[CI_JOB_NAME=x86_64-gnu-llvm-16]
##[group]Clock drift check
  local time: Wed Feb 21 22:46:43 UTC 2024
  network time: Wed, 21 Feb 2024 22:46:43 GMT
  network time: Wed, 21 Feb 2024 22:46:43 GMT
##[endgroup]
sccache: Starting the server...
##[group]Configure the build
configure: processing command line
configure: 
configure: build.configure-args := ['--build=x86_64-unknown-linux-gnu', '--llvm-root=/usr/lib/llvm-16', '--enable-llvm-link-shared', '--set', 'rust.thin-lto-import-instr-limit=10', '--set', 'change-id=99999999', '--enable-verbose-configure', '--enable-sccache', '--disable-manage-submodules', '--enable-locked-deps', '--enable-cargo-native-static', '--set', 'rust.codegen-units-std=1', '--set', 'dist.compression-profile=balanced', '--dist-compression-formats=xz', '--set', 'build.optimized-compiler-builtins', '--disable-dist-src', '--release-channel=nightly', '--enable-debug-assertions', '--enable-overflow-checks', '--enable-llvm-assertions', '--set', 'rust.verify-llvm-ir', '--set', 'rust.codegen-backends=llvm,cranelift,gcc', '--set', 'llvm.static-libstdcpp', '--enable-new-symbol-mangling']
configure: target.x86_64-unknown-linux-gnu.llvm-config := /usr/lib/llvm-16/bin/llvm-config
configure: llvm.link-shared     := True
configure: rust.thin-lto-import-instr-limit := 10
configure: change-id            := 99999999
---
##[endgroup]
Testing GCC stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
   Compiling y v0.1.0 (/checkout/compiler/rustc_codegen_gcc/build_system)
    Finished release [optimized] target(s) in 1.26s
     Running `/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-codegen/x86_64-unknown-linux-gnu/release/y test --use-system-gcc --use-backend gcc --out-dir /checkout/obj/build/x86_64-unknown-linux-gnu/stage1-tools/cg_gcc --release --no-default-features --mini-tests --std-tests`
Using system GCC
Using system GCC
[BUILD] example
[AOT] mini_core_hello_world
/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-tools/cg_gcc/mini_core_hello_world
abc
---
---- [run-make] tests/run-make/inline-always-many-cgu stdout ----

error: make failed
status: exit status: 2
command: cd "/checkout/tests/run-make/inline-always-many-cgu" && env -u CARGO_MAKEFLAGS -u MAKEFLAGS -u MFLAGS -u RUSTFLAGS AR="ar" CC="cc -ffunction-sections -fdata-sections -fPIC -m64" CXX="c++ -ffunction-sections -fdata-sections -fPIC -m64" HOST_RPATH_DIR="/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" LD_LIB_PATH_ENVVAR="LD_LIBRARY_PATH" LLVM_BIN_DIR="/usr/lib/llvm-16/bin" LLVM_COMPONENTS="aarch64 aarch64asmparser aarch64codegen aarch64desc aarch64disassembler aarch64info aarch64utils aggressiveinstcombine all all-targets amdgpu amdgpuasmparser amdgpucodegen amdgpudesc amdgpudisassembler amdgpuinfo amdgputargetmca amdgpuutils analysis arm armasmparser armcodegen armdesc armdisassembler arminfo armutils asmparser asmprinter avr avrasmparser avrcodegen avrdesc avrdisassembler avrinfo binaryformat bitreader bitstreamreader bitwriter bpf bpfasmparser bpfcodegen bpfdesc bpfdisassembler bpfinfo cfguard codegen core coroutines coverage debuginfocodeview debuginfodwarf debuginfogsym debuginfologicalview debuginfomsf debuginfopdb demangle dlltooldriver dwarflinker dwarflinkerparallel dwp engine executionengine extensions filecheck frontendhlsl frontendopenacc frontendopenmp fuzzercli fuzzmutate globalisel hexagon hexagonasmparser hexagoncodegen hexagondesc hexagondisassembler hexagoninfo instcombine instrumentation interfacestub interpreter ipo irprinter irreader jitlink lanai lanaiasmparser lanaicodegen lanaidesc lanaidisassembler lanaiinfo libdriver lineeditor linker loongarch loongarchasmparser loongarchcodegen loongarchdesc loongarchdisassembler loongarchinfo lto m68k m68kasmparser m68kcodegen m68kdesc m68kdisassembler m68kinfo mc mca mcdisassembler mcjit mcparser mips mipsasmparser mipscodegen mipsdesc mipsdisassembler mipsinfo mirparser msp430 msp430asmparser msp430codegen msp430desc msp430disassembler msp430info native nativecodegen nvptx nvptxcodegen nvptxdesc nvptxinfo objcarcopts objcopy object objectyaml option orcjit orcshared orctargetprocess passes perfjitevents powerpc powerpcasmparser powerpccodegen powerpcdesc powerpcdisassembler powerpcinfo profiledata remarks riscv riscvasmparser riscvcodegen riscvdesc riscvdisassembler riscvinfo riscvtargetmca runtimedyld scalaropts selectiondag sparc sparcasmparser sparccodegen sparcdesc sparcdisassembler sparcinfo support symbolize systemz systemzasmparser systemzcodegen systemzdesc systemzdisassembler systemzinfo tablegen target targetparser textapi transformutils ve veasmparser vecodegen vectorize vedesc vedisassembler veinfo webassembly webassemblyasmparser webassemblycodegen webassemblydesc webassemblydisassembler webassemblyinfo webassemblyutils windowsdriver windowsmanifest x86 x86asmparser x86codegen x86desc x86disassembler x86info x86targetmca xcore xcorecodegen xcoredesc xcoredisassembler xcoreinfo xray" LLVM_FILECHECK="/usr/lib/llvm-16/bin/FileCheck" NODE="/usr/bin/node" PYTHON="/usr/bin/python3" RUSTC="/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" RUSTDOC="/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustdoc" RUST_BUILD_STAGE="stage2-x86_64-unknown-linux-gnu" S="/checkout" TARGET="x86_64-unknown-linux-gnu" TARGET_RPATH_DIR="/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" TMPDIR="/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-make/inline-always-many-cgu/inline-always-many-cgu" "make"
--- stdout -------------------------------
LD_LIBRARY_PATH="/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-make/inline-always-many-cgu/inline-always-many-cgu:/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib:/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-bootstrap-tools/x86_64-unknown-linux-gnu/release/deps:/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/lib" '/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc' --out-dir /checkout/obj/build/x86_64-unknown-linux-gnu/test/run-make/inline-always-many-cgu/inline-always-many-cgu -L /checkout/obj/build/x86_64-unknown-linux-gnu/test/run-make/inline-always-many-cgu/inline-always-many-cgu  -Ainternal_features foo.rs --emit llvm-ir -C codegen-units=2
if cat /checkout/obj/build/x86_64-unknown-linux-gnu/test/run-make/inline-always-many-cgu/inline-always-many-cgu/*.ll | "/checkout/src/etc/cat-and-grep.sh" -e '\bcall\b'; then \
 echo "found call instruction when one wasn't expected"; \
Build completed unsuccessfully in 0:34:28
fi
fi
[[[ begin stdout ]]]
; ModuleID = 'foo.5be5606e1f6aa79b-cgu.0'
source_filename = "foo.5be5606e1f6aa79b-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; foo::a::foo
; Function Attrs: inlinehint nonlazybind uwtable
define internal void @_ZN3foo1a3foo17h9ed28b0d5896c05dE() unnamed_addr #0 {
  ret void
}


; Function Attrs: nonlazybind uwtable
define void @bar() unnamed_addr #1 {
start:
; call foo::a::foo
  call void @_ZN3foo1a3foo17h9ed28b0d5896c05dE()
  ret void
}


attributes #0 = { inlinehint nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #1 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!"rustc version 1.78.0-nightly (e244ff2c7 2024-02-21)"}
; ModuleID = 'foo.5be5606e1f6aa79b-cgu.1'
source_filename = "foo.5be5606e1f6aa79b-cgu.1"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; foo::a::bar
; Function Attrs: nonlazybind uwtable
define void @_ZN3foo1a3bar17h11af45a6c59c3fd7E() unnamed_addr #0 {
  ret void
}


attributes #0 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!"rustc version 1.78.0-nightly (e244ff2c7 2024-02-21)"}

[[[ end stdout ]]]
found call instruction when one wasn't expected
--- stderr -------------------------------
warning: ignoring emit path because multiple .ll files were produced
warning: 1 warning emitted

make: *** [Makefile:5: all] Error 1
------------------------------------------
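
For context, the failing test asserts that no `call` instruction survives in any codegen unit (see the grep for `\bcall\b` in the Makefile output above). Reconstructed from the symbols in the IR above (a sketch, not a verbatim copy of `tests/run-make/inline-always-many-cgu/foo.rs`), the source looks roughly like this:

```rust
#![crate_type = "lib"]

pub mod a {
    #[inline(always)]
    pub fn foo() {}

    pub fn bar() {}
}

#[no_mangle]
pub fn bar() {
    // When `alwaysinline` is honored, this call is inlined even at
    // -Copt-level=0 and neither codegen unit contains a `call` instruction.
    // With this PR the call survives (visible in the IR above), so the
    // test's grep matches and the test fails.
    a::foo();
}
```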

@bors
Contributor

bors commented Feb 22, 2024

☀️ Try build successful - checks-actions
Build commit: 393ef12 (393ef12c970fbc7f294cd96c35cb76f9591bc1d6)

@rust-timer
Collaborator

Finished benchmarking commit (393ef12): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.7% | [0.2%, 1.0%] | 5 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -8.9% | [-45.3%, -0.5%] | 19 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -6.9% | [-45.3%, 1.0%] | 24 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.0% | [1.0%, 1.0%] | 1 |
| Regressions ❌ (secondary) | 3.7% | [3.7%, 3.7%] | 1 |
| Improvements ✅ (primary) | -11.4% | [-25.0%, -3.4%] | 11 |
| Improvements ✅ (secondary) | -3.6% | [-3.6%, -3.6%] | 1 |
| All ❌✅ (primary) | -10.4% | [-25.0%, 1.0%] | 12 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.5% | [2.2%, 2.7%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -8.8% | [-43.8%, -1.1%] | 19 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -7.7% | [-43.8%, 2.7%] | 21 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.5% | [0.0%, 1.7%] | 40 |
| Regressions ❌ (secondary) | 0.8% | [0.0%, 2.2%] | 18 |
| Improvements ✅ (primary) | -3.4% | [-9.3%, -0.0%] | 29 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -1.2% | [-9.3%, 1.7%] | 69 |

Bootstrap: 649.698s -> 651.565s (0.29%)
Artifact size: 310.95 MiB -> 310.97 MiB (0.01%)

@rustbot added the perf-regression (Performance regression.) label and removed the S-waiting-on-perf (Status: Waiting on a perf run to be completed.) label on Feb 22, 2024
@saethlin
Member Author

The comment that motivated the codegen test #45201 (comment) suggests that things will break if we stop doing this. Considering how harmful inlining at -Copt-level=0 is, I'm going to challenge that statement.

@craterbot run mode=build-and-test

@craterbot
Collaborator

👌 Experiment pr-121417 created and queued.
🤖 Automatically detected try build 393ef12
🔍 You can check out the queue and this experiment's details.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot
Collaborator

🚧 Experiment pr-121417 is now running

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot
Collaborator

🎉 Experiment pr-121417 is completed!
📊 97 regressed and 113 fixed (419702 total)
📰 Open the full report.

⚠️ If you notice any spurious failure please add them to the blacklist!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label and removed the S-waiting-on-crater (Status: Waiting on a crater run to be completed.) label on Mar 3, 2024
@saethlin
Member Author

saethlin commented Mar 3, 2024

@craterbot
Collaborator

👌 Experiment pr-121417-1 created and queued.
🤖 Automatically detected try build 393ef12
🔍 You can check out the queue and this experiment's details.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot added the S-waiting-on-crater (Status: Waiting on a crater run to be completed.) label and removed the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label on Mar 3, 2024
@craterbot
Collaborator

🚧 Experiment pr-121417-1 is now running

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot
Collaborator

🎉 Experiment pr-121417-1 is completed!
📊 23 regressed and 12 fixed (97 total)
📰 Open the full report.

⚠️ If you notice any spurious failure please add them to the blacklist!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label and removed the S-waiting-on-crater (Status: Waiting on a crater run to be completed.) label on Mar 5, 2024
@saethlin closed this on May 12, 2024
@saethlin deleted the no-opt-no-inline branch on May 12, 2024 04:34
@saethlin mentioned this pull request on Sep 21, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 21, 2024
Add inline(usually)

r? `@ghost`

I'm looking into what kind of things could recover the perf improvement detected in rust-lang#121417 (comment)
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 22, 2024
Add inline(usually)

r? `@ghost`

I'm looking into what kind of things could recover the perf improvement detected in rust-lang#121417 (comment)
@RalfJung
Member

The comment that motivated the codegen test #45201 (comment) suggests that things will break if we stop doing this. Considering how harmful inlining at -Copt-level=0 is, I'm going to challenge that statement.

I wonder if that ABI concern mentioned there is the same as #116558... but that would apply only to extern "C". Maybe that comment predates extern "Rust" passing vectors indirectly.

Anyway, what Alex describes there sounds like a critical codegen bug to me, so if this still happens in today's compiler we should definitely track that.
@alexcrichton if you still remember what that issue was about many years ago, could you give a self-contained example demonstrating the problem?
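
For readers following along, the hazard class being described looks roughly like the sketch below (x86_64 only; a hedged illustration of the problem, not a confirmed reproduction of the 2017 issue).

```rust
// Sketch of the target-feature ABI hazard for by-value SIMD arguments.
// If this crate is compiled without AVX while the crate defining `sum_avx`
// is compiled with AVX (or vice versa), the two sides can disagree on
// whether the __m256 travels in a YMM register or through the stack, and
// the callee reads garbage. The default Rust ABI avoids this by passing
// SIMD types indirectly, as the rustc source quoted in the next comment shows.
use std::arch::x86_64::__m256;

extern "C" {
    // Hypothetical function, imagined to live in an object file built
    // with -Ctarget-feature=+avx.
    fn sum_avx(v: __m256) -> f32;
}

pub fn call_it(v: __m256) -> f32 {
    // The by-value extern "C" pass below is where the mismatch bites.
    unsafe { sum_avx(v) }
}
```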

@bjorn3
Member

bjorn3 commented Sep 22, 2024

The Rust ABI always passes vectors by-ref, precisely to avoid ABI issues with differing target features:

// This is a fun case! The gist of what this is doing is
// that we want callers and callees to always agree on the
// ABI of how they pass SIMD arguments. If we were to *not*
// make these arguments indirect then they'd be immediates
// in LLVM, which means that they'd used whatever the
// appropriate ABI is for the callee and the caller. That
// means, for example, if the caller doesn't have AVX
// enabled but the callee does, then passing an AVX argument
// across this boundary would cause corrupt data to show up.
//
// This problem is fixed by unconditionally passing SIMD
// arguments through memory between callers and callees
// which should get them all to agree on ABI regardless of
// target feature sets. Some more information about this
// issue can be found in #44367.
//
// Note that the intrinsic ABI is exempt here as
// that's how we connect up to LLVM and it's unstable
// anyway, we control all calls to it in libstd.
Abi::Vector { .. }
    if abi != SpecAbi::RustIntrinsic && tcx.sess.target.simd_types_indirect =>
{
    arg.make_indirect();
    return;
}
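
To make the effect of `make_indirect` concrete, a small example (the IR shapes in the comments are approximate and vary by compiler version):

```rust
// With the default Rust ABI, the SIMD parameter and return value below are
// passed through memory, so --emit=llvm-ir shows pointer parameters rather
// than an immediate <8 x float> value, and caller and callee agree on the
// calling convention regardless of their respective target features.
use std::arch::x86_64::__m256;

#[inline(never)]
pub fn passthrough(v: __m256) -> __m256 {
    // Roughly: define void @...passthrough...(ptr sret(<8 x float>) %_0, ptr %v)
    v
}
```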

@RalfJung
Member

RalfJung commented Sep 22, 2024 via email

@bjorn3
Member

bjorn3 commented Sep 22, 2024

Turns out this ABI issue was only fixed in #47743, which was opened a couple of months after the comment in question.

@RalfJung
Member

Okay, in that case we can likely consider this concern to be resolved.

@alexcrichton
Member

IIRC, yeah, the ABI fixes came after that comment, and only extern "C" is affected as you've linked, so I don't think I'm trapping any hidden knowledge in my head (yay!)

bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 23, 2024
Add inline(usually)

r? `@ghost`

I'm looking into what kind of things could recover the perf improvement detected in rust-lang#121417 (comment)
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 24, 2024
Add inline(usually)

I'm looking into what kind of things could recover the perf improvement detected in rust-lang#121417 (comment). I think it's worth spending quite a bit of effort to figure out how to capture a 45% incr-patched improvement.

As far as I can tell, the root cause of the problem is that we have taken very deliberate steps in the compiler to ensure that `#[inline(always)]`  causes inlining where possible, even when all optimizations are disabled. Some of the reasons that was done are now outdated or were misguided, but some some users still have a legitimate use for the current behavior, `@bjorn3` says:

> Unlike other targets the mere presence of a simd instruction is not allowed if the wasm runtime doesn't support simd. Other targets merely require it to never be executed at runtime.

I'm quite sure that the majority of users applied this attribute believing it does not cause inlining in unoptimized builds, or didn't appreciate the build time regressions that implies and would prefer it didn't if they knew. (if that's you, put a heart on this or say something elsewhere, don't reply on this PR)

I am going to _try_ to use the existing benchmark suite to evaluate a number of different approaches and take notes here, and hopefully I can collect enough data to shape any conversation about what we can do to help users.

The core of this PR is `InlineAttr::Usually` (the name doesn't matter), which ensures that, when optimizations are enabled, the function is inlined (usual exceptions like recursion apply). I think most users believe this is what `#[inline(always)]` does.

rust-lang#130685 (comment) Replaced `#[inline(always)]` with `#[inline(usually)]` in the standard library, and did not recover the same 45% incr-patched improvement in regex. It's a tidy net positive though, and I suspect that perf improvement would normally be big enough to motivate merging a change. I think that means the standard library's use of `#[inline(always)]` is imposing marginal compile time overhead on the ecosystem, but the bigger opportunities are probably in third-party crates.

rust-lang#130679 (comment) Treats `#[inline(always)]` as `#[inline(usually)]` literally everywhere; this gets the desired incr-patched improvement but suffers quite a few check and doc regressions. I think that means that `alwaysinline` is more powerful than `function-inline-cost=0` in LLVM.

rust-lang#130679 (comment) Treats `#[inline(always)]` as `#[inline(usually)]` when `-Copt-level=0`, which looks basically the same as rust-lang#121417 (comment) (omit `alwaysinline` when doing `-Copt-level=0` codegen).

TODO: Try function-inline-cost = -1000, perhaps penalties can add up and snuff inline(usually)?
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 26, 2024
Add inline(usually)

As far as I can tell, the root cause of the problem is that we have taken very deliberate steps in the compiler to ensure that `#[inline(always)]` causes inlining where possible, even when all optimizations are disabled. Some of the reasons this was done are now outdated or were misguided. I think the only remaining use case is where the inlined body, even without optimizations, is cheaper to codegen or call; for example, SIMD intrinsics may require a lot of code to put their arguments on the stack, which is slow to compile and run.

rust-lang#130679 (comment) replaces `alwaysinline` with a very negative inline cost, and it still has check and doc regressions. More investigation required on what the different inlining decision is.

rust-lang#130679 (comment) is a likely explanation of this, with some interesting implications; adding `inline(always)` to a function that was going to be inlined anyway can change optimizations (usually it seems to improve things?).

TODO: stm32f4 and to a lesser extent bitmaps seem to compile slower and to larger binaries when we treat `inline(always)` as `inline(usually)`. Is that because of this? https://github.com/rust-lang/rust/blob/9e394f551c050ff03c6fc57f190e0761cf0be6e8/compiler/rustc_middle/src/mir/mono.rs#L141 If it's not, what happens if we infer `alwaysinline` for extremely small functions like most of those in stm32f4?
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 27, 2024
Add inline(usually)

rust-lang#130679 (comment) makes `#[inline(usually)]` also defy instantiation mode selection and always be LocalCopy the way `#[inline(always)]` does, but still has regressions in stm32f4. I think that proves that `alwaysinline` can actually improve debug build times.

TODO: What happens if we infer `alwaysinline` for extremely small functions like most of those in stm32f4?
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 28, 2024
Add inline(usually)

bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 29, 2024
Add inline(usually)

rust-lang#130679 (comment) infers `alwaysinline` for extremely trivial functions, but still has regressions for stm32f4. But of course it does, I left `inline(always)` treated as `inline(usually)` which slows down the compiler 🤦 inconclusive perf run.

TODO: What happens if we infer `alwaysinline` for extremely small functions like most of those in stm32f4?
bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 29, 2024
Add inline(usually)

rust-lang#130679 (comment) doesn't have any stm32f4 regressions 🥳 I think this means that there is some threshold where `alwaysinline` produces faster debug builds.

So still two questions:
1. Why does `alwaysinline` sometimes make debug builds faster?
2. Is there any obvious threshold at which adding `alwaysinline` causes more work for debug builds?
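
One cheap way to poke at question 2 would be a size threshold on the MIR body when deciding whether to infer `alwaysinline`; a hedged sketch using approximate rustc-internal names (a hypothetical helper, not code from any of the linked PRs):

```rust
// Hypothetical heuristic: call a function "extremely small" (a candidate for
// inferred alwaysinline even in debug builds) if its MIR is a single block
// with only a few statements and no calls. The types and fields here are
// approximations of rustc_middle::mir and may not match the current tree.
fn trivial_enough_for_alwaysinline(body: &rustc_middle::mir::Body<'_>) -> bool {
    use rustc_middle::mir::TerminatorKind;
    body.basic_blocks.len() == 1
        && body.basic_blocks.iter().all(|bb| {
            bb.statements.len() <= 3
                && !matches!(bb.terminator().kind, TerminatorKind::Call { .. })
        })
}
```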
Labels
perf-regression (Performance regression.), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)