Make Vec derefing inlinable #52704

Closed
wants to merge 1 commit into from

Conversation

@koute
Member

koute commented Jul 25, 2018

Vec derefing is consuming a non-trivial amount of time (a few percent of the whole runtime) in one of my apps when building without LTO, so it should be worthwhile to mark these methods #[inline].
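
Roughly, the change adds #[inline] to Vec's Deref (and presumably DerefMut) impl; the same pattern, sketched here on a hypothetical wrapper type for illustration (not the actual liballoc source):

use std::ops::{Deref, DerefMut};

// Hypothetical wrapper used only to illustrate the pattern; the actual
// change annotates the Deref/DerefMut impls of Vec<T> itself in liballoc.
pub struct ByteBuf(Vec<u8>);

impl Deref for ByteBuf {
    type Target = [u8];

    // Without #[inline], callers in other crates (or in other CGUs when
    // ThinLTO isn't available) are stuck with an out-of-line call to this
    // trivial method.
    #[inline]
    fn deref(&self) -> &[u8] {
        &self.0
    }
}

impl DerefMut for ByteBuf {
    #[inline]
    fn deref_mut(&mut self) -> &mut [u8] {
        &mut self.0
    }
}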

@rust-highfive
Collaborator

r? @shepmaster

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) label on Jul 25, 2018
@shepmaster
Member

Have you tested these changes? Can you provide a before/after benchmark?

I was under the impression that any function with a generic parameter was automatically #[inline], as that's basically how monomorphization works.

@koute
Member Author

koute commented Jul 25, 2018

Hmm... right, good point; in which case perhaps those should actually be #[inline(always)]?

I managed to reproduce the non-inlining behavior with upstream Rust in a toy benchmark; I'll retest with the methods marked #[inline], and if that doesn't affect the results I'll try #[inline(always)].

@hanna-kruppe
Contributor

These methods are so trivial that it seems very unlikely to me that inline(always) would make any difference here.

What seems more likely to me is that the adage "generic functions are always inlinable" isn't completely true anymore. IIUC, #48779 made it so that if e.g. Vec<u8>::deref is translated in an upstream crate, then Vec<u8>::deref won't be copied into your own crate's code by default and hence won't be available for inlining.

@koute
Member Author

koute commented Jul 25, 2018

I can confirm that this change does make a difference.

Before:

test bench_deref ... bench:       4,127 ns/iter (+/- 130)

After:

test bench_deref ... bench:       1,576 ns/iter (+/- 98)

The benchmark:

#![feature(test)]

extern crate test;
use std::ops::Deref;
use test::Bencher;

#[bench]
fn bench_deref(b: &mut Bencher) {
    let vec: Vec<u32> = (0..1000).into_iter().collect();
    b.iter(|| {
        let mut sum: u32 = 0;
        for index in 0..vec.len() {
            let slice = vec.deref();
            sum = sum.wrapping_add(slice[index]);
        }
        sum
    });
}

and the profile from Cargo.toml:

[profile.bench]
opt-level = 3
debug = true
lto = false
codegen-units = 16
incremental = true

I've verified with a profiler that the deref isn't inlined before, and is inlined after.

What's also interesting is that if I change incremental to false, this is the result I get (both before the change and after):

test bench_deref ... bench:          49 ns/iter (+/- 2)

So it looks like enabling incremental compilation completely kills the compiler's ability to inline generics which don't have the #[inline] annotation.

(Also, in this particular benchmark we can see that <T as core::convert::TryFrom<U>>::try_from is not getting inlined either, which is most likely where the remaining ~1µs of overhead comes from; it'd probably be worthwhile to mark that as #[inline] too.)

@jonas-schievink
Contributor

@koute Yeah, @rkruppe mentioned that #48779 is probably at fault: it reuses monomorphizations as soon as incremental compilation is turned on, so disabling it is a workaround (but you also have to use -O2 or -O3).

@shepmaster
Member

I'll loop in @alexcrichton and @michaelwoerister due to #48779 then. In this case, I'm guessing that the answer is going to be "if you want maximum performance, you can disable incremental compilation" (i.e., great for your redistributed, published binary that you create once a month).

@alexcrichton
Member

It's a known bug in the compiler that incremental release builds perform very poorly compared to normal release builds. While #[inline] helps the situation in specific cases like this, it doesn't help in general, and ThinLTO is required for that. @michaelwoerister is working on incremental ThinLTO, but for now I do not think we should take patches like this, as there are unfortunately just too many that would need to be taken (and they can be detrimental in some situations).

@koute
Member Author

koute commented Jul 26, 2018

I'm guessing that the answer is going to be "if you want maximum performance, you can disable incremental compilation"

This is one of those rare-ish cases where I need reasonable performance in my dev builds (otherwise the resulting binary is too slow at runtime to be useful for anything), and since those are the dev builds I also need reasonably fast recompilation.

I just checked, and currently derefing a Vec takes 11% of the total CPU time in my app, which is huge for a method that basically does nothing!

for now I do not think we should take patches like this, as there are unfortunately just too many that would need to be taken

I somewhat disagree with this, considering we already have over two thousand #[inline] annotations in std and core. In cases like this one, where the methods are so simple and so widely used, they basically should always be inlined in any optimized build. We shouldn't need any form of LTO to get the most basic zero-cost abstractions like these optimized out!

@michaelwoerister
Member

What seems more likely to me is that the adage "generic functions are always inlinable" isn't completely true anymore

This hasn't been true for quite a while now, at least for a year or maybe two.

This is one of those rare-ish cases where I need reasonable performance in my dev builds

Incremental ThinLTO should indeed help with that. But it's a non-trivial feature and I'm currently tasked with working on other things, so don't keep your fingers crossed for it to appear in the compiler in the next few weeks.

I just checked, and currently derefing a Vec takes 11% of the total CPU time in my app, which is huge for a method that basically does nothing!

Personally, I'm fine with adding #[inline] in the particular cases of this PR.

@alexcrichton
Member

@koute the fact of the matter is that incremental release builds simply aren't tuned for performance right now. The standard library and large chunks of the Rust ecosystem are written in such a way that if one crate is split into multiple CGUs, then ThinLTO is required to recover the performance lost to missing inlining.

Multiple codegen units are enabled by default in release builds today, and ThinLTO is there to recover the performance. The ThinLTO passes are disabled in incremental mode as they aren't compatible with incremental compilation. As @michaelwoerister mentions, enabling this is a significant chunk of work.

The fact that Vec::deref takes so much time in your application is indeed worrying, but it's expected. If this specific instance is fixed, it'll be a never-ending stream of PRs continually adding #[inline] annotations to other collections-related methods. Adding #[inline] is not a fix, it's a local hack.

@koute
Member Author

koute commented Jul 26, 2018

The fact that Vec::deref takes so much time in your application is indeed worrying, but it's expected. If this specific instance is fixed, it'll be a never-ending stream of PRs continually adding #[inline] annotations to other collections-related methods. Adding #[inline] is not a fix, it's a local hack.

@alexcrichton So could you please explain how this is any different from any of the other two thousand #[inline] annotations we already have in std and core, and from the PRs which added them? Many of the current #[inline]s are even attached to generic functions, exactly like these ones here.

It was always the case that #[inline] annotations were necessary to get something inlined across crates without LTO, and people used to add them on a regular basis. The fact that you didn't need to do that for generics was just a "magic" implementation detail. Now, with incremental compilation, this isn't true anymore, but suddenly it's not okay to add new #[inline]s?

I mean absolutely no offence, but, for example, why was adding this okay (I picked the commit at random; I didn't deliberately pick one of yours):

commit d01427f53e81a24e1c49d1672f37d4809c04a004
Author: Alex Crichton <[email protected]>
Date:   Fri Oct 20 11:40:04 2017 -0700

    [test] Add some `#[inline]` to `HashMap`

diff --git a/src/libstd/collections/hash/map.rs b/src/libstd/collections/hash/map.rs
index 3c0fa0860d..d96a4f40d3 100644
--- a/src/libstd/collections/hash/map.rs
+++ b/src/libstd/collections/hash/map.rs
@@ -1113,6 +1113,7 @@ impl<K, V, S> HashMap<K, V, S>
     /// assert_eq!(map.get(&2), None);
     /// ```
     #[stable(feature = "rust1", since = "1.0.0")]
+    #[inline]
     pub fn get<Q: ?Sized>(&self, k: &Q) -> Option<&V>
         where K: Borrow<Q>,
               Q: Hash + Eq
@@ -2554,6 +2555,7 @@ impl<K, S, Q: ?Sized> super::Recover<Q> for HashMap<K, (), S>
 {
     type Key = K;
 
+    #[inline]
     fn get(&self, key: &Q) -> Option<&K> {
         self.search(key).into_occupied_bucket().map(|bucket| bucket.into_refs().0)
     }
@@ -2566,6 +2568,7 @@ impl<K, S, Q: ?Sized> super::Recover<Q> for HashMap<K, (), S>
         self.search_mut(key).into_occupied_bucket().map(|bucket| pop_internal(bucket).0)
     }
 
+    #[inline]
     fn replace(&mut self, key: K) -> Option<K> {
         self.reserve(1);
 

but adding the same here (especially when it has a measurable practical impact!) is not acceptable anymore? If the answer to that is "well, we didn't have ThinLTO planned back then", then I need to ask: since we don't need #[inline]s with LTO to make something inlinable, are we going to remove all of the current #[inline]s and reject every further PR which adds them? Otherwise it feels like a double standard.

I apologize if I seem somewhat brash, but it is a little disheartening to see such resistance to a seemingly harmless and uncontroversial change such as this when it's not even the first of its kind. ):

@sfackler
Member

I agree with @koute here, I think: we've had a ~constant stream of PRs inlining things as necessary for years.

@alexcrichton
Member

The #[inline] attribute does two things:

  • It forces any CGU that uses the function to translate the function into that CGU. This is what enables "cross-crate inlining", but it also matters for multi-codegen-unit scenarios.
  • It applies an "inline hint" to LLVM, indicating that it should think more strongly about inlining the tagged function.

Thus, if you have a function like fn foo() {}, then the only way to get cross-crate inlining is to tag it with #[inline] (assuming LTO isn't used). Attaching the #[inline] annotation can have adverse side effects, though. Forcing LLVM to codegen more functions can increase compile times, as LLVM has to crunch through more code. Additionally, if LLVM would otherwise deduce that a function shouldn't be inlined, nudging it with an extra hint may produce worse-performing runtime code, for example because the result is too large.

Translation of generics is slightly different. If you have a function like fn foo<T>() {}, then we can't generate code for it up front; only when it's used does the compiler generate code. This means that usage of generics enables cross-crate inlining by default. Note, however, that the "inlining" here is different from the #[inline] tag: using a generic translates it once per crate, not into every CGU. If two CGUs in a crate reference a generic function, it's translated into one CGU and both CGUs reference the same definition.
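
To illustrate the two cases with a sketch (a hypothetical upstream library crate, simplified):

// Non-generic: without #[inline], downstream crates only see an external
// symbol for this function, so its body can't be inlined there unless
// (Thin)LTO runs.
#[inline]
pub fn double(x: u32) -> u32 {
    x * 2
}

// Generic: no #[inline] needed for cross-crate inlining, because code is
// only generated when the function is used, in the using crate; but it is
// translated only once per crate, into a single CGU, not into every CGU.
pub fn double_generic<T: std::ops::Add<Output = T> + Copy>(x: T) -> T {
    x + x
}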

Now, why haven't we applied this patch already? It's terrible if Vec::deref isn't inlined, as the performance numbers here show! The key is that, in release mode, after the compiler translates and optimizes all CGUs, it performs ThinLTO. The purpose of ThinLTO is to optimize a set of CGUs "as if" they were one CGU with respect to inlining; the ThinLTO passes primarily allow inlining across CGUs within a crate. This means that Vec::deref already ends up inlined everywhere: each time you use it in a crate it's translated once, all usages refer to the same definition (possibly in another CGU, which by itself would prevent inlining), and ThinLTO then typically inlines the Vec::deref calls across CGUs because it knows they're small and profitable to inline.

Ok, so why is this patch needed? As mentioned before, incremental release builds do not run ThinLTO. That means the inlining does not happen across all CGUs in a crate, because rustc only translates Vec::deref into one CGU per crate.


Ok, so why have we had a "stream of #[inline] PRs" up to this point, and why do I not want to merge this one? I believe each merged PR falls into one of these buckets:

  • The PR has been tested and shown performance gains. @koute, if you look at the PR in question, I wrote a very long and detailed comment explaining why the #[inline] annotations were needed; I tested the addition and it showed clear improvements over the previous code.
  • The PR was inlining something that should be cross-crate inlined but couldn't be. We've got lots of little functions in the standard library that should be inlined, we forget to tag them all #[inline], and PRs fix that. This covers things like String::deref which, without #[inline], would not be inlined across crates and clearly needs to be.
  • The PR simply didn't get enough review. Most additions of #[inline] don't consider the downsides and receive very little review.

To me, this PR falls in none of those buckets. While it may naively appear to fall in the first bucket (tested to show performance gains), I do not consider it to be in that bucket. It is a bug in the compiler that incremental release builds are so slow. There are likely hundreds of tiny generic methods in the standard library which we have explicitly not tagged with #[inline], all of which would need to be tagged to get good performance out of the current system of incremental release builds.

We've trained Rust programmers since the beginning that generics imply efficient cross-crate inlining, no #[inline] necessary. That rustc doesn't deliver that today for multi-CGU release builds is a bug. It will be fixed by enabling ThinLTO in incremental release builds, which is a work in progress but will take some time.

@shepmaster
Member

We've trained Rust programmers since the beginning that generics imply efficient cross-crate inlining, no #[inline] necessary

Count me in that bucket, and I like to think I have a reasonable grasp on the language...

Since this is now above my paygrade...

r? @alexcrichton

@alexcrichton
Member

@koute do you have thoughts on my comment above?

@koute
Member Author

koute commented Aug 7, 2018

@alexcrichton Thanks for the detailed explanation! It did make things a lot clearer.

I agree with your reasoning, although purely personally I would still merge this in, at least temporarily.

There are probably a lot of such functions in the stdlib, but I actually don't believe we need to mark very many of them #[inline] to have a measurable impact. Case in point: I've just profiled the benchmarks for the regex crate with incremental compilation turned on and without LTO, and the uninlined Vec::deref takes 17% of their runtime on my machine. Most crates I looked at burn cycles in a similar small set of non-inlined functions.

I think it's worthwhile to add the annotations for the most egregious cases like this, which impact the majority of the ecosystem, especially as @michaelwoerister said it's going to take a while to get ThinLTO working with incremental compilation. But now I understand why you don't want it merged, so feel free to close this PR.

There's one other tangentially related thing I'd like to ask: if we have ThinLTO to inline functions across CGUs, then wouldn't it be reasonable (after ThinLTO works with incremental builds) to make #[inline] functions get translated into only a single CGU rather than into every CGU, just as non-#[inline]'d generics currently are?

@michaelwoerister
Member

There's one other tangentially related thing I'd like to ask: if we have ThinLTO to inline functions across CGUs, then wouldn't it be reasonable (after ThinLTO works with incremental builds) to make #[inline] functions get translated into only a single CGU rather than into every CGU, just as non-#[inline]'d generics currently are?

I suspect that this is something we'll look into. There's still the question of cross-crate inlining: right now, #[inline] functions are available for inlining even if they are defined in another crate, whereas (by default) ThinLTO only performs inlining across the CGUs of the current crate. It's a solvable problem, though.

@alexcrichton
Member

Ok! In that case yeah I'm going to close this for now.

@alexcrichton
Member

#53673, which implements incremental ThinLTO, has now merged. This, for example, fixes the benchmark in #52704 (comment).
