Cloning a 1MB vector is 30x slower than cloning a 1MB ~str #13472
I think this is probably #11751. I don't think we need to change any of the vector code; it's a fixable bug.
#11015 might also be related.
(Actually, I suppose I'd want
I'm looking into this (and the related bugs) and it seems that LLVM is suspicious of the write to the vector length during the copy loop and therefore doesn't convert the loop to a `memcpy`. I'm currently coaxing the relevant code to allow LLVM to optimise to a `memcpy` where it can.
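As a rough illustration (my own sketch, not the actual libstd code of the time), this is the shape of loop being discussed: pushing element by element stores the new length on every iteration, and that extra store is what keeps LLVM's loop-idiom recognizer from collapsing the loop into a `memcpy`:

```rust
// Hypothetical sketch of a push-style clone loop (not the real Vec::clone).
fn clone_by_push(src: &[u8]) -> Vec<u8> {
    let mut dst = Vec::with_capacity(src.len());
    for &byte in src {
        // Every push compares the length against the capacity and then writes
        // the new length back, so the loop body is more than a plain run of
        // element stores.
        dst.push(byte);
    }
    dst
}
```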
@Aatch: What if you mark the
@thestinger I think it might be more than that. The bounds checks seem to cause the memcpy recogniser to fail, even though they actually get optimised out anyway.
AFAIK, there are no bounds checks here. It may be hitting the issue that leaves a redundant null check in the loop, due to a missed optimization in LLVM. That's caused by the lack of a way to tell LLVM that a pointer is non-null, but I expect LLVM could do better even without it. I guess you mean the comparison of the length against the capacity in
(If we want to, we can also rewrite the vec code to use the
Using
So I did the pragmatic thing and just made the code itself slightly better. There's no special-casing of types, so it's still fully general. I'm just making sure it still passes the tests now, but the benchmarks on my machine are: Current
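The patch itself isn't reproduced here, so as a hedged sketch of the general shape of such a change (still fully generic over `T`, with no special-casing of types), the idea is to write the cloned elements through a raw pointer and update the length once, after the loop, so that for simple element types the loop body reduces to something LLVM can recognise as a `memcpy`:

```rust
use std::ptr;

// Illustrative only: a generic clone that keeps the length update out of the
// loop. If `clone()` panics part-way through, the already-written elements are
// simply leaked (the length is still 0), so there is no undefined behaviour.
fn clone_general<T: Clone>(src: &[T]) -> Vec<T> {
    let mut dst: Vec<T> = Vec::with_capacity(src.len());
    let out = dst.as_mut_ptr();
    for (i, item) in src.iter().enumerate() {
        unsafe { ptr::write(out.add(i), item.clone()) };
    }
    // Single length update, outside the loop.
    unsafe { dst.set_len(src.len()) };
    dst
}
```

For an element type like `u8`, the optimiser can then turn the whole loop into one bulk copy, which is the kind of improvement the before/after numbers in the pull request description below show.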
LLVM wasn't recognising the loops as memcpy loops and was therefore failing to optimise them properly. While improving LLVM is the "proper" way to fix this, I think that these cases are important enough to warrant a little low-level optimisation. Fixes #13472

r? @thestinger

---

Benchmark Results:

```
--- Before ---
test clone_owned          ... bench: 6126104 ns/iter (+/- 285962) = 170 MB/s
test clone_owned_to_owned ... bench: 6125054 ns/iter (+/- 271197) = 170 MB/s
test clone_str            ... bench:   80586 ns/iter (+/- 11489) = 13011 MB/s
test clone_vec            ... bench: 3903220 ns/iter (+/- 658556) = 268 MB/s
test test_memcpy          ... bench:   69401 ns/iter (+/- 2168) = 15108 MB/s

--- After ---
test clone_owned          ... bench:   70839 ns/iter (+/- 4931) = 14801 MB/s
test clone_owned_to_owned ... bench:   70286 ns/iter (+/- 4836) = 14918 MB/s
test clone_str            ... bench:   78519 ns/iter (+/- 5511) = 13353 MB/s
test clone_vec            ... bench:   71415 ns/iter (+/- 1999) = 14682 MB/s
test test_memcpy          ... bench:   70980 ns/iter (+/- 2126) = 14772 MB/s
```
Benchmark program at the end. The results:
That comes out to 300 MB/s… really bad for a memory copy. I'm guessing this has to do with the fact that `Vec<T>` is generic. Its `clone` clones element by element in a loop, and LLVM probably isn't smart enough to optimize this down to a `memcpy`. But anyway, there is a need for efficient vectors of primitive numerical types. If this can't happen by optimization magic, we need something like a special-cased clone for primitive element types (untested). I can also imagine a language feature which would let you write such a specialization directly. This is worryingly close to C++ template specialization, but might be worth it for core data structures. It's really counterintuitive if you need to use a special vector type or a special clone method to get acceptable performance on vectors of primitive integers.
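The code snippets from the original post are not preserved in this extract; as a hypothetical illustration of the kind of special-cased clone being asked for, a byte buffer can be duplicated with a single bulk copy instead of an element-by-element loop:

```rust
use std::ptr;

// Hypothetical example: clone a byte buffer with one bulk copy.
// (In current Rust, `src.to_vec()` compiles down to essentially this.)
fn clone_bytes_fast(src: &[u8]) -> Vec<u8> {
    let mut dst = Vec::with_capacity(src.len());
    unsafe {
        // One memcpy of the whole buffer, then set the length once.
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
        dst.set_len(src.len());
    }
    dst
}
```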
Bonus weirdness: if you comment out `clone_owned`, then `clone_owned_to_owned` gets significantly faster (though still way too slow). If you comment out `clone_owned_to_owned` instead, nothing in particular happens.

Here's the benchmark program:
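The benchmark source itself is missing from this extract. As a rough, hypothetical reconstruction on a modern nightly toolchain (function names borrowed from the results above; buffer size, contents, and the use of `Vec<u8>`/`String` in place of the era's `~[u8]`/`~str` are all assumptions), it might look like:

```rust
#![feature(test)]
extern crate test;

use test::Bencher;

const LEN: usize = 1024 * 1024; // 1 MB, assumed

#[bench]
fn clone_vec(b: &mut Bencher) {
    // Clone a 1 MB byte vector on every iteration.
    let v: Vec<u8> = vec![0u8; LEN];
    b.bytes = LEN as u64;
    b.iter(|| v.clone());
}

#[bench]
fn clone_str(b: &mut Bencher) {
    // Clone a 1 MB string on every iteration, for comparison.
    let s: String = "x".repeat(LEN);
    b.bytes = LEN as u64;
    b.iter(|| s.clone());
}
```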