Cloning a 1MB vector is 30x slower than cloning a 1MB ~str #13472

Closed
kmcallister opened this issue Apr 12, 2014 · 11 comments · Fixed by #13539
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@kmcallister (Contributor)

Benchmark program at the end. The results:

$ rustc -v
rustc 0.11-pre (cee9a83 2014-04-11 15:54:46 -0700)
host: x86_64-unknown-linux-gnu

$ rustc -O --test foo.rs && ./foo --bench

running 5 tests
test clone_owned          ... bench:   5319322 ns/iter (+/- 166831)
test clone_owned_to_owned ... bench:   5293984 ns/iter (+/- 125331)
test clone_str            ... bench:     85526 ns/iter (+/- 1333)
test clone_vec            ... bench:   3332139 ns/iter (+/- 17227)
test test_memcpy          ... bench:     85931 ns/iter (+/- 563)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured

That comes out to about 300 MB/s, which is really bad for a memory copy. I'm guessing this has to do with the fact that Vec<T> is generic. Its clone looks like

impl<T:Clone> Clone for Vec<T> {
    fn clone(&self) -> Vec<T> {
        self.iter().map(|x| x.clone()).collect()
    }
}

and LLVM probably isn't smart enough to optimize this down to a memcpy. But anyway, there is a need for efficient vectors of primitive numerical types. If this can't happen by optimization magic, we need something like

impl<T: POD> Vec<T> {
    fn fast_clone(&self) -> Vec<T> {
        let mut vector = Vec::with_capacity(self.len());
        unsafe {
            vector.set_len(self.len());
            vector.as_mut_slice().copy_memory(self.as_slice());
        }
        vector
    }
}
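In modern Rust the same idea needs no unsafe code at all: `extend_from_slice` on a `Vec<T: Copy>` lowers to a single memcpy. A minimal sketch (using today's API, not the 2014 one above):

```rust
// Sketch in modern Rust: for Copy element types, extend_from_slice
// compiles down to one bulk memcpy, no per-element clone loop.
fn fast_clone<T: Copy>(v: &[T]) -> Vec<T> {
    let mut out = Vec::with_capacity(v.len());
    out.extend_from_slice(v); // memcpy path for T: Copy
    out
}

fn main() {
    let src = vec![0x78u8; 16];
    assert_eq!(fast_clone(&src), src);
}
```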

(untested). I can also imagine a language feature which would let you write

impl<T: Clone> Clone for Vec<T> {
    fn clone(&self) -> Vec<T> {
        if implements_trait!(T, POD) {
            // ...
        } else {
            // ...
        }
    }
}

This is worryingly close to C++ template specialization, but might be worth it for core data structures. It's really counterintuitive if you need to use a special vector type or a special clone method to get acceptable performance on vectors of primitive integers.
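On stable Rust today, the closest runnable approximation of the `implements_trait!` idea is a runtime `TypeId` check (limited to `'static` types). The sketch below is purely illustrative — `clone_maybe_fast` is a hypothetical helper, and real compile-time specialization would erase the branch entirely:

```rust
use std::any::TypeId;

// Hedged illustration of the proposed implements_trait! branch, using a
// runtime TypeId check instead of compile-time specialization.
fn clone_maybe_fast<T: Clone + 'static>(v: &[T]) -> Vec<T> {
    if TypeId::of::<T>() == TypeId::of::<u8>() {
        // T is exactly u8 here, so the pointer cast is sound.
        let bytes: &[u8] =
            unsafe { std::slice::from_raw_parts(v.as_ptr() as *const u8, v.len()) };
        let copied = bytes.to_vec(); // memcpy path
        // Move the Vec<u8> back to Vec<T>; the two types are identical
        // at runtime, so read + forget transfers ownership without a drop.
        let out = unsafe { std::ptr::read(&copied as *const Vec<u8> as *const Vec<T>) };
        std::mem::forget(copied);
        out
    } else {
        v.iter().cloned().collect() // element-wise fallback
    }
}

fn main() {
    assert_eq!(clone_maybe_fast(&[0x78u8; 4]), vec![0x78u8; 4]);
    assert_eq!(clone_maybe_fast(&["a".to_string()]), vec!["a".to_string()]);
}
```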

Bonus weirdness: If you comment out clone_owned then clone_owned_to_owned gets significantly faster (though still way too slow):

running 4 tests
test clone_owned_to_owned ... bench:   3355442 ns/iter (+/- 69257)
test clone_str            ... bench:     78866 ns/iter (+/- 5433)
test clone_vec            ... bench:   3346685 ns/iter (+/- 134001)
test test_memcpy          ... bench:     85116 ns/iter (+/- 3570)

If you comment out clone_owned_to_owned instead, nothing in particular happens.

Here's the benchmark program:

extern crate test;
extern crate libc;

use test::{Bencher, black_box};
use libc::size_t;
use std::slice;

static size: uint = 1024*1024;

#[bench]
fn clone_str(bh: &mut Bencher) {
    let mut x = StrBuf::with_capacity(size);
    for _ in range(0, size) {
        x.push_char('x');
    }
    let x: ~str = x.into_owned();
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_vec(bh: &mut Bencher) {
    let mut x: Vec<u8> = Vec::with_capacity(size);
    for _ in range(0, size) {
        x.push(0x78);
    }
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_owned(bh: &mut Bencher) {
    let mut x: ~[u8] = slice::with_capacity(size);
    for _ in range(0, size) {
        x.push(0x78);
    }
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_owned_to_owned(bh: &mut Bencher) {
    let mut x: ~[u8] = slice::with_capacity(size);
    for _ in range(0, size) {
        x.push(0x78);
    }
    let y = x.to_owned();
    bh.iter(|| black_box(y.clone()));
}

extern {
    fn memcpy(dest: *mut u8, src: *u8, n: size_t);
}

#[bench]
fn test_memcpy(bh: &mut Bencher) {
    let src = ~[0x78_u8, ..size];
    let mut dst = ~[0_u8, ..size];
    bh.iter(|| {
        unsafe {
            memcpy(dst.as_mut_ptr(), src.as_ptr(), size as u64);
        }
    })
}
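The program above uses long-removed APIs (`test::Bencher`, `~str`, `~[T]`, `StrBuf`). A rough modern-Rust analogue — an assumption on my part, using plain `Instant` timing rather than the original harness — looks like:

```rust
use std::hint::black_box;
use std::time::Instant;

// Hedged modern re-creation of the clone_vec / clone_str benchmarks.
const SIZE: usize = 1024 * 1024;

fn bench<F: FnMut()>(name: &str, mut f: F) {
    const ITERS: u32 = 100;
    let t = Instant::now();
    for _ in 0..ITERS {
        f();
    }
    println!("{:<10} {:?}/iter", name, t.elapsed() / ITERS);
}

fn main() {
    let vec_src = vec![0x78u8; SIZE];
    let str_src = "x".repeat(SIZE);
    assert_eq!(vec_src.len(), SIZE);
    bench("clone_vec", || {
        black_box(vec_src.clone()); // today this is a single memcpy
    });
    bench("clone_str", || {
        black_box(str_src.clone());
    });
}
```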
@thestinger (Contributor)

I think this is probably #11751. I don't think we need to change any of the vector code; it's a fixable bug.

@huonw huonw added the I-slow label Apr 12, 2014
@thestinger (Contributor)

#11015 might also be related.

@kmcallister (Contributor, Author)

(Actually, I suppose I'd want fast_to_owned and then fast_clone is a one-liner using that.)

@adrientetar (Contributor)

Pod was renamed to Copy, fwiw.

@Aatch (Contributor) commented Apr 15, 2014

I'm looking into this (and the related bugs) and it seems that LLVM is suspicious of the write to the vector length during the copy loop and therefore doesn't convert the loop to a memcpy. I'm currently coaxing the relevant code to allow LLVM to optimise to a memcpy where it can.

@thestinger (Contributor)

@Aatch: What if you mark the &self pointer as noalias? It would be correct to have rustc mark all &T pointers to types without Unsafe inside that way.

@Aatch Aatch self-assigned this Apr 15, 2014
@Aatch (Contributor) commented Apr 15, 2014

@thestinger I think it might be more than that. The bounds checks seem to cause the memcpy recogniser to fail, even though they actually get optimised out anyway.

@thestinger (Contributor)

AFAIK, there are no bounds checks here. It may be hitting the issue that leaves a redundant null check in the loop, due to a missed optimization in LLVM. That's caused by the lack of a way to tell LLVM a pointer is non-null, but I expect LLVM could do better even without it.

I guess you mean the comparison of the length against the capacity in push?

@nikomatsakis (Contributor)

(If we want to, we can also rewrite the vec code to use the ty_is_pod intrinsic or whatever it's called)

@thestinger (Contributor)

Using Copy will only fix a subset of this problem, because many types are not Copy but still have a trivial Clone implementation. The bug will still need to remain open if that's the attempted fix.

@Aatch (Contributor) commented Apr 15, 2014

So I did the pragmatic thing and just made the code itself slightly better. There's no special-casing of types, so it's still fully general. I'm just making sure it still passes the tests now, but the benchmarks on my machine are:
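The shape of such a fix can be sketched as follows (an assumption about the approach, not the actual 2014 patch): do the bulk copy with no per-element length store, then write the length exactly once, so LLVM's memcpy recogniser has nothing to be suspicious of.

```rust
use std::ptr;

// Sketch: one bulk copy into a freshly allocated buffer, then a single
// length write, instead of a push-style loop that updates len each step.
fn clone_copy<T: Copy>(src: &[T]) -> Vec<T> {
    let mut dst: Vec<T> = Vec::with_capacity(src.len());
    unsafe {
        // Copy all elements in one shot; source and destination cannot
        // overlap because dst was just allocated.
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
        // Safe: capacity >= len and all len elements are initialized.
        dst.set_len(src.len());
    }
    dst
}

fn main() {
    let src = [0x78u8; 32];
    assert_eq!(clone_copy(&src), src.to_vec());
}
```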

Current rustc

running 5 tests
test clone_owned          ... bench:   6128100 ns/iter (+/- 300273)
test clone_owned_to_owned ... bench:   6159888 ns/iter (+/- 402183)
test clone_str            ... bench:     81172 ns/iter (+/- 26302)
test clone_vec            ... bench:   4061528 ns/iter (+/- 211693)
test test_memcpy          ... bench:     72390 ns/iter (+/- 4130)

My rustc

running 5 tests
test clone_owned          ... bench:     72636 ns/iter (+/- 5768)
test clone_owned_to_owned ... bench:     71967 ns/iter (+/- 5565)
test clone_str            ... bench:     76996 ns/iter (+/- 7083)
test clone_vec            ... bench:     72076 ns/iter (+/- 1591)
test test_memcpy          ... bench:     71596 ns/iter (+/- 1016)

Both are done with just -O.

bors added a commit that referenced this issue Apr 16, 2014
LLVM wasn't recognising the loops as memcpy loops and was therefore failing to optimise them properly. While improving LLVM is the "proper" way to fix this, I think that these cases are important enough to warrant a little low-level optimisation.

Fixes #13472 

r? @thestinger 

---

Benchmark Results:

```
--- Before ---
test clone_owned          ... bench:   6126104 ns/iter (+/- 285962) = 170 MB/s
test clone_owned_to_owned ... bench:   6125054 ns/iter (+/- 271197) = 170 MB/s
test clone_str            ... bench:     80586 ns/iter (+/- 11489) = 13011 MB/s
test clone_vec            ... bench:   3903220 ns/iter (+/- 658556) = 268 MB/s
test test_memcpy          ... bench:     69401 ns/iter (+/- 2168) = 15108 MB/s

--- After ---
test clone_owned          ... bench:     70839 ns/iter (+/- 4931) = 14801 MB/s
test clone_owned_to_owned ... bench:     70286 ns/iter (+/- 4836) = 14918 MB/s
test clone_str            ... bench:     78519 ns/iter (+/- 5511) = 13353 MB/s
test clone_vec            ... bench:     71415 ns/iter (+/- 1999) = 14682 MB/s
test test_memcpy          ... bench:     70980 ns/iter (+/- 2126) = 14772 MB/s
```