Cloning a 1MB vector is 30x slower than cloning a 1MB ~str #13472

Closed
kmcallister opened this issue Apr 12, 2014 · 11 comments · Fixed by #13539
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@kmcallister (Contributor)

Benchmark program at the end. The results:

$ rustc -v
rustc 0.11-pre (cee9a83 2014-04-11 15:54:46 -0700)
host: x86_64-unknown-linux-gnu

$ rustc -O --test foo.rs && ./foo --bench

running 5 tests
test clone_owned          ... bench:   5319322 ns/iter (+/- 166831)
test clone_owned_to_owned ... bench:   5293984 ns/iter (+/- 125331)
test clone_str            ... bench:     85526 ns/iter (+/- 1333)
test clone_vec            ... bench:   3332139 ns/iter (+/- 17227)
test test_memcpy          ... bench:     85931 ns/iter (+/- 563)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured

That comes out to about 300 MB/s, which is really bad for a memory copy. I'm guessing this has to do with the fact that Vec<T> is generic. Its clone looks like

impl<T:Clone> Clone for Vec<T> {
    fn clone(&self) -> Vec<T> {
        self.iter().map(|x| x.clone()).collect()
    }
}

and LLVM probably isn't smart enough to optimize this down to a memcpy. But anyway, there is a need for efficient vectors of primitive numerical types. If this can't happen by optimization magic, we need something like

impl<T: POD> Vec<T> {
    fn fast_clone(&self) -> Vec<T> {
        let mut vector = Vec::with_capacity(self.len());
        unsafe {
            vector.set_len(self.len());
            vector.as_mut_slice().copy_memory(self.as_slice());
        }
        vector
    }
}
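In modern Rust the same idea needs no unsafe code at all: `extend_from_slice` on a `Vec<T: Copy>` lowers to a single memcpy. A minimal sketch (using today's API, not the 2014 one above):

```rust
// Sketch in modern Rust: for Copy element types, extend_from_slice
// compiles down to one bulk memcpy, no per-element clone loop.
fn fast_clone<T: Copy>(v: &[T]) -> Vec<T> {
    let mut out = Vec::with_capacity(v.len());
    out.extend_from_slice(v); // memcpy path for T: Copy
    out
}

fn main() {
    let src = vec![0x78u8; 16];
    assert_eq!(fast_clone(&src), src);
}
```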

(untested). I can also imagine a language feature which would let you write

impl<T: Clone> Clone for Vec<T> {
    fn clone(&self) -> Vec<T> {
        if implements_trait!(T, POD) {
            // ...
        } else {
            // ...
        }
    }
}

This is worryingly close to C++ template specialization, but might be worth it for core data structures. It's really counterintuitive if you need to use a special vector type or a special clone method to get acceptable performance on vectors of primitive integers.
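On stable Rust today, the closest runnable approximation of the `implements_trait!` idea is a runtime `TypeId` check (limited to `'static` types). The sketch below is purely illustrative — `clone_maybe_fast` is a hypothetical helper, and real compile-time specialization would erase the branch entirely:

```rust
use std::any::TypeId;

// Hedged illustration of the proposed implements_trait! branch, using a
// runtime TypeId check instead of compile-time specialization.
fn clone_maybe_fast<T: Clone + 'static>(v: &[T]) -> Vec<T> {
    if TypeId::of::<T>() == TypeId::of::<u8>() {
        // T is exactly u8 here, so the pointer cast is sound.
        let bytes: &[u8] =
            unsafe { std::slice::from_raw_parts(v.as_ptr() as *const u8, v.len()) };
        let copied = bytes.to_vec(); // memcpy path
        // Move the Vec<u8> back to Vec<T>; the two types are identical
        // at runtime, so read + forget transfers ownership without a drop.
        let out = unsafe { std::ptr::read(&copied as *const Vec<u8> as *const Vec<T>) };
        std::mem::forget(copied);
        out
    } else {
        v.iter().cloned().collect() // element-wise fallback
    }
}

fn main() {
    assert_eq!(clone_maybe_fast(&[0x78u8; 4]), vec![0x78u8; 4]);
    assert_eq!(clone_maybe_fast(&["a".to_string()]), vec!["a".to_string()]);
}
```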

Bonus weirdness: If you comment out clone_owned then clone_owned_to_owned gets significantly faster (though still way too slow):

running 4 tests
test clone_owned_to_owned ... bench:   3355442 ns/iter (+/- 69257)
test clone_str            ... bench:     78866 ns/iter (+/- 5433)
test clone_vec            ... bench:   3346685 ns/iter (+/- 134001)
test test_memcpy          ... bench:     85116 ns/iter (+/- 3570)

If you comment out clone_owned_to_owned instead, nothing in particular happens.

Here's the benchmark program:

extern crate test;
extern crate libc;

use test::{Bencher, black_box};
use libc::size_t;
use std::slice;

static size: uint = 1024*1024;

#[bench]
fn clone_str(bh: &mut Bencher) {
    let mut x = StrBuf::with_capacity(size);
    for _ in range(0, size) {
        x.push_char('x');
    }
    let x: ~str = x.into_owned();
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_vec(bh: &mut Bencher) {
    let mut x: Vec<u8> = Vec::with_capacity(size);
    for _ in range(0, size) {
        x.push(0x78);
    }
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_owned(bh: &mut Bencher) {
    let mut x: ~[u8] = slice::with_capacity(size);
    for _ in range(0, size) {
        x.push(0x78);
    }
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_owned_to_owned(bh: &mut Bencher) {
    let mut x: ~[u8] = slice::with_capacity(size);
    for _ in range(0, size) {
        x.push(0x78);
    }
    let y = x.to_owned();
    bh.iter(|| black_box(y.clone()));
}

extern {
    fn memcpy(dest: *mut u8, src: *u8, n: size_t);
}

#[bench]
fn test_memcpy(bh: &mut Bencher) {
    let src = ~[0x78_u8, ..size];
    let mut dst = ~[0_u8, ..size];
    bh.iter(|| {
        unsafe {
            memcpy(dst.as_mut_ptr(), src.as_ptr(), size as u64);
        }
    })
}
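The program above uses long-removed APIs (`test::Bencher`, `~str`, `~[T]`, `StrBuf`). A rough modern-Rust analogue — an assumption on my part, using plain `Instant` timing rather than the original harness — looks like:

```rust
use std::hint::black_box;
use std::time::Instant;

// Hedged modern re-creation of the clone_vec / clone_str benchmarks.
const SIZE: usize = 1024 * 1024;

fn bench<F: FnMut()>(name: &str, mut f: F) {
    const ITERS: u32 = 100;
    let t = Instant::now();
    for _ in 0..ITERS {
        f();
    }
    println!("{:<10} {:?}/iter", name, t.elapsed() / ITERS);
}

fn main() {
    let vec_src = vec![0x78u8; SIZE];
    let str_src = "x".repeat(SIZE);
    assert_eq!(vec_src.len(), SIZE);
    bench("clone_vec", || {
        black_box(vec_src.clone()); // today this is a single memcpy
    });
    bench("clone_str", || {
        black_box(str_src.clone());
    });
}
```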
@thestinger (Contributor)

I think this is probably #11751. I don't think we need to change any of the vector code; it's a fixable bug.

@huonw huonw added the I-slow label Apr 12, 2014
@thestinger (Contributor)

#11015 might also be related.

@kmcallister (Contributor, Author)

(Actually, I suppose I'd want fast_to_owned and then fast_clone is a one-liner using that.)

@adrientetar (Contributor)

Pod was renamed to Copy, fwiw.

@Aatch (Contributor) commented Apr 15, 2014

I'm looking into this (and the related bugs) and it seems that LLVM is suspicious of the write to the vector length during the copy loop and therefore doesn't convert the loop to a memcpy. I'm currently coaxing the relevant code to allow LLVM to optimise to a memcpy where it can.

@thestinger (Contributor)

@Aatch: What if you mark the &self pointer as noalias? It would be correct to have rustc mark all &T pointers to types without Unsafe inside that way.

@Aatch Aatch self-assigned this Apr 15, 2014
@Aatch (Contributor) commented Apr 15, 2014

@thestinger I think it might be more than that. The bounds checks seem to cause the memcpy recogniser to fail, even though they actually get optimised out anyway.

@thestinger (Contributor)

AFAIK, there are no bounds checks here. It may be hitting the issue that leaves a redundant null check in the loop, due to a missed optimization in LLVM. That's caused by the lack of a way to tell LLVM a pointer is non-null, but I expect LLVM could do better even without it.

I guess you mean the comparison of the length against the capacity in push?

@nikomatsakis (Contributor)

(If we want to, we can also rewrite the vec code to use the ty_is_pod intrinsic or whatever it's called)

@thestinger (Contributor)

Using Copy will only fix a subset of this problem, because many types are not Copy but still have a trivial Clone implementation. The bug will still need to remain open if that's the attempted fix.

@Aatch (Contributor) commented Apr 15, 2014

So I did the pragmatic thing and just made the code itself slightly better. There's no special-casing of types, so it's still fully general. I'm just making sure it still passes the tests now, but the benchmarks on my machine are:
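The shape of such a fix can be sketched as follows (an assumption about the approach, not the actual 2014 patch): do the bulk copy with no per-element length store, then write the length exactly once, so LLVM's memcpy recogniser has nothing to be suspicious of.

```rust
use std::ptr;

// Sketch: one bulk copy into a freshly allocated buffer, then a single
// length write, instead of a push-style loop that updates len each step.
fn clone_copy<T: Copy>(src: &[T]) -> Vec<T> {
    let mut dst: Vec<T> = Vec::with_capacity(src.len());
    unsafe {
        // Copy all elements in one shot; source and destination cannot
        // overlap because dst was just allocated.
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
        // Safe: capacity >= len and all len elements are initialized.
        dst.set_len(src.len());
    }
    dst
}

fn main() {
    let src = [0x78u8; 32];
    assert_eq!(clone_copy(&src), src.to_vec());
}
```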

Current rustc

running 5 tests
test clone_owned          ... bench:   6128100 ns/iter (+/- 300273)
test clone_owned_to_owned ... bench:   6159888 ns/iter (+/- 402183)
test clone_str            ... bench:     81172 ns/iter (+/- 26302)
test clone_vec            ... bench:   4061528 ns/iter (+/- 211693)
test test_memcpy          ... bench:     72390 ns/iter (+/- 4130)

My rustc

running 5 tests
test clone_owned          ... bench:     72636 ns/iter (+/- 5768)
test clone_owned_to_owned ... bench:     71967 ns/iter (+/- 5565)
test clone_str            ... bench:     76996 ns/iter (+/- 7083)
test clone_vec            ... bench:     72076 ns/iter (+/- 1591)
test test_memcpy          ... bench:     71596 ns/iter (+/- 1016)

Both are done with just -O.

bors added a commit that referenced this issue Apr 16, 2014
LLVM wasn't recognising the loops as memcpy loops and was therefore failing to optimise them properly. While improving LLVM is the "proper" way to fix this, I think that these cases are important enough to warrant a little low-level optimisation.

Fixes #13472 

r? @thestinger 

---

Benchmark Results:

```
--- Before ---
test clone_owned          ... bench:   6126104 ns/iter (+/- 285962) = 170 MB/s
test clone_owned_to_owned ... bench:   6125054 ns/iter (+/- 271197) = 170 MB/s
test clone_str            ... bench:     80586 ns/iter (+/- 11489) = 13011 MB/s
test clone_vec            ... bench:   3903220 ns/iter (+/- 658556) = 268 MB/s
test test_memcpy          ... bench:     69401 ns/iter (+/- 2168) = 15108 MB/s

--- After ---
test clone_owned          ... bench:     70839 ns/iter (+/- 4931) = 14801 MB/s
test clone_owned_to_owned ... bench:     70286 ns/iter (+/- 4836) = 14918 MB/s
test clone_str            ... bench:     78519 ns/iter (+/- 5511) = 13353 MB/s
test clone_vec            ... bench:     71415 ns/iter (+/- 1999) = 14682 MB/s
test test_memcpy          ... bench:     70980 ns/iter (+/- 2126) = 14772 MB/s
```