Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vec::clone and String::clone are very slow #17844

Closed
rusty-nail2 opened this issue Oct 7, 2014 · 6 comments
Closed

Vec::clone and String::clone are very slow #17844

rusty-nail2 opened this issue Oct 7, 2014 · 6 comments
Labels
A-collections Area: `std::collection` I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@rusty-nail2
Copy link

Platform: Windows, rust nightly.

The optimizations made to fix #13472 appear to have been undone by #15471 and 3316b1e

In 32-bit mode, rustc -O --test test.rs && ./test --bench gives:

running 7 tests
test clone_str                  ... bench:   1540862 ns/iter (+/- 25433)
test clone_vec_from_fn          ... bench:    745343 ns/iter (+/- 7535)
test clone_vec_from_incremental ... bench:   1088468 ns/iter (+/- 5358)
test fast_clone_vec             ... bench:    310484 ns/iter (+/- 3212)
test memory_copy                ... bench:    102162 ns/iter (+/- 1220)
test naive_clone_vec            ... bench:    745599 ns/iter (+/- 8111)

With -C target-cpu=x86-64 SSE enabled (-C target-feature=+sse2), things look better but String::clone still stands out:

running 7 tests
test clone_str                  ... bench:   1661609 ns/iter (+/- 60586)
test clone_vec_from_fn          ... bench:    359253 ns/iter (+/- 202820)
test clone_vec_from_incremental ... bench:    357418 ns/iter (+/- 3406)
test fast_clone_vec             ... bench:    309973 ns/iter (+/- 9762)
test memory_copy                ... bench:    102129 ns/iter (+/- 1317)
test naive_clone_vec            ... bench:    753919 ns/iter (+/- 8473)

Performance seems very dependent on minor details: clone_vec_from_incremental is much slower than clone_vec_from_fn in 32 bit non-SSE mode.

Benchmark program:

extern crate test;
extern crate core;

use test::{Bencher, black_box};
use core::ptr;

static SIZE: uint = 1024*1024;

#[bench]
fn clone_str(bh: &mut Bencher) {
    let mut x = String::with_capacity(SIZE);
    for _ in range(0, SIZE) {
        x.push('x');
    }
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_vec_from_incremental(bh: &mut Bencher) {
    let mut x: Vec<u8> = Vec::with_capacity(SIZE);
    for _ in range(0, SIZE) {
        x.push(0x78);
    }
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn clone_vec_from_fn(bh: &mut Bencher) {
    let x: Vec<u8> = Vec::from_fn(SIZE, |_| 0x78);
    bh.iter(|| black_box(x.clone()));
}

#[bench]
fn naive_clone_vec(bh: &mut Bencher) {
    let x: Vec<u8> = Vec::from_fn(SIZE, |_| 0x78);
    bh.iter(|| {
        black_box(Vec::from_fn(SIZE, |i| x[i]));
    });
}

#[bench]
fn fast_clone_vec(bh: &mut Bencher) {
    let x: Vec<u8> = Vec::from_fn(SIZE, |_| 0x78);
    bh.iter(|| {
        let mut copy = Vec::<u8>::with_capacity(SIZE);
        unsafe {
            ptr::copy_memory(copy.as_mut_ptr(), x.as_ptr(), x.len());
            copy.set_len(SIZE);
        }
        black_box(copy)
    });
}

#[bench]
fn memory_copy(bh: &mut Bencher) {
    let x: Vec<u8> = Vec::from_fn(SIZE, |_| 0x78);
    let mut y: Vec<u8> = Vec::from_fn(SIZE, |_| 0);
    bh.iter(|| {
        unsafe {
            ptr::copy_memory(y.as_mut_ptr(), x.as_ptr(), x.len());
        }
    })
}
@huonw huonw added I-slow Issue: Problems and improvements with respect to performance of generated code. A-libs labels Oct 7, 2014
@dotdash
Copy link
Contributor

dotdash commented Oct 7, 2014

Can't reproduce on x86_64 using either rustc 0.12.0-dev (dfbe9eb3b 2014-10-05 09:07:06 +0000) or rustc 0.12.0-dev (683e40f35 2014-10-07 14:32:09 +0000).

running 6 tests
test clone_str                  ... bench:    198461 ns/iter (+/- 2769)
test clone_vec_from_fn          ... bench:    197538 ns/iter (+/- 4357)
test clone_vec_from_incremental ... bench:    196808 ns/iter (+/- 3055)
test fast_clone_vec             ... bench:    268492 ns/iter (+/- 20548)
test memory_copy                ... bench:     64533 ns/iter (+/- 7712)
test naive_clone_vec            ... bench:    700255 ns/iter (+/- 17326)

Which rustc version are you using? Also I'm curious why your output says that it's running 7 tests but shows only 6 results.

@rusty-nail2
Copy link
Author

I removed an irrelevant test, but the results themselves are correct.

I'm using Windows nightlies downloaded a few hours before filing the issue, namely

$ rustc --version
rustc 0.12.0-nightly (dc987adfc 2014-10-04 23:42:07 +0000)

I just realized that -C target-cpu=x86-64 doesn't generate 64-bit binaries, it only enables things like SSE2 (-C target-feature=+sse2 has equivalent performance). So all my benchmarks are 32 bit.

So there are two problems:

  • low performance when SSE is not enabled; String::clone could use copy_memory for a 3x speedup in this configuration, which is the default on Windows and will be widely used.
  • odd behavior for clone_str.

Bizarrely, if you duplicate clone_str, the problem disappears from the second instance, making its performance match clone_vec_from_incremental.

Minimal test case:

#![crate_type = "lib"]

extern crate test;
use test::Bencher;

// These two functions are exactly the same.
pub fn clone_str_bad(x: &String, bh: &mut Bencher) {
    bh.iter(|| x.clone());
}
pub fn clone_str_good(x: &String, bh: &mut Bencher) {
    bh.iter(|| x.clone());
}

LLVM generates a non-inlined call to Vec::push_all that copies byte by byte for clone_str_bad, while clone_str_good optimizes into a fast SSE copy. If both functions were optimized correctly, LLVM would reduce the second one to a single jmp to the first one.

Perhaps this sort of thing is why f39ba69 claimed that "LLVM is easily confused".

@kmcallister kmcallister added the A-collections Area: `std::collection` label Jan 16, 2015
@emberian emberian self-assigned this Mar 25, 2015
@bluss
Copy link
Member

bluss commented May 6, 2015

I found that String::push_str and Vec::push_all didn't compile to memcpy either. Following you @rusty-nail2, it could be that in a bigger program, some instances of them will compile correctly?

@emberian emberian removed their assignment Jan 5, 2016
@briansmith
Copy link
Contributor

Maybe #31414 will help with this.

@bluss
Copy link
Member

bluss commented Feb 19, 2016

This seems to be fixed? String::clone could still be improved to unconditionally use memcpy (benefits debug perf)

@bluss
Copy link
Member

bluss commented Sep 10, 2016

String::clone is not slow (is not generic and is supplied by libstd). Vec::::clone uses .extend_from_slice(), but it looks good except in cases where #33518 hits, so it will be fixed with that issue.

Updated benchmark shows no problem. https://gist.github.com/bluss/3a0afb7345000ec76bb2baeadf10f921

bors added a commit that referenced this issue Sep 12, 2016
…d, r=alexcrichton

Work around pointer aliasing issue in Vec::extend_from_slice, extend_with_element

Due to missing noalias annotations for &mut T in general (issue #31681),
in larger programs extend_from_slice and extend_with_element may both
compile very poorly. What is observed is that the .set_len() calls are
not lifted out of the loop, even for `Vec<u8>`.

Use a local length variable for the Vec length instead, and use a scope
guard to write this value back to self.len when the scope ends or on
panic. Then the alias analysis is easy.

This affects extend_from_slice, extend_with_element, the vec![x; n]
macro, Write impls for Vec<u8>, BufWriter, etc (but may / may not
have triggered since inlining can be enough for the compiler to get it right).

Fixes #32155
Fixes #33518
Closes #17844
lnicola pushed a commit to lnicola/rust that referenced this issue Aug 13, 2024
fix: Fix find_path not respecting non-std preference config correctly

Fixes rust-lang/rust-analyzer#17840
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Area: `std::collection` I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

7 participants