Use std::arch for SIMD and target_feature #46

Open
bluss opened this issue Jan 9, 2016 · 9 comments

Comments

@bluss
Member

bluss commented Jan 9, 2016

See rust-lang/rust/issues/29717

Use cfg(target_feature) to select the implementation for the unrolled dot product and scalar sum.
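
A minimal sketch of what compile-time selection could look like, assuming plain f64 slices rather than ndarray's own types. The function names (dot, dot_simple, dot_unrolled) and the choice of "avx2" as the gating feature are illustrative assumptions, not ndarray's actual internals.

```rust
/// Straightforward dot product, used when the wider feature is not enabled.
pub fn dot_simple(xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

/// Dot product with eight independent accumulators, which gives the
/// compiler room to vectorize the inner loop.
pub fn dot_unrolled(xs: &[f64], ys: &[f64]) -> f64 {
    let len = xs.len().min(ys.len());
    let (xs, ys) = (&xs[..len], &ys[..len]);
    let mut acc = [0.0f64; 8];
    let mut xc = xs.chunks_exact(8);
    let mut yc = ys.chunks_exact(8);
    for (x, y) in (&mut xc).zip(&mut yc) {
        for ((a, x), y) in acc.iter_mut().zip(x).zip(y) {
            *a += x * y;
        }
    }
    // Fold the accumulators, then handle the elements that did not fill a chunk.
    let mut sum: f64 = acc.iter().sum();
    for (x, y) in xc.remainder().iter().zip(yc.remainder()) {
        sum += x * y;
    }
    sum
}

/// Selected at compile time: the unrolled version is used only when the
/// crate is built with AVX2 enabled (assumed gating feature).
#[cfg(target_feature = "avx2")]
pub fn dot(xs: &[f64], ys: &[f64]) -> f64 {
    dot_unrolled(xs, ys)
}

#[cfg(not(target_feature = "avx2"))]
pub fn dot(xs: &[f64], ys: &[f64]) -> f64 {
    dot_simple(xs, ys)
}
```

The cfg is resolved when the crate is compiled, so the unrolled path is only picked up with something like RUSTFLAGS="-C target-feature=+avx2" or -C target-cpu=native.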

@bluss changed the title from "Use cfg(target_feature=) when stable" to "Use std::arch for SIMD and target_feature" on Nov 13, 2018
@bluss
Member Author

bluss commented Nov 13, 2018

Preferred approach would be to move the heavy lifting and inner loops (dot product etc) to a separate crate in the style of https://github.com/bluss/numeric-loops or another existing already simdified crate.

@SparrowLii
Contributor

SparrowLii commented Mar 9, 2021

@bluss I am contributing to std::arch to help make it a stable feature as soon as possible. I would like to take on the SIMD implementation for ndarray. I think we could create a new branch from master for implementation and discussion.

The following is a very simple example:

#![feature(stdsimd)]
#![feature(stdsimd_internal)]
use ndarray::*;
use core_arch::simd::*;
use core_arch::simd_llvm::*;
use std::intrinsics::transmute;
use core_arch::arch::x86_64::{__m128bh, m128bhExt};

// Just for demonstration; a much faster approach should be used in practice.
pub fn simd_arr1(xs: &[i32]) -> Array1<i32x4> {
    let len = xs.len();
    assert!(len % 4 == 0);
    let mut i = 0;
    let mut v: Vec<i32x4> = Vec::new();
    while i + 4 <= len {
        v.push(i32x4::new(xs[i], xs[i+1], xs[i+2], xs[i+3]));
        i += 4;
    }
    ArrayBase::from(v)
}

fn main() {
    let a = arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let b = arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let c = Zip::from(&a).and(&b).map_collect(|x, y| x * y);
    println!("{}", c);

    let a_simd = simd_arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let b_simd = simd_arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    unsafe {
        let c_simd = Zip::from(&a_simd).and(&b_simd).map_collect(|x, y| simd_mul(transmute::<_, __m128bh>(x.clone()), transmute::<_, __m128bh>(y.clone())).as_i32x4());
        println!("{:?}", c_simd);
    }
}

Output:

[1, 4, 9, 16, 25, 36, 49, 64]
[i32x4(1, 4, 9, 16), i32x4(25, 36, 49, 64)], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1

@bluss
Member Author

bluss commented Mar 9, 2021

Hey, it's good that we talk about this before you get started. Note that this issue is not intended to be about arrays of those explicit SIMD types at all; that would be a different design. Accelerating operations on Array<f64, _> would be a lot more interesting.

IMO the SIMD that we are most interested in, for x86 at least, is already stable.

Note also that in this issue I have suggested that any SIMD code like that should live in a new crate that we depend on. That means it would not be part of the ndarray crate.
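
For illustration, a hedged sketch of an element-wise f64 multiply using only stable std::arch intrinsics on x86_64, with runtime feature detection and a scalar fallback. It works on plain slices; the function names (mul_into, mul_avx, mul_scalar) are made up for this example and are not part of ndarray or any existing crate.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar fallback: out[i] = a[i] * b[i] over the common length.
fn mul_scalar(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x * y;
    }
}

/// AVX version: four f64 lanes per iteration, scalar loop for the tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn mul_avx(a: &[f64], b: &[f64], out: &mut [f64]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 4 <= n {
        // SAFETY: i + 4 <= n, so all three pointers cover four valid elements;
        // the unaligned load/store variants have no alignment requirement.
        unsafe {
            let va = _mm256_loadu_pd(a.as_ptr().add(i));
            let vb = _mm256_loadu_pd(b.as_ptr().add(i));
            _mm256_storeu_pd(out.as_mut_ptr().add(i), _mm256_mul_pd(va, vb));
        }
        i += 4;
    }
    mul_scalar(&a[i..n], &b[i..n], &mut out[i..n]);
}

/// Public entry point: checks for AVX at runtime, otherwise uses the fallback.
pub fn mul_into(a: &[f64], b: &[f64], out: &mut [f64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just verified on this CPU.
            unsafe { mul_avx(a, b, out) };
            return;
        }
    }
    mul_scalar(a, b, out);
}
```

A kernel like this could sit in a separate crate and take contiguous slices, which keeps the unsafe code out of ndarray itself.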

@SparrowLii
Contributor

SparrowLii commented Mar 9, 2021

@bluss Then I hope we can create such a crate in rust-ndarray (instead of a personal crate).
Do we need a crate similar to universal intrinsics? We could also refer to usimd in numpy.
Yes, std::arch for x86 and x86_64 is already stable, so I can start from there right away.

@SparrowLii
Contributor

SparrowLii commented Mar 13, 2021

I tried using SIMD in the operator overloading of multiplication (here).
The avx512f instructions are used from another crate.
I then ran a SIMD test on 500x500 arrays. main.rs:

use ndarray::Array;
use std::time;
use ndarray_rand::RandomExt;
use ndarray_rand::rand::distributions::Uniform;

fn main() {
    // f64
    let a = Array::random((500, 500), Uniform::new(0., 2.));
    let b = Array::random((500, 500), Uniform::new(0., 2.));
    let start = time::SystemTime::now();
    let c_simd = &a * &b;
    let end = time::SystemTime::now();
    println!("simd f64 {:?}",end.duration_since(start).unwrap());

    let start = time::SystemTime::now();
    let c = a * b;
    let end = time::SystemTime::now();
    println!("normal f64 {:?}",end.duration_since(start).unwrap());
    assert_eq!(c_simd, c);

    // i32
    let a = Array::random((500, 500), Uniform::new(0, 255));
    let b = Array::random((500, 500), Uniform::new(0, 255));
    let start = time::SystemTime::now();
    let c_simd = &a * &b;
    let end = time::SystemTime::now();
    println!("simd i32 {:?}",end.duration_since(start).unwrap());

    let start = time::SystemTime::now();
    let c = a * b;
    let end = time::SystemTime::now();
    println!("normal i32 {:?}",end.duration_since(start).unwrap());
    assert_eq!(c_simd, c);
}

The result is as follows:

simd f64 6.6887ms
normal f64 14.7793ms
simd i32 3.4118ms
normal i32 13.6641ms

The f64 operation is sped up by more than 2x, and the i32 operation by about 4x.
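
Timings like these can be sensitive to clock adjustments and cold caches. As a hedged sketch (not the harness used for the numbers above), a small helper built on the monotonic std::time::Instant with one warm-up run could look like this; best_of is a hypothetical name.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once as a warm-up, then `runs` more times, and returns the
/// shortest observed wall time measured with a monotonic clock.
pub fn best_of<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    f(); // warm-up: fills caches, not timed
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}
```

Usage would be along the lines of best_of(10, || { let _c = &a * &b; }), with both variants measured the same way so that allocation and consumption of the operands are comparable.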

I'm wondering if I am working in the right direction.

@SparrowLii
Contributor

@bluss Could you help point out which methods in ndarray should use SIMD first?

@SparrowLii
Contributor

SparrowLii commented Apr 29, 2021

Here is my plan:

  1. Build an easier-to-use SIMD crate on top of stdarch and stdsimd that detects hardware features automatically and does not distinguish between vector lengths.
  2. Help the compiler team complete specialization. That way, SIMD acceleration can be achieved with few changes to ndarray, and it could also solve the broadcasting issue.
    This looks crazy, but I will try my best.
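
To make step 1 a little more concrete, here is a rough sketch of one way such a crate might hide hardware detection and vector width behind a single entry point: the kernel is chosen once at runtime and cached as a function pointer. All names (sum, select_kernel, sum_avx) are hypothetical, and only stable std::arch intrinsics for x86_64 are used.

```rust
use std::sync::OnceLock;

type SumFn = fn(&[f64]) -> f64;

/// Portable fallback.
fn sum_scalar(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

/// AVX kernel: accumulates four f64 lanes at a time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(xs: &[f64]) -> f64 {
    use std::arch::x86_64::*;
    let chunks = xs.chunks_exact(4);
    let tail = chunks.remainder();
    let mut lanes = [0.0f64; 4];
    // SAFETY: every chunk has exactly four elements and `lanes` has four
    // slots; the unaligned load/store variants need no alignment.
    unsafe {
        let mut acc = _mm256_setzero_pd();
        for c in chunks {
            acc = _mm256_add_pd(acc, _mm256_loadu_pd(c.as_ptr()));
        }
        _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
    }
    lanes.iter().sum::<f64>() + tail.iter().sum::<f64>()
}

/// Safe wrapper; only handed out after AVX has been detected.
#[cfg(target_arch = "x86_64")]
fn sum_avx_checked(xs: &[f64]) -> f64 {
    // SAFETY: select_kernel returns this function only when AVX is available.
    unsafe { sum_avx(xs) }
}

/// Picks the best available kernel for the current CPU.
fn select_kernel() -> SumFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            return sum_avx_checked;
        }
    }
    sum_scalar
}

/// Callers see one function, whatever the vector width underneath.
pub fn sum(xs: &[f64]) -> f64 {
    static KERNEL: OnceLock<SumFn> = OnceLock::new();
    (KERNEL.get_or_init(select_kernel))(xs)
}
```

Note that the SIMD path adds elements in a different order than the scalar path, so results can differ in the last bits for general floating-point input.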

@dafmdev

dafmdev commented Jul 26, 2023

I think you may be interested in this project. Once SIMD is in std, ndarray could possibly support it to further improve its performance.

@skewballfox

skewballfox commented Jan 4, 2024

Preferred approach would be to move the heavy lifting and inner loops (dot product etc) to a separate crate in the style of https://github.com/bluss/numeric-loops or another existing already simdified crate.

Is anybody working on this, or any reason I shouldn't attempt it?

Just to clarify, I'm assuming this means extracting the internals (such as loops and basic operations) of the existing ndarray functions into a separate crate, ndarray-core, which can then be feature-flagged or swapped for another?
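
One possible reading of that, as a sketch only: the inner loops sit behind a small trait in the separate crate, and a Cargo feature picks the default implementation. The trait Kernels, the types Scalar and Unrolled, and the feature name "unrolled" are all hypothetical and not taken from ndarray or numeric-loops.

```rust
/// Inner-loop operations that ndarray could delegate to a kernel crate.
pub trait Kernels {
    fn dot(xs: &[f64], ys: &[f64]) -> f64;
    fn sum(xs: &[f64]) -> f64;
}

/// Plain iterator-based implementation.
pub struct Scalar;

impl Kernels for Scalar {
    fn dot(xs: &[f64], ys: &[f64]) -> f64 {
        xs.iter().zip(ys).map(|(x, y)| x * y).sum()
    }
    fn sum(xs: &[f64]) -> f64 {
        xs.iter().sum()
    }
}

/// Implementation with four independent accumulators per loop.
pub struct Unrolled;

impl Kernels for Unrolled {
    fn dot(xs: &[f64], ys: &[f64]) -> f64 {
        let n = xs.len().min(ys.len());
        let mut acc = [0.0f64; 4];
        let mut i = 0;
        while i + 4 <= n {
            for k in 0..4 {
                acc[k] += xs[i + k] * ys[i + k];
            }
            i += 4;
        }
        let mut total: f64 = acc.iter().sum();
        // Remaining elements that did not fill a group of four.
        while i < n {
            total += xs[i] * ys[i];
            i += 1;
        }
        total
    }
    fn sum(xs: &[f64]) -> f64 {
        let chunks = xs.chunks_exact(4);
        let tail = chunks.remainder();
        let mut acc = [0.0f64; 4];
        for c in chunks {
            for (a, v) in acc.iter_mut().zip(c) {
                *a += v;
            }
        }
        acc.iter().sum::<f64>() + tail.iter().sum::<f64>()
    }
}

/// The implementation used by default, swapped by a (hypothetical) feature flag.
#[cfg(feature = "unrolled")]
pub type DefaultKernels = Unrolled;
#[cfg(not(feature = "unrolled"))]
pub type DefaultKernels = Scalar;
```

The calling side would then use <DefaultKernels as Kernels>::dot(a, b) on contiguous slices, so swapping the backend does not touch the array code.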
