why sub and mul operation of f32x4 is slower than scalar version? #426

Open · Jad2wizard opened this issue Jun 22, 2024 · 4 comments
@Jad2wizard

// takes about 13 ms when data.len() is 40 million
fn sum_scalar(data: &[f32]) -> f32 {
    let mut sum = 0.0;
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        sum += (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
    sum
}

// takes about 16 ms when data.len() is 40 million
fn sum_simd(data: &[f32]) -> f32 {
    let pos = f32x4::load_or_default(&vec![2000.0, 2000.0, 2000.0]);
    let dir = f32x4::load_or_default(&vec![0.8, 0.6, 0.0]);

    let mut sum = 0.0;
    for i in 0..data.len() / 4 {
        let values = f32x4::from_array([
            data[i * 4 + 0],
            data[i * 4 + 1],
            data[i * 4 + 2],
            data[i * 4 + 3],
        ]);
        sum += ((values - pos) * dir).reduce_sum();
    }

    sum
}

CPU: Intel(R) Core(TM) i5-14400F 2.50 GHz
rustc 1.81.0-nightly (c1b336cb6 2024-06-21)

@jhorstmann

SIMD instruction sets are optimized for operating on independent lanes. This makes the reduce_sum operation relatively expensive. You can get better performance by using an f32x4 for the sum and then doing only a single horizontal reduction at the end.

Note that the compiler is not able to do this optimization for you, since float operations are not associative.
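
For illustration, a minimal sketch of that approach (assuming the nightly portable_simd feature; the pos/dir values are taken from the snippet above, with a zero in the unused fourth lane):

#![feature(portable_simd)]
use std::simd::prelude::*;

fn sum_simd_accum(data: &[f32]) -> f32 {
    // The fourth lane of dir is 0.0, so the fourth element of each chunk does not contribute.
    let pos = f32x4::from_array([2000.0, 2000.0, 2000.0, 0.0]);
    let dir = f32x4::from_array([0.8, 0.6, 0.0, 0.0]);

    // Accumulate per lane inside the loop ...
    let mut acc = f32x4::splat(0.0);
    for chunk in data.chunks_exact(4) {
        let values = f32x4::from_slice(chunk);
        acc += (values - pos) * dir;
    }
    // ... and pay for the horizontal reduction only once.
    acc.reduce_sum()
}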

On a nightly Rust version you could also look into the fadd_algebraic / fsub_algebraic / fmul_algebraic intrinsics, which allow the compiler to reorder and auto-vectorize scalar floating point operations.
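
For reference, a scalar sketch using those intrinsics might look like the following (assuming #![feature(core_intrinsics)] on nightly; the pos/dir constants are copied from the original snippet):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic, fsub_algebraic};

fn sum_scalar_algebraic(data: &[f32]) -> f32 {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    let mut sum = 0.0f32;
    for chunk in data.chunks_exact(4) {
        // Algebraic float ops let the compiler reassociate and vectorize this reduction.
        let dot = fadd_algebraic(
            fadd_algebraic(
                fmul_algebraic(fsub_algebraic(chunk[0], pos[0]), dir[0]),
                fmul_algebraic(fsub_algebraic(chunk[1], pos[1]), dir[1]),
            ),
            fmul_algebraic(fsub_algebraic(chunk[2], pos[2]), dir[2]),
        );
        sum = fadd_algebraic(sum, dot);
    }
    sum
}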

@Jad2wizard
Author

@jhorstmann Thank you so much. After replacing the - and * operations in sum_scalar with the fsub and fmul intrinsics, the runtime of sum_scalar decreased by about 20%.
But things start to get weird when I change the code to:

fn sum_scalar(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        res[i] = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}

unsafe fn sum_fast(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        *res.get_unchecked_mut(i) = fmul_fast(fsub_fast(x, pos[0]), dir[0])
                + fmul_fast(fsub_fast(y, pos[1]), dir[1])
                + fmul_fast(fsub_fast(z, pos[2]), dir[2])
    }
}

the runtimes of sum_scalar and sum_fast become almost equal. I am a beginner in Rust, so these results are weird to me.

@jhorstmann

@Jad2wizard very good question! I see two possible reasons for these having the same performance:

  • The compiler is not able to remove the bounds checks when accessing data. This means it still processes a single res value per loop iteration, and the overhead of the bounds checks masks any performance difference between the intrinsics and normal arithmetic.

  • The other issue is related to the data layout. For each res value, the code needs to read 3 different indices of data. If the compiler wants to vectorize this to produce, for example, 4 results at a time, each of the 4 lanes of x, y and z would need to be loaded from a different, non-consecutive address. Modern CPUs have gather instructions for that, but they are much slower than loading data that is consecutive in memory. Fixing this would require changing your data structures to have separate slices for x, y and z (see the sketch after the function below).

Without the data layout changes, this is how I would write the function to avoid the bounds checks:

pub fn sum_fast_no_bound_checks(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for (r, chunk) in res.iter_mut().zip(data.chunks_exact(4)) {
        let x = chunk[0];
        let y = chunk[1];
        let z = chunk[2];
        *r = fmul_algebraic(fsub_algebraic(x, pos[0]), dir[0])
                + fmul_algebraic(fsub_algebraic(y, pos[1]), dir[1])
                + fmul_algebraic(fsub_algebraic(z, pos[2]), dir[2])
    }
}

With chunks_exact the compiler knows that the chunk indices are in bounds, and zip ensures that the res element corresponding to each chunk is also in bounds.
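
With the data layout change from the second point above, a structure-of-arrays variant might look roughly like this (the separate xs/ys/zs slices are hypothetical, not part of the original code):

// Hypothetical structure-of-arrays layout: x, y and z each live in their own
// slice, so every load in the loop reads consecutive memory and the compiler
// can vectorize with plain unit-stride loads instead of gathers.
pub fn sum_soa(xs: &[f32], ys: &[f32], zs: &[f32], res: &mut [f32]) {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    for (((r, &x), &y), &z) in res.iter_mut().zip(xs).zip(ys).zip(zs) {
        *r = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}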

@Jad2wizard
Author

Thank you @jhorstmann, I will try changing the data layout as you mentioned.
