why sub and mul operation of f32x4 is slower than scalar version? #426

Open · Jad2wizard opened this issue Jun 22, 2024 · 4 comments
@Jad2wizard

// takes about 13 ms when data.len() is 40 million
fn sum_scalar(data: &[f32]) -> f32 {
    let mut sum = 0.0;
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        sum += (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
    sum
}

// takes about 16 ms when data.len() is 40 million
fn sum_simd(data: &[f32]) -> f32 {
    let pos = f32x4::load_or_default(&vec![2000.0, 2000.0, 2000.0]);
    let dir = f32x4::load_or_default(&vec![0.8, 0.6, 0.0]);

    let mut sum = 0.0;
    for i in 0..data.len() / 4 {
        let values = f32x4::from_array([
            data[i * 4 + 0],
            data[i * 4 + 1],
            data[i * 4 + 2],
            data[i * 4 + 3],
        ]);
        sum += ((values - pos) * dir).reduce_sum();
    }

    sum
}

CPU: Intel(R) Core(TM) i5-14400F 2.50 GHz
rustc 1.81.0-nightly (c1b336cb6 2024-06-21)

@jhorstmann

SIMD instruction sets are optimized for operating on independent lanes. This makes the reduce_sum operation relatively expensive. You can get better performance by using an f32x4 for the sum and then doing only a single horizontal reduction at the end.

Note that the compiler is not able to do this optimization for you, since float operations are not associative.
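
For illustration, a minimal sketch of that approach (assuming the nightly portable_simd feature; the pos/dir values are taken from the snippet above, with a zero in the unused fourth lane):

#![feature(portable_simd)]
use std::simd::prelude::*;

fn sum_simd_accum(data: &[f32]) -> f32 {
    // The fourth lane of dir is 0.0, so the fourth element of each chunk does not contribute.
    let pos = f32x4::from_array([2000.0, 2000.0, 2000.0, 0.0]);
    let dir = f32x4::from_array([0.8, 0.6, 0.0, 0.0]);

    // Accumulate per lane inside the loop ...
    let mut acc = f32x4::splat(0.0);
    for chunk in data.chunks_exact(4) {
        let values = f32x4::from_slice(chunk);
        acc += (values - pos) * dir;
    }
    // ... and pay for the horizontal reduction only once.
    acc.reduce_sum()
}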

On a nightly Rust version you could also look into the fadd_algebraic / fsub_algebraic / fmul_algebraic intrinsics, which allow the compiler to reorder and auto-vectorize scalar floating point operations.
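
For reference, a scalar sketch using those intrinsics might look like the following (assuming #![feature(core_intrinsics)] on nightly; the pos/dir constants are copied from the original snippet):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic, fsub_algebraic};

fn sum_scalar_algebraic(data: &[f32]) -> f32 {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    let mut sum = 0.0f32;
    for chunk in data.chunks_exact(4) {
        // Algebraic float ops let the compiler reassociate and vectorize this reduction.
        let dot = fadd_algebraic(
            fadd_algebraic(
                fmul_algebraic(fsub_algebraic(chunk[0], pos[0]), dir[0]),
                fmul_algebraic(fsub_algebraic(chunk[1], pos[1]), dir[1]),
            ),
            fmul_algebraic(fsub_algebraic(chunk[2], pos[2]), dir[2]),
        );
        sum = fadd_algebraic(sum, dot);
    }
    sum
}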

@Jad2wizard
Author

@jhorstmann Thank you so much. After replacing the - and * operations in sum_scalar with the fsub and fmul intrinsics, the runtime of sum_scalar decreased by about 20%.
But things start to get weird when I change the code to:

fn sum_scalar(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        res[i] = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}

unsafe fn sum_fast(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        *res.get_unchecked_mut(i) = fmul_fast(fsub_fast(x, pos[0]), dir[0])
                + fmul_fast(fsub_fast(y, pos[1]), dir[1])
                + fmul_fast(fsub_fast(z, pos[2]), dir[2])
    }
}

the runtimes of sum_scalar and sum_fast become almost equal. I am a beginner in Rust, so these results are weird to me.

@jhorstmann

@Jad2wizard very good question! I see two possible reasons for these having the same performance:

  • The compiler is not able to remove the bounds checks when accessing data. This means it still processes a single res value per loop iteration, and the overhead of the bounds checks masks any performance difference between the intrinsics and normal arithmetic.

  • The other issue is related to the data layout. For each res value, the code needs to read 3 different indices of data. If the compiler wants to vectorize this to produce, for example, 4 results at a time, each of the 4 lanes of x, y and z would need to be loaded from a different, non-consecutive address. Modern CPUs have gather instructions for that, but they are much slower than loading data that is consecutive in memory. Fixing this would require changing your data structures to have separate slices for x, y and z (see the sketch after the function below).

Without the data layout changes, this is how I would write the function to avoid the bounds checks:

pub fn sum_fast_no_bound_checks(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for (r, chunk) in res.iter_mut().zip(data.chunks_exact(4)) {
        let x = chunk[0];
        let y = chunk[1];
        let z = chunk[2];
        *r = fmul_algebraic(fsub_algebraic(x, pos[0]), dir[0])
                + fmul_algebraic(fsub_algebraic(y, pos[1]), dir[1])
                + fmul_algebraic(fsub_algebraic(z, pos[2]), dir[2])
    }
}

With chunks_exact the compiler knows that the chunk indices are in bounds, and zip ensures that the res element corresponding to each chunk is also in bounds.
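
With the data layout change from the second point above, a structure-of-arrays variant might look roughly like this (the separate xs/ys/zs slices are hypothetical, not part of the original code):

// Hypothetical structure-of-arrays layout: x, y and z each live in their own
// slice, so every load in the loop reads consecutive memory and the compiler
// can vectorize with plain unit-stride loads instead of gathers.
pub fn sum_soa(xs: &[f32], ys: &[f32], zs: &[f32], res: &mut [f32]) {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    for (((r, &x), &y), &z) in res.iter_mut().zip(xs).zip(ys).zip(zs) {
        *r = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}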

@Jad2wizard
Author

Thank you @jhorstmann, I will try changing the data layout as you mentioned.
