why sub and mul operation of f32x4 is slower than scalar version? #426
Comments
SIMD instruction sets are optimized for operating on independent lanes. This makes the [...] Note that the compiler is not able to do this optimization for you, since float operations are not associative. On a nightly Rust version you could also look into the [...]
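To illustrate the point about associativity, here is a minimal sketch in stable Rust. The function names `sum_left` and `sum_right` and the input values are chosen for illustration only and do not come from the thread. The same three `f32` values give different results depending on how the additions are grouped, which is why the compiler may not reorder float operations unless you opt in explicitly.

```rust
/// Adds the left pair first: (a + b) + c.
fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Adds the right pair first: a + (b + c).
fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}

fn main() {
    // Illustrative values: c is far smaller than the rounding step of b,
    // so it is lost when added to b first.
    let (a, b, c) = (1.0e20_f32, -1.0e20_f32, 1.0_f32);
    println!("(a + b) + c = {}", sum_left(a, b, c));
    println!("a + (b + c) = {}", sum_right(a, b, c));
}
```

Because the two groupings can differ, a reduction across SIMD lanes is only legal for the compiler when the source code explicitly permits reordering.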
@jhorstmann Thank you so much. After replacing the - and * operators in sum_scalar with fsub and fmul, the runtime of sum_scalar decreased by about 20%.

fn sum_scalar(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        res[i] = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}
// Nightly only: needs #![feature(core_intrinsics)] at the crate root.
use std::intrinsics::{fmul_fast, fsub_fast};

unsafe fn sum_fast(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        *res.get_unchecked_mut(i) = fmul_fast(fsub_fast(x, pos[0]), dir[0])
            + fmul_fast(fsub_fast(y, pos[1]), dir[1])
            + fmul_fast(fsub_fast(z, pos[2]), dir[2]);
    }
}

The runtime of sum_scalar and sum_fast became almost equal. I am a beginner in Rust, so the results seem weird to me.
@Jad2wizard very good question! I see two possible reasons for these having the same performance:
Without the data layout changes, this is how I would write the function to avoid the bounds checks:

pub fn sum_fast_no_bound_checks(data: &[f32], res: &mut [f32]) {
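    // Nightly only: needs #![feature(core_intrinsics)] at the crate root.
    use std::intrinsics::{fmul_algebraic, fsub_algebraic};
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    // Zipping with chunks_exact lets the compiler prove every index is in range.
    for (r, chunk) in res.iter_mut().zip(data.chunks_exact(4)) {
        let x = chunk[0];
        let y = chunk[1];
        let z = chunk[2];
        *r = fmul_algebraic(fsub_algebraic(x, pos[0]), dir[0])
            + fmul_algebraic(fsub_algebraic(y, pos[1]), dir[1])
            + fmul_algebraic(fsub_algebraic(z, pos[2]), dir[2]);
    }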
}

With [...]
Thank you @jhorstmann, I will try changing the data layout as you mentioned.
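The exact layout change suggested in the thread is not preserved here, so the following is only a sketch of one common interpretation: a struct-of-arrays layout in which the x, y and z coordinates live in separate slices. The names `project_aos` and `project_soa` and the sample data are hypothetical. With this layout every output element depends on the same index in each input slice, so the loop maps directly onto independent SIMD lanes and the compiler can auto-vectorize it in stable Rust without reordering any float operations.

```rust
/// Layout used in the thread: 4 floats per point (x, y, z, unused).
fn project_aos(data: &[f32], res: &mut [f32]) {
    let pos = [2000.0_f32, 2000.0, 2000.0];
    let dir = [0.8_f32, 0.6, 0.0];
    for (r, p) in res.iter_mut().zip(data.chunks_exact(4)) {
        *r = (p[0] - pos[0]) * dir[0] + (p[1] - pos[1]) * dir[1] + (p[2] - pos[2]) * dir[2];
    }
}

/// Hypothetical struct-of-arrays variant: one slice per coordinate.
fn project_soa(xs: &[f32], ys: &[f32], zs: &[f32], res: &mut [f32]) {
    let pos = [2000.0_f32, 2000.0, 2000.0];
    let dir = [0.8_f32, 0.6, 0.0];
    // Re-slice everything to a common length so the indexing below
    // is provably in bounds.
    let n = res.len().min(xs.len()).min(ys.len()).min(zs.len());
    let (xs, ys, zs, res) = (&xs[..n], &ys[..n], &zs[..n], &mut res[..n]);
    for i in 0..n {
        res[i] = (xs[i] - pos[0]) * dir[0] + (ys[i] - pos[1]) * dir[1] + (zs[i] - pos[2]) * dir[2];
    }
}

fn main() {
    // Two sample points in both layouts (values are illustrative only).
    let aos = [2001.0_f32, 2002.0, 2003.0, 0.0, 2010.0, 2000.0, 1990.0, 0.0];
    let (xs, ys, zs) = ([2001.0_f32, 2010.0], [2002.0_f32, 2000.0], [2003.0_f32, 1990.0]);
    let mut res_aos = [0.0_f32; 2];
    let mut res_soa = [0.0_f32; 2];
    project_aos(&aos, &mut res_aos);
    project_soa(&xs, &ys, &zs, &mut res_soa);
    println!("aos: {:?}, soa: {:?}", res_aos, res_soa);
}
```

Both functions perform the same operations in the same order per point, so their results should match exactly. Whether the struct-of-arrays version is actually faster depends on the target CPU and compiler flags, and should be measured.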
cpu: Intel(R) Core(TM) i5-14400F 2.50 GHz
rustc 1.81.0-nightly (c1b336cb6 2024-06-21)