Investigate autovectorization-based filtering #19

Shnatsel · 2022-12-24T07:02:27Z

image-rs/image-png#363 has landed 100% safe Rust filters that benefit from autovectorization, without using any unsafe code.

The average filter was optimized into vector instructions earlier, also without unsafe: image-rs/image-png#198

It would be nice to investigate if that approach can replace some or all of the explicit unsafe used in zune-png.

etemesi254 (Maintainer) · 2022-12-24T08:31:05Z

Hi, those are for compression, not decompression, but let me give my thoughts on optimizing de-filtering operations.

I was working on the filters just yesterday, and I was hoping the scalar versions would work better than SSE or AVX.
Sadly, no.

Here is the 100% safe Rust vectorized Paeth from image-rs: https://godbolt.org/z/s7f1js854. It does get vectorized, I agree.

Here is the SSE version used here, in spng, and in libpng: https://godbolt.org/z/zqqYzMK1T

It's not that hard to tell which is faster.

The essential thing about filters and how to make them faster is their inherent dependency chains.

They are divided into the following:
Up+None => no dependency chains

Sub+Paeth+Avg => have a dependency chain on previous pixels in the same row.

For the latter, you cannot vectorize across pixels: the previous pixel has to finish before you can start the next one.
So there are generally two ways to optimize them.

For scalar, chase high instructions-per-cycle (IPC) rates.
For vectorization, lump the pixel calculation into one vector, i.e. calculate the red, green and blue channels of a pixel in one vector operation, which is what the SSE versions do.
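To make the dependency-chain split concrete, here is a minimal sketch of an Up and a Sub unfilter. The function names and the in-place signatures are assumptions for illustration, not the zune-png API.

```rust
/// Up: each output byte depends only on the byte directly above it.
/// No iteration needs the result of an earlier one, so the loop is
/// free to be vectorized across pixels.
pub fn unfilter_up(prev_row: &[u8], current: &mut [u8]) {
    for (cur, above) in current.iter_mut().zip(prev_row) {
        *cur = cur.wrapping_add(*above);
    }
}

/// Sub: each output byte depends on the already reconstructed byte
/// `bpp` positions to its left, so pixel N+1 has to wait for pixel N.
/// Only the `bpp` bytes inside one pixel are independent of each other.
pub fn unfilter_sub(current: &mut [u8], bpp: usize) {
    for i in bpp..current.len() {
        current[i] = current[i].wrapping_add(current[i - bpp]);
    }
}
```

In `unfilter_sub` the read of `current[i - bpp]` is the loop-carried dependency: the only parallelism left is across the channels of a single pixel, which is the part the SSE versions pack into one vector.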

Tell the compiler exactly how things will look.

Take for example the image-rs Paeth unfilter, which is vectorized at https://godbolt.org/z/T7q3nxaoe, but it's useless work.

The compiler assumes bpp will be large enough, i.e. something like a vector width, and generates SSE for that. The creators know that bpp stays small, since we can have a maximum of 4 channels (r, g, b, a), but the compiler doesn't. So no matter how cute and fast the code may seem, every SSE path is a cold path that won't be hit today, or ever.
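One way to hand the compiler that knowledge is to make the pixel size a compile-time constant and dispatch on it once per row. The sketch below does this for the Sub filter with a const generic; the names `unfilter_sub_fixed` and `unfilter_sub_dispatch` are hypothetical, and whether the generated code is actually better should be checked on godbolt for the target in question.

```rust
/// Sub unfilter with the pixel size known at compile time.
/// The inner loop has a constant trip count, so the compiler can
/// unroll it fully instead of preparing a wide-vector path for a
/// `bpp` it cannot bound. Trailing bytes that do not form a whole
/// pixel are left untouched.
fn unfilter_sub_fixed<const BPP: usize>(current: &mut [u8]) {
    // Reconstructed pixel to the left; all zeroes before the first pixel.
    let mut left = [0u8; BPP];
    for px in current.chunks_exact_mut(BPP) {
        for (byte, l) in px.iter_mut().zip(left.iter_mut()) {
            *byte = byte.wrapping_add(*l);
            *l = *byte;
        }
    }
}

/// Branch once on the runtime pixel size, then run a specialized loop.
pub fn unfilter_sub_dispatch(current: &mut [u8], bpp: usize) {
    match bpp {
        1 => unfilter_sub_fixed::<1>(current),
        2 => unfilter_sub_fixed::<2>(current),
        3 => unfilter_sub_fixed::<3>(current),
        4 => unfilter_sub_fixed::<4>(current),
        // Anything else (e.g. 6 or 8 bytes per pixel for 16-bit
        // samples) takes the generic runtime-`bpp` loop in this sketch.
        _ => {
            for i in bpp..current.len() {
                current[i] = current[i].wrapping_add(current[i - bpp]);
            }
        }
    }
}
```

The design choice is the same one the scalar `handle_paeth` makes with its `match components`: one branch per row, then straight-line per-channel code inside the loop.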

To get fast scalar filters, you do manual unrolling. The current pixel may depend on the previous one, but the channels within a pixel do not depend on each other, i.e. calculating Paeth for red won't depend on calculating it for green, and the same applies to blue.

So for those, you branch on the number of channels, explicitly creating a branch for each channel count and optimizing carefully so that the compiler doesn't insert panics in the hot code. See my scalar version:

zune-image/zune-png/src/filters.rs

Lines 296 to 466 in d8c975c

    pub fn handle_paeth(prev_row: &[u8], raw: &[u8], current: &mut [u8], components: usize)
    {
        match components
        {
            1 =>
            {
                // handle leftmost byte explicitly
                current[0] = raw[0].wrapping_add(paeth(0, prev_row[0], 0));
                let mut max_recon = current[0];
                let mut max_recon_c = prev_row[0];

                for ((filt, recon_b), out_px) in raw[1..]
                    .iter()
                    .zip(prev_row[1..].iter())
                    .zip(current[1..].iter_mut())
                {
                    let paeth_res = paeth(max_recon, *recon_b, max_recon_c);
                    *out_px = (*filt).wrapping_add(paeth_res);
                    // setup for the following iteration
                    max_recon_c = *recon_b;
                    max_recon = *out_px;
                }
            }
            2 =>
            {
                const COMP: usize = 2;
                let mut max_recon_a: [u8; COMP] = [0; COMP];
                let mut max_recon_c: [u8; COMP] = [0; COMP];

                if current.len() < COMP || raw.len() < COMP || prev_row.len() < COMP
                {
                    return;
                }
                // handle leftmost byte explicitly
                for i in 0..COMP
                {
                    current[i] = raw[i].wrapping_add(paeth(0, prev_row[i], 0));
                    max_recon_a[i] = current[i];
                    max_recon_c[i] = prev_row[i];
                }

                for ((filt, recon_b), out_px) in raw[COMP..]
                    .chunks_exact(COMP)
                    .zip(prev_row[COMP..].chunks_exact(COMP))
                    .zip(current[COMP..].chunks_exact_mut(COMP))
                {
                    macro_rules! unroll {
                        ($pos:tt) => {
                            let paeth_res = paeth(max_recon_a[$pos], recon_b[$pos], max_recon_c[$pos]);
                            out_px[$pos] = (filt[$pos]).wrapping_add(paeth_res);
                            // setup for the following iteration
                            max_recon_c[$pos] = recon_b[$pos];
                            max_recon_a[$pos] = out_px[$pos];
                        };
                    }
                    unroll!(0);
                    unroll!(1);
                }
            }
            3 =>
            {
                #[cfg(all(feature = "sse", any(target_arch = "x86", target_arch = "x86_64")))]
                {
                    // use the sse capable one when possible
                    if is_x86_feature_detected!("sse4.1")
                    {
                        crate::filters::sse4::de_filter_paeth3_sse41(prev_row, raw, current);
                        return;
                    }
                }
                const COMP: usize = 3;
                let mut max_recon_a: [u8; COMP] = [0; COMP];
                let mut max_recon_c: [u8; COMP] = [0; COMP];

                if current.len() < COMP || raw.len() < COMP || prev_row.len() < COMP
                {
                    return;
                }
                // handle leftmost byte explicitly
                for i in 0..COMP
                {
                    let paeth_x = paeth(0, prev_row[i], 0);
                    current[i] = raw[i].wrapping_add(paeth_x);
                    max_recon_a[i] = current[i];
                    max_recon_c[i] = prev_row[i];
                }

                for ((filt, recon_b), out_px) in raw[COMP..]
                    .chunks_exact(COMP)
                    .zip(prev_row[COMP..].chunks_exact(COMP))
                    .zip(current[COMP..].chunks_exact_mut(COMP))
                {
                    macro_rules! unroll {
                        ($pos:tt) => {
                            let paeth_res = paeth(max_recon_a[$pos], recon_b[$pos], max_recon_c[$pos]);
                            out_px[$pos] = (filt[$pos]).wrapping_add(paeth_res);
                            // setup for the following iteration
                            max_recon_c[$pos] = recon_b[$pos];
                            max_recon_a[$pos] = out_px[$pos];
                        };
                    }
                    unroll!(0);
                    unroll!(1);
                    unroll!(2);
                }
            }
            4 =>
            {
                #[cfg(all(feature = "sse", any(target_arch = "x86", target_arch = "x86_64")))]
                {
                    // use the sse capable one when possible
                    if is_x86_feature_detected!("sse4.1")
                    {
                        crate::filters::sse4::de_filter_paeth4_sse41(prev_row, raw, current);
                        return;
                    }
                }
                const COMP: usize = 4;
                let mut max_recon_a: [u8; COMP] = [0; COMP];
                let mut max_recon_c: [u8; COMP] = [0; COMP];

                if current.len() < COMP || raw.len() < COMP || prev_row.len() < COMP
                {
                    return;
                }
                // handle leftmost byte explicitly
                for i in 0..COMP
                {
                    let paeth_x = paeth(0, prev_row[i], 0);
                    current[i] = raw[i].wrapping_add(paeth_x);
                    max_recon_a[i] = current[i];
                    max_recon_c[i] = prev_row[i];
                }

                for ((filt, recon_b), out_px) in raw[COMP..]
                    .chunks_exact(COMP)
                    .zip(prev_row[COMP..].chunks_exact(COMP))
                    .zip(current[COMP..].chunks_exact_mut(COMP))
                {
                    macro_rules! unroll {
                        ($pos:tt) => {
                            let paeth_res = paeth(max_recon_a[$pos], recon_b[$pos], max_recon_c[$pos]);
                            out_px[$pos] = (filt[$pos]).wrapping_add(paeth_res);
                            // setup for the following iteration
                            max_recon_c[$pos] = recon_b[$pos];
                            max_recon_a[$pos] = out_px[$pos];
                        };
                    }
                    unroll!(0);
                    unroll!(1);
                    unroll!(2);
                    unroll!(3);
                }
            }
            _ => unreachable!()
        }
    }

And on godbolt: https://godbolt.org/z/17sdE8W7j

That gives you high ILP for the inner loops, better than whatever vectorization would have done.
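The `paeth` helper that the scalar code calls is not part of the quoted lines. A minimal sketch of the predictor as the PNG specification defines it could look like this; it is an illustration, not necessarily the exact function in zune-png.

```rust
/// Paeth predictor: `a` is the byte to the left, `b` the byte above,
/// `c` the byte above-left. Returns whichever of the three is closest
/// to `a + b - c`, preferring `a`, then `b`, then `c` on ties.
#[inline]
pub fn paeth(a: u8, b: u8, c: u8) -> u8 {
    // Widen to i16 so the intermediate sum cannot overflow.
    let p = i16::from(a) + i16::from(b) - i16::from(c);
    let pa = (p - i16::from(a)).abs();
    let pb = (p - i16::from(b)).abs();
    let pc = (p - i16::from(c)).abs();

    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```

With `a` and `c` both zero the predictor returns `b`, which is why the leftmost pixel in the code above is handled as `paeth(0, prev_row[i], 0)`. Each call only reads its own channel's `a`, `b` and `c`, so the `unroll!` invocations for different channels have no data dependency on one another.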

But as for a reason to use SSE: well, decode speeds are now 2/3 of what image-rs does. Check out the updated benchmarks.

Shnatsel Dec 24, 2022
Author

I see, thanks a lot for the explanation!

It's a shame that portable SIMD is still nightly-only; it would be nice to get the same implementation for NEON etc.

There might be a way to get the compiler to elide the panics in the hot loop; I'll have to check if that's the case. I have written an entire article about that. I'll send you a link to the draft privately (and it should be published soon-ish).
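As an illustration of the kind of panic elision meant here (a general safe-Rust pattern, not taken from the article mentioned above), re-slicing every buffer to one shared length before the loop often lets the optimizer prove that the indexing is in bounds. The function name is hypothetical, and whether the checks actually disappear depends on the compiler version and should be confirmed in the generated assembly.

```rust
/// Adds `prev_row` to `current` byte by byte, using plain indexing.
pub fn add_rows_indexed(prev_row: &[u8], current: &mut [u8]) {
    // Re-slice both rows to a single shared length up front. These two
    // slicing operations are the only places that can panic, and they
    // sit outside the hot loop.
    let len = prev_row.len().min(current.len());
    let prev_row = &prev_row[..len];
    let current = &mut current[..len];

    for i in 0..len {
        // Both slices are known to hold exactly `len` elements, so the
        // bounds checks on `current[i]` and `prev_row[i]` can usually
        // be optimized away.
        current[i] = current[i].wrapping_add(prev_row[i]);
    }
}
```

The iterator style used in `handle_paeth` (`zip` plus `chunks_exact`) reaches the same goal by never indexing with a loop counter in the first place.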
