Hi, those are for compression, not decompression, but let me give my thoughts on optimizing de-filtering operations.

I was working on the filters just yesterday, and I was hoping the scalar versions would work better than SSE or AVX. Here is the 100% safe Rust Paeth from image-rs, https://godbolt.org/z/s7f1js854, which does get vectorized, I do agree. Here is the SSE version used here, in spng, and in libpng: https://godbolt.org/z/zqqYzMK1T. It's not that hard to know which is faster.

The essential thing about filters and how to make them faster is their inherent dependency chains. Sub, Paeth and Avg have a dependency chain on previous pixels in the same row. For those, you cannot vectorize across pixels: you need the previous pixel to finish before you can start the next one.
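To make the dependency chain concrete, here is a minimal sketch (not code from zune-png or image-rs; the function names are made up for illustration). The Up filter is included only as a contrast, since per the PNG spec it reads just the row above. Both functions assume `prev_row` and `row` have the same length.

```rust
/// Up: every byte depends only on the byte directly above it, so all
/// bytes of the row are independent and the loop vectorizes trivially.
fn unfilter_up(prev_row: &[u8], row: &mut [u8]) {
    for (cur, &above) in row.iter_mut().zip(prev_row) {
        *cur = cur.wrapping_add(above);
    }
}

/// Sub: every byte depends on the already reconstructed byte `bpp`
/// positions to its left, so pixel N cannot start before pixel N-1
/// is done. That is the dependency chain.
fn unfilter_sub(row: &mut [u8], bpp: usize) {
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}
```

In `unfilter_sub` the only parallelism left is across the `bpp` bytes inside one pixel, which is why the discussion about scalar vs. SIMD matters for Sub, Avg and Paeth.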
Tell the compiler exactly how things will look. Take for example the image-rs Paeth unfilter, which is vectorized at https://godbolt.org/z/T7q3nxaoe, but it's useless work. The compiler assumes bpp will be huge enough, i.e. something like a vector width, and generates SSE for that. The creators know that bpp cannot go above 4, since we can have a maximum of 4 channels (r, g, b, a), but the compiler doesn't. So no matter how cute and fast the code may seem, every SSE path is a cold path that won't be hit today, or ever.

To get fast scalar filters, you do manual unrolling. A pixel may depend on the previous ones, but the channels within a pixel do not depend on each other, i.e. calculating Paeth for red won't depend on calculating it for green, and the same applies for blue. So you branch depending on the number of channels, explicitly creating a branch for each channel count and optimizing carefully so that the compiler doesn't insert panics in the hot code. See my scalar version, zune-image/zune-png/src/filters.rs, lines 296 to 466 in d8c975c, and Godbolt: https://godbolt.org/z/17sdE8W7j

That gives you high ILP for the inner loops, better than whatever vectorization would have done.

But a reason to use SSE, well, now decode speeds are 2/3 of what image-rs does; check out the updated benchmarks.
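A rough sketch of the "branch per channel count" idea, assuming `prev_row` and `row` have equal length and that length is a multiple of bpp. This is not the zune-png code linked above. The names `unfilter_paeth_fixed` and `unfilter_paeth_generic` are hypothetical, and const generics stand in for hand-written per-bpp branches. Whether the compiler really drops all bounds checks should be confirmed on Godbolt.

```rust
/// Paeth predictor as defined in the PNG specification.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic;
    let (pa, pb, pc) = ((p - ia).abs(), (p - ib).abs(), (p - ic).abs());
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

/// Paeth unfilter for exactly N bytes per pixel. N is a compile-time
/// constant, so the inner loop can be fully unrolled into N independent
/// chains, one per channel.
fn unfilter_paeth_fixed<const N: usize>(prev_row: &[u8], row: &mut [u8]) {
    let mut left = [0u8; N];       // reconstructed pixel to the left
    let mut upper_left = [0u8; N]; // pixel above `left`
    for (cur, above) in row.chunks_exact_mut(N).zip(prev_row.chunks_exact(N)) {
        for i in 0..N {
            cur[i] = cur[i].wrapping_add(paeth(left[i], above[i], upper_left[i]));
            left[i] = cur[i];
            upper_left[i] = above[i];
        }
    }
}

/// Fallback with a runtime bpp (the shape the compiler has to guess about).
fn unfilter_paeth_generic(prev_row: &[u8], row: &mut [u8], bpp: usize) {
    for i in 0..row.len() {
        let (a, c) = if i >= bpp { (row[i - bpp], prev_row[i - bpp]) } else { (0, 0) };
        row[i] = row[i].wrapping_add(paeth(a, prev_row[i], c));
    }
}

/// Dispatch once per row, so the hot loops see a constant bpp.
fn unfilter_paeth(prev_row: &[u8], row: &mut [u8], bpp: usize) {
    match bpp {
        1 => unfilter_paeth_fixed::<1>(prev_row, row),
        2 => unfilter_paeth_fixed::<2>(prev_row, row),
        3 => unfilter_paeth_fixed::<3>(prev_row, row),
        4 => unfilter_paeth_fixed::<4>(prev_row, row),
        _ => unfilter_paeth_generic(prev_row, row, bpp),
    }
}
```

The branch on bpp is paid once per row instead of the compiler having to plan for an arbitrary bpp inside the loop.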
image-rs/image-png#363 has landed 100% safe Rust filters that benefit from autovectorization, without using any unsafe code.
The average filter was optimized into vector instructions earlier, also without unsafe: image-rs/image-png#198
It would be nice to investigate if that approach can replace some or all of the explicit `unsafe` used in `zune-png`.