Vectorized Paeth filtering (multiple pixels at the same time) #157

IJzerbaard · 2024-12-17T10:37:23Z

Most PNG decoders (including the one in Wuffs from what I can see) use a pixel-by-pixel approach to filtering, and may use SIMD for parallelism across the channels of a pixel. However, it is possible to use SIMD to apply the Paeth filter to multiple pixels at once. Not 4 sequential pixels, which would be the neatest case for SIMD but is prevented by dependencies between the pixels, but 4 anti-diagonal pixels. 4 anti-diagonal pixels can be collected together in a vector relatively efficiently by loading 4 rows of pixels, staggered so that the first row has an x-offset of 3 pixels compared to the last row, then transpose those 4 rows (similar to _MM_TRANSPOSE4_PS, but with integer vectors). That produces 4 of those sets of 4 anti-diagonal pixels for the price of 8 shuffles, and after applying the filter another 8 shuffles are needed to un-transpose them (some additional shuffles are needed to update the "top" aka B vector between columns). An anti-diagonal group of pixels does not have dependencies between the pixels in the same group (as a horizontal group of pixels would) so that kind of group can be filtered at the same time using SIMD. All the shuffling is not free but in my tests it was well worth doing.

Some diagrams:

Filtering 4 or 8 rows at the same time (filtering 8 rows at once exposes more ILP) does not fit very naturally onto the "possibly different filter for each row" nature of the PNG format, but it is still possible in cases where the Paeth filter is used for 4 or 8 adjacent rows. The Avg filter can be implemented in a similar manner.

If the Wuffs project is interested in this approach, I'm willing to help integrate it into Wuffs, but I'm really not familiar with the language.

nigeltao · 2024-12-17T22:43:38Z

That's an interesting idea. Do you have any numbers on how often "the Paeth filter is used for 4 or 8 adjacent rows" is true?

All the shuffling is not free but in my tests it was well worth doing.

Are you able to link to or share your test code?

IJzerbaard · 2024-12-18T06:30:44Z

Here's some test code just testing the raw filter, assuming that all rows use Paeth: https://gitlab.com/-/snippets/4781267

I don't necessarily have good (representative) numbers on how often it happens that 4 or 8 adjacent rows use the Paeth filter, but across some files from the qoi benchmarking dataset, I have seen some distribution that in many files it does not happen at all, but then in the remaining files it happens a significant number of times.

Here are some stats by qoibench, stbi2 is a modified version of stb_image where I applied this filtering technique:

## Benchmarking images/icon_512/*.png -- 4 runs

## Total for images/icon_512
          decode us   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      1888.1         0.0        138.84          0.00         0    0.0%
stbi:        1118.0         0.0        234.48          0.00         0    0.0%
stbi2:        857.1         0.0        305.86          0.00         0    0.0%
qoi:          478.2         0.0        548.16          0.00         0    0.0%

## Benchmarking images/screenshot_game/*.png -- 4 runs

## Total for images/screenshot_game
          decode us   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      7140.4         0.0         88.65          0.00         0    0.0%
stbi:        6250.4         0.0        101.28          0.00         0    0.0%
stbi2:       5380.9         0.0        117.64          0.00         0    0.0%
qoi:         2223.8         0.0        284.66          0.00         0    0.0%

Here's a histogram of in how many files, a certain percentage of rows were covered by the 4-at-a-time technique, from 1% to 100% but without the bar for 0% which would ruin the scale of the y axis

nigeltao · 2024-12-18T06:42:58Z

Here are some stats by qoibench, stbi2 is a modified version of stb_image where I applied this filtering technique:

OK, stbi2 is faster than stb_image, but stb_image does not do SIMD Paeth filtering, right? (It does have some SIMD code, for JPEG decoding, but not PNG.)

Wuffs already does SIMD Paeth filtering (example), even if it's only 1 row at a time and not your 4-at-a-time technique. Do you have a sense of how much faster it would be (compared to a SIMD-using baseline) to adopt your algorithm?

IJzerbaard · 2024-12-18T06:55:53Z

The 4-rows-at-a-time technique also wins in the pure filter-vs-filter (which are both SIMD filters but one is the classic pixel-by-pixel one and the other is the 4-rows-at-a-time one) test by up to 3x on my PC, but I don't have an "integrated" test that compares them in-context

kobalicek · 2024-12-27T21:10:24Z

I've got the same idea as @IJzerbaard about a week ago when improving PNG decoding performance in Blend2D. I think it's a great idea to use multiple rows to filter Paeth, but I think the most important is to optimize the DEFLATE decompressor first, because it consumes the most cycles, at least according to perf.

Here are my own benchmarks that compare decoding of libpng, wuffs, spng, stbimage, and blend2d (these are from my local branch, still working on improvements). I'm using Ryzen 79950X.

[libpng  ] png/harvesters.png      : Decode  9.808 [ms]
[wuffs   ] png/harvesters.png      : Decode  8.152 [ms]
[spng    ] png/harvesters.png      : Decode  8.976 [ms]
[stbimage] png/harvesters.png      : Decode 12.425 [ms]
[blend2d ] png/harvesters.png      : Decode  5.086 [ms]

[libpng  ] png/hibiscus.primitive.png: Decode  0.822 [ms]
[wuffs   ] png/hibiscus.primitive.png: Decode  0.687 [ms]
[spng    ] png/hibiscus.primitive.png: Decode  0.652 [ms]
[stbimage] png/hibiscus.primitive.png: Decode  0.695 [ms]
[blend2d ] png/hibiscus.primitive.png: Decode  0.237 [ms]

[libpng  ] png/hibiscus.regular.png: Decode  1.652 [ms]
[wuffs   ] png/hibiscus.regular.png: Decode  1.212 [ms]
[spng    ] png/hibiscus.regular.png: Decode  1.513 [ms]
[stbimage] png/hibiscus.regular.png: Decode  1.645 [ms]
[blend2d ] png/hibiscus.regular.png: Decode  0.685 [ms]

[libpng  ] png/medium_rgb8.png     : Decode 14.534 [ms]
[wuffs   ] png/medium_rgb8.png     : Decode 11.486 [ms]
[spng    ] png/medium_rgb8.png     : Decode 13.054 [ms]
[stbimage] png/medium_rgb8.png     : Decode 16.945 [ms]
[blend2d ] png/medium_rgb8.png     : Decode  6.080 [ms]

[libpng  ] png/medium_rgba8.png    : Decode 19.812 [ms]
[wuffs   ] png/medium_rgba8.png    : Decode 20.042 [ms]
[spng    ] png/medium_rgba8.png    : Decode 18.333 [ms]
[stbimage] png/medium_rgba8.png    : Decode 22.638 [ms]
[blend2d ] png/medium_rgba8.png    : Decode 10.369 [ms]

[libpng  ] png/large_rgb8.png      : Decode 93.763 [ms]
[wuffs   ] png/large_rgb8.png      : Decode 75.457 [ms]
[spng    ] png/large_rgb8.png      : Decode 84.419 [ms]
[stbimage] png/large_rgb8.png      : Decode 96.506 [ms]
[blend2d ] png/large_rgb8.png      : Decode 42.086 [ms]

[libpng  ] png/large_rgba8.png     : Decode 112.122 [ms]
[wuffs   ] png/large_rgba8.png     : Decode 110.520 [ms]
[spng    ] png/large_rgba8.png     : Decode 100.714 [ms]
[stbimage] png/large_rgba8.png     : Decode 133.514 [ms]
[blend2d ] png/large_rgba8.png     : Decode 53.814 [ms]

[libpng  ] png/large_palette.png   : Decode  9.882 [ms]
[wuffs   ] png/large_palette.png   : Decode  4.477 [ms]
[spng    ] png/large_palette.png   : Decode  8.458 [ms]
[stbimage] png/large_palette.png   : Decode  5.255 [ms]
[blend2d ] png/large_palette.png   : Decode  2.974 [ms]

[libpng  ] qoi/dice.png            : Decode  3.043 [ms]
[wuffs   ] qoi/dice.png            : Decode  2.677 [ms]
[spng    ] qoi/dice.png            : Decode  2.527 [ms]
[stbimage] qoi/dice.png            : Decode  3.094 [ms]
[blend2d ] qoi/dice.png            : Decode  1.269 [ms]

[libpng  ] qoi/edgecase.png        : Decode  0.083 [ms]
[wuffs   ] qoi/edgecase.png        : Decode  0.016 [ms]
[spng    ] qoi/edgecase.png        : Decode  0.068 [ms]
[stbimage] qoi/edgecase.png        : Decode  0.022 [ms]
[blend2d ] qoi/edgecase.png        : Decode  0.017 [ms]

[libpng  ] qoi/kodim10.png         : Decode  4.201 [ms]
[wuffs   ] qoi/kodim10.png         : Decode  2.785 [ms]
[spng    ] qoi/kodim10.png         : Decode  3.693 [ms]
[stbimage] qoi/kodim10.png         : Decode  3.921 [ms]
[blend2d ] qoi/kodim10.png         : Decode  1.742 [ms]

[libpng  ] qoi/kodim23.png         : Decode  3.782 [ms]
[wuffs   ] qoi/kodim23.png         : Decode  2.588 [ms]
[spng    ] qoi/kodim23.png         : Decode  3.387 [ms]
[stbimage] qoi/kodim23.png         : Decode  3.675 [ms]
[blend2d ] qoi/kodim23.png         : Decode  1.813 [ms]

[libpng  ] qoi/qoi_logo.png        : Decode  0.332 [ms]
[wuffs   ] qoi/qoi_logo.png        : Decode  0.306 [ms]
[spng    ] qoi/qoi_logo.png        : Decode  0.199 [ms]
[stbimage] qoi/qoi_logo.png        : Decode  0.144 [ms]
[blend2d ] qoi/qoi_logo.png        : Decode  0.073 [ms]

[libpng  ] qoi/testcard.png        : Decode  0.194 [ms]
[wuffs   ] qoi/testcard.png        : Decode  0.217 [ms]
[spng    ] qoi/testcard.png        : Decode  0.106 [ms]
[stbimage] qoi/testcard.png        : Decode  0.119 [ms]
[blend2d ] qoi/testcard.png        : Decode  0.061 [ms]

[libpng  ] qoi/testcard_rgba.png   : Decode  0.207 [ms]
[wuffs   ] qoi/testcard_rgba.png   : Decode  0.222 [ms]
[spng    ] qoi/testcard_rgba.png   : Decode  0.143 [ms]
[stbimage] qoi/testcard_rgba.png   : Decode  0.130 [ms]
[blend2d ] qoi/testcard_rgba.png   : Decode  0.067 [ms]

[libpng  ] qoi/wikipedia_008.png   : Decode  9.481 [ms]
[wuffs   ] qoi/wikipedia_008.png   : Decode  7.479 [ms]
[spng    ] qoi/wikipedia_008.png   : Decode  8.107 [ms]
[stbimage] qoi/wikipedia_008.png   : Decode  9.678 [ms]
[blend2d ] qoi/wikipedia_008.png   : Decode  2.778 [ms]

These use testing images from QOI project and some other testing images for PNG I found people use to benchmark PNG decoders. The first images are especially interesting as they contain dozens of dynamic Huffman tables so the decompressor can bottleneck on Huffman table building if fast-bits is too high. My aim is to match libdeflate in DEFLATE performance - then I think further filter optimizations make sense.

For comparison, here is a comparison with QOI:

[qoi_ref  ] qoi/dice.qoi            : Decode  0.817 [ms] | Encode  1.246 [ms]
[magicqoi ] qoi/dice.qoi            : Decode  0.607 [ms] | Encode  0.743 [ms]
[wuffs 0.4] qoi/dice.qoi            : Decode  1.802 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/dice.qoi            : Decode  0.538 [ms] | Encode  0.635 [ms]

[qoi_ref  ] qoi/edgecase.qoi        : Decode  0.017 [ms] | Encode  0.032 [ms]
[magicqoi ] qoi/edgecase.qoi        : Decode  0.006 [ms] | Encode  0.008 [ms]
[wuffs 0.4] qoi/edgecase.qoi        : Decode  0.049 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/edgecase.qoi        : Decode  0.003 [ms] | Encode  0.007 [ms]

[qoi_ref  ] qoi/kodim10.qoi         : Decode  2.020 [ms] | Encode  2.422 [ms]
[magicqoi ] qoi/kodim10.qoi         : Decode  1.932 [ms] | Encode  2.641 [ms]
[wuffs 0.4] qoi/kodim10.qoi         : Decode  1.786 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/kodim10.qoi         : Decode  1.079 [ms] | Encode  1.502 [ms]

[qoi_ref  ] qoi/kodim23.qoi         : Decode  1.977 [ms] | Encode  2.505 [ms]
[magicqoi ] qoi/kodim23.qoi         : Decode  1.932 [ms] | Encode  2.731 [ms]
[wuffs 0.4] qoi/kodim23.qoi         : Decode  1.756 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/kodim23.qoi         : Decode  1.066 [ms] | Encode  1.488 [ms]

[qoi_ref  ] qoi/qoi_logo.qoi        : Decode  0.084 [ms] | Encode  0.202 [ms]
[magicqoi ] qoi/qoi_logo.qoi        : Decode  0.022 [ms] | Encode  0.042 [ms]
[wuffs 0.4] qoi/qoi_logo.qoi        : Decode  0.292 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/qoi_logo.qoi        : Decode  0.014 [ms] | Encode  0.045 [ms]

[qoi_ref  ] qoi/testcard.qoi        : Decode  0.082 [ms] | Encode  0.137 [ms]
[magicqoi ] qoi/testcard.qoi        : Decode  0.051 [ms] | Encode  0.069 [ms]
[wuffs 0.4] qoi/testcard.qoi        : Decode  0.209 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/testcard.qoi        : Decode  0.031 [ms] | Encode  0.053 [ms]

[qoi_ref  ] qoi/testcard_rgba.qoi   : Decode  0.082 [ms] | Encode  0.145 [ms]
[magicqoi ] qoi/testcard_rgba.qoi   : Decode  0.049 [ms] | Encode  0.069 [ms]
[wuffs 0.4] qoi/testcard_rgba.qoi   : Decode  0.209 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/testcard_rgba.qoi   : Decode  0.031 [ms] | Encode  0.054 [ms]

[qoi_ref  ] qoi/wikipedia_008.qoi   : Decode  5.684 [ms] | Encode  5.964 [ms]
[magicqoi ] qoi/wikipedia_008.qoi   : Decode  5.185 [ms] | Encode  6.623 [ms]
[wuffs 0.4] qoi/wikipedia_008.qoi   : Decode  4.720 [ms] | Encode  0.000 [ms]
[blend2d  ] qoi/wikipedia_008.qoi   : Decode  2.726 [ms] | Encode  4.446 [ms]

The article I wrote here is already outdated with numbers (the perf of blend2d improved since then):

https://blend2d.com/blog/qoi-image-codec.html

In general when both PNG and QOI decoders are optimized to the bone the average decoding time is not that far. Both PNG and QOI are annoyingly scalar when it comes to decoding (impossible to do more work ahead of time as everything depends on past work).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorized Paeth filtering (multiple pixels at the same time) #157

Vectorized Paeth filtering (multiple pixels at the same time) #157

IJzerbaard commented Dec 17, 2024 •

edited

Loading

nigeltao commented Dec 17, 2024

IJzerbaard commented Dec 18, 2024

nigeltao commented Dec 18, 2024

IJzerbaard commented Dec 18, 2024

kobalicek commented Dec 27, 2024

Vectorized Paeth filtering (multiple pixels at the same time) #157

Vectorized Paeth filtering (multiple pixels at the same time) #157

Comments

IJzerbaard commented Dec 17, 2024 • edited Loading

nigeltao commented Dec 17, 2024

IJzerbaard commented Dec 18, 2024

nigeltao commented Dec 18, 2024

IJzerbaard commented Dec 18, 2024

kobalicek commented Dec 27, 2024

IJzerbaard commented Dec 17, 2024 •

edited

Loading