Port of stb_image optimized paeth unfiltering #539
I believe I can confirm the performance change on my machine.
Here kodim17 is indeed not using any Paeth filter line 🎉 |
I've changed the nightly version to just use autovectorized stable version. I'm not seeing a big boost from enabling |
I have the removal of nightly SIMD for RGBA ready to go in a branch, but I'm not including it in this PR to keep this one as simple and uncontroversial as possible, and hopefully get it merged quickly. |
In a microbenchmark on Apple Silicon speed of this filter varies depending on pixel size, and it's half the speed for some sizes. On x86 it's mostly faster, except 11% slower for 6-byte pixels. It's odd that there's such a big variance. |
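For context on why pixel size matters here: Paeth unfiltering reads the byte one pixel to the left, so the distance between dependent bytes is the bytes-per-pixel value, and that changes how well the loop vectorizes. The following is only a minimal sketch of a per-row Paeth unfilter with the pixel size as a const generic. The names `paeth_reference` and `unfilter_paeth_row` are made up for illustration and are not the crate's actual functions.

```rust
/// Paeth predictor written the way the PNG specification describes it.
/// (Hypothetical name, for illustration only.)
fn paeth_reference(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic; // initial estimate
    let pa = (p - ia).abs(); // distance to left
    let pb = (p - ib).abs(); // distance to above
    let pc = (p - ic).abs(); // distance to upper-left
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Undo the Paeth filter for one row, in place.
/// `BPP` is the pixel size in bytes; `prev` is the already unfiltered row above.
/// (Hypothetical helper, not the crate's API.)
fn unfilter_paeth_row<const BPP: usize>(prev: &[u8], cur: &mut [u8]) {
    assert_eq!(prev.len(), cur.len());
    for i in 0..cur.len() {
        // Bytes of the first pixel have no left neighbour; the spec treats them as 0.
        let (a, c) = if i >= BPP {
            (cur[i - BPP], prev[i - BPP])
        } else {
            (0, 0)
        };
        let b = prev[i];
        cur[i] = cur[i].wrapping_add(paeth_reference(a, b, c));
    }
}
```

Each output byte depends on the byte `BPP` positions earlier in the same row, which is one plausible reason the measured speed differs between, say, 3-byte and 6-byte pixels.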
On my Zen machine this gets us +5 MP/s on corpus-bench with a corpus of random PNG images scraped off the web. According to Kornel's benchmarks this only regresses the bpp=6 case on x86, and even that regression is not dramatic (11% on the filter alone). But on Apple ARM this significantly regresses the crucial bpp=3 case, and I'm not willing to accept that. How about we land this only for x86, and stick to the previous impl on other platforms? I'm not a fan of the duplication, but the code is trivial enough that it's only adding 10 lines or so. |
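A rough sketch of what the per-architecture split could look like, assuming the stb_image formulation of the predictor is selected on x86 and the spec-style one everywhere else. All function names below are hypothetical, and the exact `cfg` conditions used in the PR may differ.

```rust
/// Spec-style Paeth predictor (hypothetical name).
fn paeth_reference(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic;
    let (pa, pb, pc) = ((p - ia).abs(), (p - ib).abs(), (p - ic).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Formulation adapted from the public-domain stb_image decoder (hypothetical name).
/// It is written as two selects on a shared threshold instead of nested comparisons
/// of three distances.
fn paeth_stbi(a: u8, b: u8, c: u8) -> u8 {
    let thresh = i16::from(c) * 3 - (i16::from(a) + i16::from(b));
    let lo = a.min(b);
    let hi = a.max(b);
    let t0 = if i16::from(hi) <= thresh { lo } else { c };
    if thresh <= i16::from(lo) {
        hi
    } else {
        t0
    }
}

/// On x86 targets, use the stb_image formulation.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    paeth_stbi(a, b, c)
}

/// On every other target, keep the spec-style implementation.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    paeth_reference(a, b, c)
}
```

Because both formulations are meant to return the same value for every input, the choice only affects speed, and the duplication stays limited to the two small `cfg`-gated wrappers.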
Other decoders have different (manual SIMD) implementations for different architectures. It would be nice if we could find a single autovectorized/portable-simd implementation that got ideal performance on all architectures. But I'm not shocked that isn't the case. Using different implementations on x86 and ARM seems fine to me. Also, it probably shouldn't be in this PR, but filter.rs has gotten large enough that we should probably think about spinning it off into multiple files. |
@kornelski could you benchmark this PR on ARM with the That would be I am particularly curious about bpp=4 and 8. These benchmarks will tell me whether I need to bring back the portable SIMD version, or if autovectorization does a good job for those cases already. I can't do it myself because I don't have any Apple hardware, renting it in the cloud for a short time is impossible due to Apple's licensing shenanigans, and ARM cloud instances have different performance characteristics. Edit: wait nevermind I didn't push the changes I need benches on yet, I'm losing my mind |
The code isn't pretty but it's ready for benchmarking on ARM with |
Apple M3 Max: unstable pr vs stable main
unstable pr vs unstable main
|
Those benchmarks tell me two important things:
Thank you! |
I believe this is now ready to go, and performance is as good as I'm going to get it in this PR. On top of the x86-specific optimization for Paeth, this also fixes the issue of Now that we've added architecture-specific code, I've also made tests run on Apple M1 (ARM) on CI in addition to x86_64 Linux so that non-x86 bugs would get caught on CI.
|
…e configurations in conditional compilation
Yeah, I am no longer sure if |
After this PR, enabling
These cases are rather uncommon and also already fast, so we might want to consider removing them to reduce the maintenance burden. |
@anforowicz thanks for the info! I see that the Chromium merge window is open from today until the 18th of December. When do we need to publish new releases of |
The Dec 3rd to Dec 18th window is for starting and/or reconfiguring field trials, but we should have more time for making code changes (such as absorbing new |
LGTM, if you're satisfied with the test coverage across architectures then feel free to merge. It's an interesting data point against std::simd optimality indeed.
10% end-to-end performance gain on a Paeth-filtered image I concocted with convert -quality 49 in.png out.png, which is comparatively big and Paeth images are rather common.
Godbolt shows much shorter assembly that gets autovectorized even in the bpp=3 case: https://godbolt.org/z/fq3EjvT4b
TODO:
Also done:
unstable feature dramatically regressing performance on ARM
unstable feature compiles