Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up JPEG Decoder color conversion #1121

Closed
1 of 2 tasks
antonfirsov opened this issue Feb 22, 2020 · 14 comments · Fixed by #1773
Closed
1 of 2 tasks

Speed up JPEG Decoder color conversion #1121

antonfirsov opened this issue Feb 22, 2020 · 14 comments · Fixed by #1773

Comments

@antonfirsov
Copy link
Member

antonfirsov commented Feb 22, 2020

Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!

As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.

/cc @Sergio0694 @saucecontrol

Current pipeline

Summary of steps currently done by ConvertColorsInto:

[D]: Data representation
(T): Bulk transformation between data representations

(case a) Y+Cb+Cr planes --> Single Rgba32 buffer

[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case case, the input buffer could be handled as homogenous float buffer, where all individual float values should be converted to byte-s. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
[D] The result image as an Rgba32 buffer

(case b) Y+Cb+Cr planes --> Single Rgb24 buffer

[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer, utilizing BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrow operations through Vector<T>
[D] Temporary Rgba32 buffer
(T) PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!)
[D] The result image as an Rgb24 buffer

Optimized pipeline

(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer

D1 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert, the 3 planes, and write them back to the originating buffers
D2 3 Planes of Buffer2D<float>, R+G+B)
(T) Narrow the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows
D3 3 Planes of Buffer2D<byte>, R+G+B
(T) PACK the separate image planes (color channels) into a single Rgb24 buffer
D4 The result image as an Rgb24 buffer

(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer

D1 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) All the steps from the default Rgb24 case
D4 Memory<Rgb24>
(T) Convert the Rgb24 buffer to TPixel buffer using PixelOperations<T>
D5 The result image as an TPixel buffer

The magic is mostly in the D3->D4 transition, because of the fact that we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that if we decode to Image<Rgb24> (case b) we can omit an unnecessary step.

API proposal for packing

The best thing is that we can handle this big task incrementally:

  • First, extend PixelOperations<T> by new packing operations
  • Then, adapt the changes in JpegImagePostProcessor as described in the Optimized pipeline paragraph

The packing API is pretty straightforward:

public class PixelOperations<TPixel>
{
    // ...
    
    public void PackFromRgbPlanes(
           Configuration configuration,
	   ReadOnlySpan<byte> redChannel, 
	   ReadOnlySpan<byte> greenChannel, 
	   ReadOnlySpan<byte> blueChannel,
	   Span<TPixel> destination);
}

We can define a default implementations in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. Optional hardcore task is to T4 a SIMD implementation it for all the RGB(A)-like formats.

Note

It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task both implementation and architecture-wise, and prefer incremental evolution instead.

@antonfirsov
Copy link
Member Author

antonfirsov commented Feb 22, 2020

@brianpopow is packing useful for other codecs?

@saucecontrol
Copy link
Contributor

saucecontrol commented Feb 22, 2020

You can get a nice easy win just by using the intrinsics for your float->byte conversions, because the float->int instructions include rounding, and the packing instructions are saturating so you can skip the clamping step. That's as simple as this:

https://github.com/saucecontrol/PhotoSauce/blob/a9bd6e5162d2160419f0cf743fd4f536c079170b/src/MagicScaler/Magic/Processors/ConvertersFloat.cs#L451-L508

You can speed up Y'CbCr->RGB conversion with intrinsics as well, but I was only able to make the SIMD version faster than scalar with AVX. Any matrix multiplication can be accelerated with DPPS (dot product), but since you really only need 4 of the coefficients from the Y'CbCr->RGB matrix, it doesn't work out any faster unless you can operate on the wider vector.

@Sergio0694
Copy link
Member

Unrelated: @saucecontrol your repo is seriously impressive, incredible work there! 👏

I have a couple questions:

  1. What version of ImageSharp did you use to make those benchmarks? I'm particularly impressed by the very high memory usage it has compared to other libs, especially in the parallel benchmark (207x compared to System.Drawing? Am I missing something here? 😶).
  2. If you have time and are interested, I'd be curious to see your updated results for the benchmark where ImageSharp fails with the OOM error - @antonfirsov's recent PR should've fixed that.

@saucecontrol
Copy link
Contributor

Thanks! I did those benchmarks with the current dev at the time (1.0.0-dev003181). I did a quick run yesterday with 1.0.0-unstable0806, and GC allocations were way down but speed was a bit worse. I see @JimBobSquarePants has a new PR that improves perf. Hopefully that gets merged this weekend? I plan on re-running the whole suite since I've got a new MagicScaler version releasing today. Would be good to have all the ImageSharp improvements in for that.

@saucecontrol
Copy link
Contributor

207x compared to System.Drawing? Am I missing something here?

Those numbers will always be misleading because System.Drawing is just a thin wrapper around native code (GDI+). That's why I threw in a more comprehensive test that also measured unmanaged memory allocations.

@antonfirsov
Copy link
Member Author

@saucecontrol thanks for the suggestions! Introducing a System.Runtime.InteropServices AVX2 path for our float -> byte conversion is definitely a long hanging fruit. If we could base our implementation on your code it would be super easy.

For Y'CbCr -> RGB, we already have a SIMD path that compiles to AVX2 under the hood, and it outperforms our previous scalar approaches a lot. It's the slow scalar packing that hurts the most at the moment. It will be way much better to do the float -> byte conversion first, and pack 32 bytes at a single step with AVX2.

@saucecontrol
Copy link
Contributor

saucecontrol commented Feb 23, 2020

You're certainly welcome to borrow any of my code you find useful. My Y'CbCr conversion is here:

https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L148-L236

You'll notice the scalar float version reads/writes directly to/from the buffers rather than going through a Vector4. Setting the individual fields of Vector4 is always slow because it means a minimum of three shuffle ops, and Intel processors only have a single port that can service those shuffles. It ends up being faster to do the memory read and write since there are more ports to service those and they'll be in cache anyway.

With AVX, it's possible to overcome the extra shuffle overhead by using a DotProduct that operates on two pixels at once, but even that is only profitable if you have access to the full AVX instruction set. I couldn't outdo the scalar code with System.Numerics.

Just realized in looking at the code again, I was doing an extra round of shuffles just to be able to use DotProduct when it's only a gain on the Green output. And with FMA, it's not even a gain there. I simplified the code, and added an SSE version which ends up just over twice as fast as scalar. I haven't had time to draw out ascii diagrams of how that code works, but I'm happy to answer questions if you have any.

I see another quick win possibility in your SIMD Y'CbCr->RGB path if you just hoist those constant Vector<float> values that hold the coefficients out of the loop. RyuJIT has never been smart enough to do that for you, and each of those requires a load/broadcast instruction pair.

@brianpopow
Copy link
Collaborator

@brianpopow is packing useful for other codecs?

WebP lossy format is similar to JPEG as in the image is also represented with YUV and split up into macroblocks. The difference is, that the DCT coefficients are represented as byte rather than float. All the decoding is done with integer operations.

For the other formats we currently have, i also can not see this applied. Unfortunately this optimization looks rather specific to JPEG at the moment.

@john-h-k
Copy link

I'll give this a shot. A couple of questions:

  • what is the default behaviour?
  • For RGB32, does it just assume alpha value of 1?

@JimBobSquarePants
Copy link
Member

Awesome! Not sure I understand what you mean by the first question but Rgba32 would use an alpha of 255.

@john-h-k
Copy link

We can define a default implementations in the base PixelOperations class, and specialize it for Rgba32 and Rgb24

@JimBobSquarePants i get the 32/24 behaviour but not sure what it means by default impl

Rgba32 would use an alpha of 255.
Silly me, was thinking of floating point max alpha 😁

@JimBobSquarePants
Copy link
Member

Ah right.... So we have a "default" PixelOperations class which we then override with implementations specific to individual pixel formats.

The default version will use vectors and the various members of IPixel interface
https://github.com/SixLabors/ImageSharp/tree/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats

Whereas the explicit versions contain optimizations that take advantage of the size and layout of the individual pixel formats.
https://github.com/SixLabors/ImageSharp/tree/master/src/ImageSharp/PixelFormats/PixelImplementations/Generated

@antonfirsov
Copy link
Member Author

In other words: the default implementation contains generic code, that utilizes the IPixel<T> interface thanks to the constraints:
https://github.com/SixLabors/ImageSharp/blob/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats/PixelOperations%7BTPixel%7D.Generated.cs

However instead of extending that interface (more codebloat for low benefits), I suggest to convert TPixel from Rgb24.

@br3aker
Copy link
Contributor

br3aker commented Jul 15, 2021

Considering implementing this. One question: how was (and is) it possible to implement this without immediate cast from float to byte?

Right now we have support for:
YCbCr (3 channels) - rgb compatible
YccK (4 channels) - rgb compatible
Cmyk (4 channels) - rgb compatible
Grayscale (1 channel) - not rgb compatible
Rgb (3 channels) - rgb compatible

We either need to allocate 2 extra buffers for the grayscale or we can cast to byte immediately after colorspace -> rgb conversion. Second option would yield 'pixels' with size of 3 bytes which ideally fit to the single color buffer of float(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants