Component shuffling for RGB(A)-like formats with SIMD Intrinsics #1354

Closed
5 tasks done
antonfirsov opened this issue Sep 16, 2020 · 12 comments
@antonfirsov
Member

antonfirsov commented Sep 16, 2020

We need cheap bulk-conversion between the following formats:
Argb32, Bgra32, Rgba32, Bgr24, Rgb24

We are especially interested in Rgb24, Bgr24 <=> Rgba32. I'm having sweet dreams about a community PR dealing with this problem. It should be easy for anyone with basic knowledge of System.Runtime.Intrinsics.

The implementation should be added as new span-based methods in PixelConverter.cs, which could then be invoked from T4-generated PixelOperations<TPixel> implementors like here:
https://github.com/SixLabors/ImageSharp/blob/78a584e8482b052d7a9885682299e2f37518d83d/src/ImageSharp/PixelFormats/PixelImplementations/Generated/Rgb24.PixelOperations.Generated.cs

If a PR would only add the PixelConverter helpers + tests, I'm happy to provide guidance or even finish the code for the rest of the work.

@john-h-k @Sergio0694 any chance you are interested?

Tasks:

  • Add 4=>4 channel shuffling methods to SimdUtils
  • Add 3=>4 channel shuffling methods to SimdUtils
  • Add 4=>3 channel shuffling methods to SimdUtils
  • Update Rgba32 compatible pixel operations to utilize new shuffle methods.
  • Update Rgb24 compatible pixel operations to utilize new shuffle methods.
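
To make the three shuffle shapes in the task list concrete, here is a minimal scalar reference in Python. It is not ImageSharp code and not SIMD. The function names `shuffle4`, `pad3to4`, and `slice4to3` are hypothetical, and the byte orders assume each format stores its channels in the order its name spells out (for example Rgba32 = R, G, B, A in memory). A vectorized implementation would be expected to produce the same bytes as these per-pixel loops.

```python
def shuffle4(src: bytes, order=(0, 1, 2, 3)) -> bytes:
    """4 => 4: reorder the channels inside every 4-byte pixel."""
    if len(src) % 4:
        raise ValueError("source length must be a multiple of 4")
    out = bytearray(len(src))
    for p in range(len(src) // 4):
        for j, k in enumerate(order):
            out[4 * p + j] = src[4 * p + k]
    return bytes(out)


def pad3to4(src: bytes, order=(0, 1, 2, 3), fill: int = 0xFF) -> bytes:
    """3 => 4: widen every 3-byte pixel; index 3 in `order` selects `fill`."""
    if len(src) % 3:
        raise ValueError("source length must be a multiple of 3")
    out = bytearray(len(src) // 3 * 4)
    for p in range(len(src) // 3):
        px = src[3 * p:3 * p + 3] + bytes([fill])  # temporary 4-channel pixel
        for j, k in enumerate(order):
            out[4 * p + j] = px[k]
    return bytes(out)


def slice4to3(src: bytes, order=(0, 1, 2)) -> bytes:
    """4 => 3: keep three of the four channels of every pixel."""
    if len(src) % 4:
        raise ValueError("source length must be a multiple of 4")
    out = bytearray(len(src) // 4 * 3)
    for p in range(len(src) // 4):
        for j, k in enumerate(order):
            out[3 * p + j] = src[4 * p + k]
    return bytes(out)


# Assumed layouts: Rgba32 -> Bgra32 swaps bytes 0 and 2 and keeps alpha.
rgba = bytes([1, 2, 3, 4, 5, 6, 7, 8])
bgra = shuffle4(rgba, (2, 1, 0, 3))
bgr = slice4to3(rgba, (2, 1, 0))
```

Under these assumptions, a single `order` tuple covers every pair in the list above, which is why one shuffle helper per shape (4=>4, 3=>4, 4=>3) is enough.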
@antonfirsov antonfirsov added this to the 1.1.0 milestone Sep 16, 2020
@antonfirsov antonfirsov changed the title Component shuffling for RGB(A) like formats with SIMD Intrinsics Component shuffling for RGB(A)-like formats with SIMD Intrinsics Sep 16, 2020
@JimBobSquarePants JimBobSquarePants self-assigned this Oct 26, 2020
@JimBobSquarePants
Member

I'm gonna have a look at this.

@antonfirsov
Member Author

antonfirsov commented Nov 2, 2020

Btw, the Span<Vector4> <=> Span<TPixel> conversions are delegating the (pad/slice) shuffle work to PixelOperations<TPixel> conversion methods:

https://github.com/SixLabors/ImageSharp/blob/78a584e8482b052d7a9885682299e2f37518d83d/src/ImageSharp/PixelFormats/Utils/Vector4Converters.RgbaCompatible.cs

Which means that the last two steps will be done automatically, if I'm not missing anything. Can't wait to see the Vector4 <=> Rgb24 before/after comparison. (Which is the main goal of this issue because of the conversion steps in Jpeg decoder and ResizeProcessor.)
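
As a rough illustration of that delegation, the Rgb24 => Vector4 direction can be pictured as a pad shuffle followed by a byte-to-float scale. This is a scalar Python sketch under assumptions: the function name `rgb24_to_vector4` is hypothetical, alpha is assumed to be filled with 255, and normalization is assumed to be a plain division by 255.

```python
def rgb24_to_vector4(src: bytes):
    """Rgb24 bytes -> list of (x, y, z, w) floats in [0, 1]."""
    if len(src) % 3:
        raise ValueError("source length must be a multiple of 3")

    # Step 1 (the pad shuffle): Rgb24 -> Rgba32 with opaque alpha.
    rgba = bytearray()
    for i in range(0, len(src), 3):
        rgba += src[i:i + 3] + b"\xff"

    # Step 2: Rgba32 -> Vector4, scaling every byte into [0, 1].
    return [
        tuple(c / 255.0 for c in rgba[i:i + 4])
        for i in range(0, len(rgba), 4)
    ]


vectors = rgb24_to_vector4(bytes([255, 0, 51]))
```

Only step 1 depends on the pixel format, so speeding up the pad shuffle speeds up this whole path without touching step 2.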

@JimBobSquarePants
Member

@antonfirsov need to add a specific benchmark for that but I know for certain that even my Rgba32 <==> Rgb24 fallback is a lot faster than the original.

@JimBobSquarePants
Member

@antonfirsov Here you go!

ToVector4_Rgb24

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.403
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-OIBEDX : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  Job-OPAORC : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
  Job-VPSIRL : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

IterationCount=3  LaunchCount=1  WarmupCount=3

Master

| Method | Job | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixelOperations_Base | Job-BPVKME | .NET 4.7.2 | 64 | 278.4 ns | 6.89 ns | 0.38 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-BPVKME | .NET 4.7.2 | 64 | 310.2 ns | 87.58 ns | 4.80 ns | 1.11 | 0.02 | - | - | - | - |
| PixelOperations_Base | Job-FBIBGB | .NET Core 2.1 | 64 | 231.9 ns | 254.31 ns | 13.94 ns | 1.00 | 0.00 | 0.0052 | - | - | 24 B |
| PixelOperations_Specialized | Job-FBIBGB | .NET Core 2.1 | 64 | 230.2 ns | 27.06 ns | 1.48 ns | 0.99 | 0.05 | - | - | - | - |
| PixelOperations_Base | Job-CETXPV | .NET Core 3.1 | 64 | 215.5 ns | 52.24 ns | 2.86 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-CETXPV | .NET Core 3.1 | 64 | 236.7 ns | 553.01 ns | 30.31 ns | 1.10 | 0.13 | - | - | - | - |
| PixelOperations_Base | Job-BPVKME | .NET 4.7.2 | 256 | 1,022.3 ns | 3,570.70 ns | 195.72 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-BPVKME | .NET 4.7.2 | 256 | 622.5 ns | 26.76 ns | 1.47 ns | 0.62 | 0.11 | - | - | - | - |
| PixelOperations_Base | Job-FBIBGB | .NET Core 2.1 | 256 | 762.3 ns | 103.78 ns | 5.69 ns | 1.00 | 0.00 | 0.0048 | - | - | 24 B |
| PixelOperations_Specialized | Job-FBIBGB | .NET Core 2.1 | 256 | 498.1 ns | 70.87 ns | 3.88 ns | 0.65 | 0.00 | - | - | - | - |
| PixelOperations_Base | Job-CETXPV | .NET Core 3.1 | 256 | 754.0 ns | 37.92 ns | 2.08 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-CETXPV | .NET Core 3.1 | 256 | 436.8 ns | 21.88 ns | 1.20 ns | 0.58 | 0.00 | - | - | - | - |
| PixelOperations_Base | Job-BPVKME | .NET 4.7.2 | 2048 | 5,679.3 ns | 1,454.37 ns | 79.72 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-BPVKME | .NET 4.7.2 | 2048 | 3,460.6 ns | 273.43 ns | 14.99 ns | 0.61 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-FBIBGB | .NET Core 2.1 | 2048 | 6,033.8 ns | 8,785.67 ns | 481.57 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-FBIBGB | .NET Core 2.1 | 2048 | 3,421.3 ns | 376.64 ns | 20.64 ns | 0.57 | 0.04 | - | - | - | - |
| PixelOperations_Base | Job-CETXPV | .NET Core 3.1 | 2048 | 5,542.3 ns | 790.31 ns | 43.32 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-CETXPV | .NET Core 3.1 | 2048 | 2,972.2 ns | 70.72 ns | 3.88 ns | 0.54 | 0.00 | - | - | - | - |

Branch

| Method | Job | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixelOperations_Base | Job-OIBEDX | .NET 4.7.2 | 64 | 298.4 ns | 33.63 ns | 1.84 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-OIBEDX | .NET 4.7.2 | 64 | 355.5 ns | 908.51 ns | 49.80 ns | 1.19 | 0.17 | - | - | - | - |
| PixelOperations_Base | Job-OPAORC | .NET Core 2.1 | 64 | 220.1 ns | 13.77 ns | 0.75 ns | 1.00 | 0.00 | 0.0055 | - | - | 24 B |
| PixelOperations_Specialized | Job-OPAORC | .NET Core 2.1 | 64 | 228.5 ns | 41.41 ns | 2.27 ns | 1.04 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-VPSIRL | .NET Core 3.1 | 64 | 213.6 ns | 12.47 ns | 0.68 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-VPSIRL | .NET Core 3.1 | 64 | 217.0 ns | 9.95 ns | 0.55 ns | 1.02 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-OIBEDX | .NET 4.7.2 | 256 | 829.0 ns | 242.93 ns | 13.32 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-OIBEDX | .NET 4.7.2 | 256 | 448.9 ns | 4.04 ns | 0.22 ns | 0.54 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-OPAORC | .NET Core 2.1 | 256 | 863.0 ns | 1,253.26 ns | 68.70 ns | 1.00 | 0.00 | 0.0048 | - | - | 24 B |
| PixelOperations_Specialized | Job-OPAORC | .NET Core 2.1 | 256 | 309.2 ns | 66.16 ns | 3.63 ns | 0.36 | 0.03 | - | - | - | - |
| PixelOperations_Base | Job-VPSIRL | .NET Core 3.1 | 256 | 737.0 ns | 253.90 ns | 13.92 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-VPSIRL | .NET Core 3.1 | 256 | 212.3 ns | 1.07 ns | 0.06 ns | 0.29 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-OIBEDX | .NET 4.7.2 | 2048 | 5,625.6 ns | 404.35 ns | 22.16 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-OIBEDX | .NET 4.7.2 | 2048 | 1,974.1 ns | 229.84 ns | 12.60 ns | 0.35 | 0.00 | - | - | - | - |
| PixelOperations_Base | Job-OPAORC | .NET Core 2.1 | 2048 | 5,467.2 ns | 537.29 ns | 29.45 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-OPAORC | .NET Core 2.1 | 2048 | 1,985.5 ns | 4,714.23 ns | 258.40 ns | 0.36 | 0.05 | - | - | - | - |
| PixelOperations_Base | Job-VPSIRL | .NET Core 3.1 | 2048 | 5,888.2 ns | 1,622.23 ns | 88.92 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-VPSIRL | .NET Core 3.1 | 2048 | 1,165.0 ns | 191.71 ns | 10.51 ns | 0.20 | 0.00 | - | - | - | - |

FromVector4_Rgb24

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.403
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-XYEQXL : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  Job-HSXNJV : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
  Job-YUREJO : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

IterationCount=3  LaunchCount=1  WarmupCount=3

Master

| Method | Job | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixelOperations_Base | Job-BPNZYS | .NET 4.7.2 | 64 | 317.7 ns | 125.40 ns | 6.87 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-BPNZYS | .NET 4.7.2 | 64 | 316.4 ns | 70.42 ns | 3.86 ns | 1.00 | 0.03 | - | - | - | - |
| PixelOperations_Base | Job-NYROHY | .NET Core 2.1 | 64 | 232.7 ns | 82.61 ns | 4.53 ns | 1.00 | 0.00 | 0.0055 | - | - | 24 B |
| PixelOperations_Specialized | Job-NYROHY | .NET Core 2.1 | 64 | 238.9 ns | 106.11 ns | 5.82 ns | 1.03 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-LSNAID | .NET Core 3.1 | 64 | 228.4 ns | 15.16 ns | 0.83 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-LSNAID | .NET Core 3.1 | 64 | 250.3 ns | 22.79 ns | 1.25 ns | 1.10 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-BPNZYS | .NET 4.7.2 | 256 | 975.5 ns | 1,646.67 ns | 90.26 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-BPNZYS | .NET 4.7.2 | 256 | 1,051.3 ns | 170.43 ns | 9.34 ns | 1.08 | 0.11 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-NYROHY | .NET Core 2.1 | 256 | 793.0 ns | 69.24 ns | 3.80 ns | 1.00 | 0.00 | 0.0048 | - | - | 24 B |
| PixelOperations_Specialized | Job-NYROHY | .NET Core 2.1 | 256 | 846.8 ns | 117.07 ns | 6.42 ns | 1.07 | 0.01 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-LSNAID | .NET Core 3.1 | 256 | 797.2 ns | 342.02 ns | 18.75 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-LSNAID | .NET Core 3.1 | 256 | 640.2 ns | 19.74 ns | 1.08 ns | 0.80 | 0.02 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-BPNZYS | .NET 4.7.2 | 2048 | 6,178.4 ns | 1,537.81 ns | 84.29 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-BPNZYS | .NET 4.7.2 | 2048 | 4,551.2 ns | 257.65 ns | 14.12 ns | 0.74 | 0.01 | 0.0153 | - | - | 72 B |
| PixelOperations_Base | Job-NYROHY | .NET Core 2.1 | 2048 | 6,621.8 ns | 13,533.62 ns | 741.82 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-NYROHY | .NET Core 2.1 | 2048 | 4,390.4 ns | 184.24 ns | 10.10 ns | 0.67 | 0.07 | 0.0153 | - | - | 72 B |
| PixelOperations_Base | Job-LSNAID | .NET Core 3.1 | 2048 | 6,357.9 ns | 964.20 ns | 52.85 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-LSNAID | .NET Core 3.1 | 2048 | 2,979.4 ns | 132.81 ns | 7.28 ns | 0.47 | 0.00 | 0.0153 | - | - | 72 B |

Branch

| Method | Job | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixelOperations_Base | Job-XYEQXL | .NET 4.7.2 | 64 | 343.2 ns | 305.91 ns | 16.77 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-XYEQXL | .NET 4.7.2 | 64 | 320.8 ns | 19.93 ns | 1.09 ns | 0.94 | 0.05 | - | - | - | - |
| PixelOperations_Base | Job-HSXNJV | .NET Core 2.1 | 64 | 234.3 ns | 17.98 ns | 0.99 ns | 1.00 | 0.00 | 0.0052 | - | - | 24 B |
| PixelOperations_Specialized | Job-HSXNJV | .NET Core 2.1 | 64 | 246.0 ns | 82.34 ns | 4.51 ns | 1.05 | 0.02 | - | - | - | - |
| PixelOperations_Base | Job-YUREJO | .NET Core 3.1 | 64 | 222.3 ns | 39.46 ns | 2.16 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-YUREJO | .NET Core 3.1 | 64 | 243.4 ns | 33.58 ns | 1.84 ns | 1.09 | 0.01 | - | - | - | - |
| PixelOperations_Base | Job-XYEQXL | .NET 4.7.2 | 256 | 824.9 ns | 32.77 ns | 1.80 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-XYEQXL | .NET 4.7.2 | 256 | 967.0 ns | 39.09 ns | 2.14 ns | 1.17 | 0.01 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-HSXNJV | .NET Core 2.1 | 256 | 756.9 ns | 94.43 ns | 5.18 ns | 1.00 | 0.00 | 0.0048 | - | - | 24 B |
| PixelOperations_Specialized | Job-HSXNJV | .NET Core 2.1 | 256 | 1,003.3 ns | 3,192.09 ns | 174.97 ns | 1.32 | 0.22 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-YUREJO | .NET Core 3.1 | 256 | 748.6 ns | 248.03 ns | 13.60 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-YUREJO | .NET Core 3.1 | 256 | 437.0 ns | 36.48 ns | 2.00 ns | 0.58 | 0.01 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-XYEQXL | .NET 4.7.2 | 2048 | 5,751.6 ns | 704.24 ns | 38.60 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-XYEQXL | .NET 4.7.2 | 2048 | 4,391.6 ns | 718.17 ns | 39.37 ns | 0.76 | 0.00 | 0.0153 | - | - | 72 B |
| PixelOperations_Base | Job-HSXNJV | .NET Core 2.1 | 2048 | 6,202.0 ns | 1,815.18 ns | 99.50 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-HSXNJV | .NET Core 2.1 | 2048 | 4,225.6 ns | 1,004.03 ns | 55.03 ns | 0.68 | 0.01 | 0.0153 | - | - | 72 B |
| PixelOperations_Base | Job-YUREJO | .NET Core 3.1 | 2048 | 6,157.1 ns | 2,516.98 ns | 137.96 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-YUREJO | .NET Core 3.1 | 2048 | 1,822.7 ns | 1,764.43 ns | 96.71 ns | 0.30 | 0.02 | 0.0172 | - | - | 72 B |

@JimBobSquarePants
Member

@antonfirsov Pushed an update that affects FromVector4_Rgb24, based on a suggestion by @Sergio0694, and it has squeezed out a few more percent of performance.

| Method | Job | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixelOperations_Base | Job-RPFDIH | .NET 4.7.2 | 64 | 327.7 ns | 179.42 ns | 9.83 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-RPFDIH | .NET 4.7.2 | 64 | 320.8 ns | 21.37 ns | 1.17 ns | 0.98 | 0.03 | - | - | - | - |
| PixelOperations_Base | Job-QSMZGQ | .NET Core 2.1 | 64 | 254.8 ns | 337.29 ns | 18.49 ns | 1.00 | 0.00 | 0.0052 | - | - | 24 B |
| PixelOperations_Specialized | Job-QSMZGQ | .NET Core 2.1 | 64 | 245.3 ns | 37.22 ns | 2.04 ns | 0.97 | 0.07 | - | - | - | - |
| PixelOperations_Base | Job-YJZMFY | .NET Core 3.1 | 64 | 232.2 ns | 189.28 ns | 10.37 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-YJZMFY | .NET Core 3.1 | 64 | 255.4 ns | 52.39 ns | 2.87 ns | 1.10 | 0.04 | - | - | - | - |
| PixelOperations_Base | Job-RPFDIH | .NET 4.7.2 | 256 | 910.1 ns | 293.07 ns | 16.06 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-RPFDIH | .NET 4.7.2 | 256 | 974.4 ns | 490.48 ns | 26.89 ns | 1.07 | 0.05 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-QSMZGQ | .NET Core 2.1 | 256 | 849.7 ns | 1,654.73 ns | 90.70 ns | 1.00 | 0.00 | 0.0048 | - | - | 24 B |
| PixelOperations_Specialized | Job-QSMZGQ | .NET Core 2.1 | 256 | 759.9 ns | 77.06 ns | 4.22 ns | 0.90 | 0.09 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-YJZMFY | .NET Core 3.1 | 256 | 816.1 ns | 56.07 ns | 3.07 ns | 1.00 | 0.00 | 0.0057 | - | - | 24 B |
| PixelOperations_Specialized | Job-YJZMFY | .NET Core 3.1 | 256 | 493.0 ns | 216.79 ns | 11.88 ns | 0.60 | 0.01 | 0.0172 | - | - | 72 B |
| PixelOperations_Base | Job-RPFDIH | .NET 4.7.2 | 2048 | 6,394.6 ns | 2,077.05 ns | 113.85 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-RPFDIH | .NET 4.7.2 | 2048 | 4,139.6 ns | 276.41 ns | 15.15 ns | 0.65 | 0.01 | 0.0153 | - | - | 72 B |
| PixelOperations_Base | Job-QSMZGQ | .NET Core 2.1 | 2048 | 6,249.9 ns | 799.28 ns | 43.81 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-QSMZGQ | .NET Core 2.1 | 2048 | 4,020.5 ns | 3,211.19 ns | 176.02 ns | 0.64 | 0.03 | 0.0153 | - | - | 72 B |
| PixelOperations_Base | Job-YJZMFY | .NET Core 3.1 | 2048 | 6,403.5 ns | 4,626.86 ns | 253.61 ns | 1.00 | 0.00 | - | - | - | 24 B |
| PixelOperations_Specialized | Job-YJZMFY | .NET Core 3.1 | 2048 | 1,588.3 ns | 912.20 ns | 50.00 ns | 0.25 | 0.01 | 0.0172 | - | - | 72 B |

@antonfirsov
Member Author

Looks great!

@JimBobSquarePants there is one other important metric we need to check to set our expectations for #1410. Can be done by defining 2 new simple benchmark classes (baseline VS SIMD with Count = 2048):

  1. Comparing Vector4 -> Rgba32 (baseline) to Vector4 -> Rgb24 (Jpeg decoder pipeline last step)
  2. Comparing Rgba32 -> Vector4 (baseline) to Rgb24 -> Vector4 (Resize pipeline first step)

The smaller the difference the bigger the happiness.

@JimBobSquarePants
Member

Couldn't we pack directly into Rgb24 by not scaling down in the color converter and packing the planar values as bytes?

@JimBobSquarePants
Member

Vector4 => Rgba32 is still 2.5x faster on .NET Core 3.1 than Vector4 => Rgb24.

We'd need to have a method that goes direct to be able to cut into that since the pipeline is Vector4 => Rgba32 => Rgb24. Not difficult with hardware intrinsics based on my existing code but likely not fun with the old stuff.

What I dream of is a combination of shuffle + convert.
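
The difference between the current two-step pipeline and a combined "shuffle + convert" can be sketched in scalar Python as follows. This is only an illustration under assumptions: the helper names are hypothetical, and the rounding rule (round half up, then clamp to 0..255) is a guess that has not been checked against ImageSharp.

```python
import math


def to_byte(x: float) -> int:
    """Scale a normalized float to 0..255, round half up, and saturate."""
    return max(0, min(255, math.floor(x * 255.0 + 0.5)))


def vector4_to_rgb24_two_step(vectors) -> bytes:
    """Vector4 => Rgba32 => Rgb24, writing an intermediate 4-byte buffer."""
    rgba = bytearray()
    for v in vectors:
        rgba.extend(to_byte(c) for c in v)      # Vector4 => Rgba32
    rgb = bytearray()
    for i in range(0, len(rgba), 4):
        rgb.extend(rgba[i:i + 3])               # Rgba32 => Rgb24 (slice)
    return bytes(rgb)


def vector4_to_rgb24_fused(vectors) -> bytes:
    """Vector4 => Rgb24 in one pass; alpha is never converted or stored."""
    rgb = bytearray()
    for x, y, z, _w in vectors:
        rgb.extend((to_byte(x), to_byte(y), to_byte(z)))
    return bytes(rgb)


pixels = [(1.0, 0.5, 0.0, 1.0), (0.25, 2.0, -1.0, 0.0)]
assert vector4_to_rgb24_two_step(pixels) == vector4_to_rgb24_fused(pixels)
```

Both variants produce the same bytes; the fused one simply skips the intermediate Rgba32 buffer and the second pass over it.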

@antonfirsov
Member Author

antonfirsov commented Nov 4, 2020

> Vector4 => Rgba32 is still 2.5x faster on .NET Core 3.1 than Vector4 => Rgb24.

That's bad news :(

I don't think going Vector4 => Rgb24 is the best approach here:

  • It's a huge refactor strongly interfering with Vectorize (AVX2) JPEG Color Converter #1411.
  • A single pipeline step would do way too much (convert colorspace + convert to byte + pack), making it very hard to maintain the code
  • Ensuring that the non-HwIntrinsics path does not regress is also very expensive (the Vector<float> -> Vector<byte> conversion would have to be integrated directly into the colorspace conversion code).

I'd rather suggest doing the following:

  1. Color converters should convert and pack to a "Vector3" buffer (Span<float> of RGB components only, no padding for alpha)
  2. Apply SimdUtils.NormalizedFloatToByteSaturate to get the Rgb24 buffer
  3. Convert it further with PixelOperations<TPixel> if TPixel != Rgb24

=> Pro: likely still very fast, much more predictable amount of work, no regressions on old platforms
(But still a lot of work!)
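
The three proposed steps can be pictured with the scalar Python sketch below. It is a guess at the data flow only: `pack_planes` and `rgb24_to_rgba32` are hypothetical names, `normalized_float_to_byte_saturate` merely mirrors the name of SimdUtils.NormalizedFloatToByteSaturate, and its rounding rule (round half up, then clamp) is an assumption.

```python
import math


def pack_planes(r, g, b):
    """Step 1: interleave R, G, B float planes into a flat buffer, no alpha slot."""
    if not (len(r) == len(g) == len(b)):
        raise ValueError("planes must have equal length")
    out = []
    for triple in zip(r, g, b):
        out.extend(triple)
    return out


def normalized_float_to_byte_saturate(values) -> bytes:
    """Step 2: scale [0, 1] floats to bytes, clamping out-of-range input."""
    return bytes(max(0, min(255, math.floor(v * 255.0 + 0.5))) for v in values)


def rgb24_to_rgba32(rgb: bytes) -> bytes:
    """Step 3 (only if TPixel != Rgb24): one example of a follow-up conversion."""
    out = bytearray()
    for i in range(0, len(rgb), 3):
        out += rgb[i:i + 3] + b"\xff"
    return bytes(out)


floats = pack_planes([1.0, 0.0], [0.5, 0.25], [0.0, 1.5])
rgb24 = normalized_float_to_byte_saturate(floats)
rgba32 = rgb24_to_rgba32(rgb24)
```

Because the float buffer in step 1 already has the Rgb24 channel layout, step 2 is a plain element-wise conversion with no shuffling, and step 3 is skipped entirely when the target is Rgb24.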

@JimBobSquarePants
Member

Yeah... All my shuffle code has touched conversion between pixel formats only. The Vector4 pipeline remains untouched (other than a speedup for converting to/from Rgba32)

What do you mean by Color converters? The jpeg ones?

@antonfirsov
Member Author

> What do you mean by Color converters? The jpeg ones?

Yes I meant that.

> Couldn't we pack directly into Rgb24

I think I misunderstood you on this one. I thought you wanted the Jpeg color converters to convert directly into Rgb24; my #1354 (comment) lists arguments against doing that.

A one-step Vector4 => Rgb24 method can help a bit, but I wouldn't expect a big miracle from it. Choose wisely whether you want to invest your time in it.

@JimBobSquarePants
Member

Just realized this was still open.
