Add AVX2 Vector4Octet.Pack implementation #1402

JimBobSquarePants · 2020-10-23T13:06:21Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practices as demonstrated in the repository. These follow strict StyleCop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This should add some performance boost until #1242 and others are complete.

I struggled writing this; any performance tips are most welcome.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]          : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1

Note: No difference with SIMD disabled as we're not using it.

Method	Job	EnvironmentVariables	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
Pack	AVX	Empty	10.69 ns	0.097 ns	0.075 ns	1.00	0.00	-	-	-	-
Pack	No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	13.70 ns	0.297 ns	0.558 ns	1.27	0.07	-	-	-	-
Pack	SSE	COMPlus_EnableAVX=0	13.72 ns	0.196 ns	0.164 ns	1.28	0.02	-	-	-	-

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-TEIGEQ : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1  IterationCount=3  LaunchCount=1
WarmupCount=2

Method	TestImage	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
'Decode Jpeg - System.Drawing'	Jpg/b(...)e.jpg [21]	5.065 ms	0.1713 ms	0.0094 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)e.jpg [21]	9.231 ms	0.6878 ms	0.0377 ms	1.82	0.00	-	-	-	15901 B

'Decode Jpeg - System.Drawing'	Jpg/b(...)f.jpg [28]	13.785 ms	1.2893 ms	0.0707 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)f.jpg [28]	22.786 ms	0.1398 ms	0.0077 ms	1.65	0.01	-	-	-	16896 B

'Decode Jpeg - System.Drawing'	Jpg/i(...)e.jpg [43]	382.942 ms	859.0807 ms	47.0891 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/i(...)e.jpg [43]	244.395 ms	9.9247 ms	0.5440 ms	0.64	0.07	-	-	-	36022592 B

JimBobSquarePants · 2020-10-23T13:14:40Z

@saucecontrol Is there something I can do here to avoid the permutation?

antonfirsov

LGTM, but check the second suggestion.

I also realized that we can make this much faster by implementing a HwIntrinsics AVX2 version of FromYCbCrSimdVector8. I suggest checking the benchmarks for the color converter instead of benchmarking Pack in isolation.

antonfirsov · 2020-10-23T13:19:44Z

src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.cs

+                    Vector4 vo = Vector4.One;
+                    Vector128<float> valpha = Unsafe.As<Vector4, Vector128<float>>(ref vo);
+
+                    ref byte control = ref MemoryMarshal.GetReference(SimdUtils.HwIntrinsics.PermuteMaskDeinterleave8x32);
+                    Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control);


By having two versions of Pack (or inlining the HwIntrinsics AVX2 version), we can move these loads outside of the for loop that calls Pack.

(Not necessarily a suggestion for this PR since it extends the scope quite a lot)
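A minimal sketch of the hoisting idea, under stated assumptions: `HoistedConstantsSketch`, `PackOne` and `PackAll` are hypothetical names, and the body of `PackOne` is a toy stand-in rather than the real Pack. The only point is that the alpha and control vectors are created once before the loop and passed in, instead of being loaded on every call.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HoistedConstantsSketch
{
    // Hypothetical stand-in for Pack: it receives its constants as parameters
    // instead of loading them itself on every call.
    private static Vector256<float> PackOne(
        Vector256<float> values,
        Vector256<float> alpha,
        Vector256<int> control)
    {
        // Toy body: reorder the values, then interleave the low halves with alpha.
        Vector256<float> permuted = Avx2.PermuteVar8x32(values, control);
        return Avx.UnpackLow(permuted, alpha);
    }

    public static void PackAll(ReadOnlySpan<float> source, Span<float> destination)
    {
        if (!Avx2.IsSupported)
        {
            throw new PlatformNotSupportedException("This sketch requires AVX2.");
        }

        if (destination.Length < source.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(destination));
        }

        // Loop-invariant values: created once, outside the loop.
        Vector256<float> alpha = Vector256.Create(1f);
        Vector256<int> control = Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7);

        // View the spans as whole 8-float vectors; trailing elements are ignored here.
        ReadOnlySpan<Vector256<float>> src = MemoryMarshal.Cast<float, Vector256<float>>(source);
        Span<Vector256<float>> dst = MemoryMarshal.Cast<float, Vector256<float>>(destination);

        for (int i = 0; i < src.Length; i++)
        {
            dst[i] = PackOne(src[i], alpha, control);
        }
    }
}
```

Whether the JIT would already hoist such loads on its own depends on inlining, so passing the constants explicitly is the more predictable option.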

antonfirsov · 2020-10-23T13:27:07Z

src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.cs

+                    Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control);
+
+                    Vector256<float> r0 = Avx.InsertVector128(
+                       Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(),


If I'm not mistaken, we can skip ToVector256 in the lines dealing with r.A, g.A and b.A, since the upper 4 elements will be overwritten anyway:

Suggested change

-                       Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(),
+                       Unsafe.As<Vector4, Vector256<float>>(ref r.A),

Weirdly when I tried that I got access violations!

Are you sure it was on .A stuff and not .B?

I’ll double check. Was super surprised to see it as it didn’t make sense.

Yup, crashes every time.
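For reference, the two expressions are not equivalent in how much memory they read. The sketch below uses a hypothetical `Pair` struct as a stand-in for two adjacent Vector4 fields; it only illustrates the difference in width and does not claim to explain the crash reported above, which the thread leaves undiagnosed.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class ReinterpretSketch
{
    // Hypothetical stand-in: two adjacent Vector4 fields, 32 bytes in total.
    public struct Pair
    {
        public Vector4 A;
        public Vector4 B;
    }

    // Reads 16 bytes at A, then widens; the upper four elements are zero.
    public static Vector256<float> Widen(ref Pair pair)
        => Unsafe.As<Vector4, Vector128<float>>(ref pair.A).ToVector256();

    // Reads 32 bytes starting at A, so the upper four elements come from
    // whatever follows A in memory (here, B).
    public static Vector256<float> Reinterpret(ref Pair pair)
        => Unsafe.As<Vector4, Vector256<float>>(ref pair.A);
}
```

The second form is only valid when at least 32 readable bytes follow the reference; `Unsafe.As` performs no check of its own.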

antonfirsov · 2020-10-23T13:37:05Z

src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.cs

+                       1);
+
+                    Vector256<float> r2 = Avx.InsertVector128(
+                       Unsafe.As<Vector4, Vector128<float>>(ref r.B).ToVector256(),


With a wider refactor, it's also possible to save the conversion here.

That'd certainly be easier if I was using pointers.

What I meant is inlining the Pack stuff into AVX2 color space conversion code. That would remove a bunch of unnecessary loads/stores.

Yeah, that's what I mean. It's difficult to inline at the moment because I need the 128-bit offset when I'm aligning at 256 bits.
If I fixed the r, g, b inputs then I could simply load up the vector from the offset.

I'm pushing the current state.

JimBobSquarePants · 2020-10-23T16:34:32Z

@antonfirsov Experimenting with a separate implementation. This is the first time I've seen sub-9 ms.

I need to increase the warmup/iteration count on these benchmarks though; there's always too much error.

Method	TestImage	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
'Decode Jpeg - System.Drawing'	Jpg/b(...)e.jpg [21]	4.818 ms	0.5476 ms	0.0300 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)e.jpg [21]	8.998 ms	0.7054 ms	0.0387 ms	1.87	0.00	-	-	-	15906 B

'Decode Jpeg - System.Drawing'	Jpg/b(...)f.jpg [28]	13.091 ms	0.3217 ms	0.0176 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)f.jpg [28]	21.961 ms	1.3653 ms	0.0748 ms	1.68	0.01	-	-	-	16896 B

'Decode Jpeg - System.Drawing'	Jpg/i(...)e.jpg [43]	326.853 ms	85.0622 ms	4.6625 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/i(...)e.jpg [43]	241.632 ms	125.2805 ms	6.8670 ms	0.74	0.02	-	-	-	36022512 B

codecov · 2020-10-23T17:24:01Z

Codecov Report

Merging #1402 into master will decrease coverage by 0.00%.
The diff coverage is 95.55%.

@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   82.90%   82.89%   -0.01%     
==========================================
  Files         690      690              
  Lines       31008    31017       +9     
  Branches     3560     3561       +1     
==========================================
+ Hits        25706    25713       +7     
- Misses       4580     4581       +1     
- Partials      722      723       +1

Flag	Coverage Δ
#unittests	`82.89% <95.55%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...ents/Decoder/ColorConverters/JpegColorConverter.cs	`93.18% <ø> (ø)`
...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs	`93.44% <60.00%> (-1.48%)`	⬇️
...Converters/JpegColorConverter.FromYCbCrSimdAvx2.cs	`92.45% <100.00%> (+0.78%)`	⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b5975a3...3ae4b02.

antonfirsov

I'm fine to go with this as is, since it's definitely an improvement.

JimBobSquarePants · 2020-10-23T17:52:04Z

@antonfirsov There's more!

I inlined the method. Using Vector256 works now.

Method	TestImage	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
'Decode Jpeg - System.Drawing'	Jpg/b(...)e.jpg [21]	4.881 ms	0.5078 ms	0.0278 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)e.jpg [21]	9.056 ms	0.8409 ms	0.0461 ms	1.86	0.01	-	-	-	15889 B

'Decode Jpeg - System.Drawing'	Jpg/b(...)f.jpg [28]	13.211 ms	0.9451 ms	0.0518 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)f.jpg [28]	22.386 ms	1.1476 ms	0.0629 ms	1.69	0.00	-	-	-	16896 B

'Decode Jpeg - System.Drawing'	Jpg/i(...)e.jpg [43]	329.925 ms	41.1656 ms	2.2564 ms	1.00	0.00	-	-	-	216 B
'Decode Jpeg - ImageSharp'	Jpg/i(...)e.jpg [43]	239.662 ms	92.6744 ms	5.0798 ms	0.73	0.02	-	-	-	36022512 B

saucecontrol · 2020-10-23T19:02:52Z

> @saucecontrol Is there something I can do here to avoid the permutation?

You can't avoid them, but you can reduce the number. I have an AVX2 YCbCr->BGRX converter you can look at here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L153-L203

The main difference is I permute the values first, separating the even and odd indexes into the low and high lanes respectively. That allows the unpack operations to do all the remaining work. It also means you can get by with 3 permutes instead of 4 since your alpha vector is constant.
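The permute-first approach described above can be sketched roughly as follows. This is an illustration written from the description, not the code from the linked file or from this PR; `InterleaveSketch` and `Interleave` are hypothetical names, and the output order (RGBA with alpha set to 1) is an assumption.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class InterleaveSketch
{
    // Interleaves 8 planar R, G and B values into 8 RGBA pixels,
    // written as 4 vectors holding 2 pixels each.
    public static void Interleave(
        Vector256<float> r,
        Vector256<float> g,
        Vector256<float> b,
        Span<Vector256<float>> destination)
    {
        if (!Avx2.IsSupported)
        {
            throw new PlatformNotSupportedException("This sketch requires AVX2.");
        }

        if (destination.Length < 4)
        {
            throw new ArgumentException("Need room for 4 vectors.", nameof(destination));
        }

        // Even indexes go to the low lane, odd indexes to the high lane.
        Vector256<int> control = Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7);
        Vector256<float> a = Vector256.Create(1f); // constant alpha, no permute needed

        // Three permutes in total: x0 x2 x4 x6 | x1 x3 x5 x7
        Vector256<float> rp = Avx2.PermuteVar8x32(r, control);
        Vector256<float> gp = Avx2.PermuteVar8x32(g, control);
        Vector256<float> bp = Avx2.PermuteVar8x32(b, control);

        // The unpacks work per 128-bit lane and do the remaining interleaving.
        Vector256<float> rbLo = Avx.UnpackLow(rp, bp);   // r0 b0 r2 b2 | r1 b1 r3 b3
        Vector256<float> rbHi = Avx.UnpackHigh(rp, bp);  // r4 b4 r6 b6 | r5 b5 r7 b7
        Vector256<float> gaLo = Avx.UnpackLow(gp, a);    // g0 a  g2 a  | g1 a  g3 a
        Vector256<float> gaHi = Avx.UnpackHigh(gp, a);   // g4 a  g6 a  | g5 a  g7 a

        destination[0] = Avx.UnpackLow(rbLo, gaLo);      // pixel 0 | pixel 1
        destination[1] = Avx.UnpackHigh(rbLo, gaLo);     // pixel 2 | pixel 3
        destination[2] = Avx.UnpackLow(rbHi, gaHi);      // pixel 4 | pixel 5
        destination[3] = Avx.UnpackHigh(rbHi, gaHi);     // pixel 6 | pixel 7
    }
}
```

Because the unpack instructions never cross the 128-bit lane boundary, placing the even and odd elements in separate lanes up front is what lets each output vector end up with two consecutive pixels.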

And if you make your green coefficients negative, you can use a couple more FMAs in there :)
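As a rough illustration of the FMA remark: with the green coefficients stored as negative constants, the green channel becomes two chained multiply-adds instead of multiplies followed by subtractions. The sketch assumes Cb and Cr have already been centred around zero, and it uses the commonly quoted JFIF coefficients, which are not taken from this PR; `YCbCrFmaSketch` and `Convert` are hypothetical names.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class YCbCrFmaSketch
{
    // Converts 8 YCbCr samples to R, G and B using fused multiply-add.
    // Assumes cb and cr are already centred around zero.
    public static void Convert(
        Vector256<float> y,
        Vector256<float> cb,
        Vector256<float> cr,
        out Vector256<float> r,
        out Vector256<float> g,
        out Vector256<float> b)
    {
        if (!Fma.IsSupported)
        {
            throw new PlatformNotSupportedException("This sketch requires FMA.");
        }

        // Commonly quoted JFIF coefficients (an assumption for this sketch).
        Vector256<float> rCr = Vector256.Create(1.402f);
        Vector256<float> gCb = Vector256.Create(-0.344136f); // stored negative
        Vector256<float> gCr = Vector256.Create(-0.714136f); // stored negative
        Vector256<float> bCb = Vector256.Create(1.772f);

        // Fma.MultiplyAdd(a, b, c) computes (a * b) + c.
        r = Fma.MultiplyAdd(cr, rCr, y);
        g = Fma.MultiplyAdd(cb, gCb, Fma.MultiplyAdd(cr, gCr, y));
        b = Fma.MultiplyAdd(cb, bCb, y);
    }
}
```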

JimBobSquarePants · 2020-10-23T19:44:26Z

@saucecontrol Ahah! I’ll have to give that a try! Thanks!

JimBobSquarePants · 2020-10-23T23:59:01Z

Very nice @saucecontrol

We're getting much closer now. Once we refactor to resolve directly to Rgb24 or Rgba32 instead of Vector4, I think we'll be very close indeed!

I'll be running the latest nightly against the playground benchmarks once this is merged and built.

Method	TestImage	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
'Decode Jpeg - System.Drawing'	Jpg/b(...)e.jpg [21]	4.845 ms	0.7259 ms	0.0398 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)e.jpg [21]	8.724 ms	0.5863 ms	0.0321 ms	1.80	0.01	-	-	-	15888 B

'Decode Jpeg - System.Drawing'	Jpg/b(...)f.jpg [28]	13.134 ms	0.8384 ms	0.0460 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)f.jpg [28]	21.689 ms	0.9054 ms	0.0496 ms	1.65	0.01	-	-	-	16896 B

'Decode Jpeg - System.Drawing'	Jpg/i(...)e.jpg [43]	327.796 ms	19.6112 ms	1.0750 ms	1.00	0.00	-	-	-	216 B
'Decode Jpeg - ImageSharp'	Jpg/i(...)e.jpg [43]	231.320 ms	100.4219 ms	5.5045 ms	0.71	0.02	-	-	-	36022512 B

Add AVX2 Vector4Octet.Pack implementation

Add AVX2 implementation

c1e6d50

JimBobSquarePants added area:performance formats:jpeg labels Oct 23, 2020

JimBobSquarePants requested a review from antonfirsov October 23, 2020 13:06

JimBobSquarePants added this to the 1.1.0 milestone Oct 23, 2020

antonfirsov reviewed Oct 23, 2020

Use HW color conversion

ebfd069

Fix access violation

8872b2b

antonfirsov approved these changes Oct 23, 2020

Inline the packing.

eb315fe

Use less permutes and more multiply/add

3ae4b02

JimBobSquarePants merged commit 120080b into master Oct 24, 2020

JimBobSquarePants deleted the js/vector4octet-pack branch October 24, 2020 00:21

antonfirsov mentioned this pull request Nov 3, 2020

Vectorize (AVX2) JPEG Color Converter #1411

Merged

JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021

Merge pull request #1402 from SixLabors/js/vector4octet-pack

4756201

Add AVX2 Vector4Octet.Pack implementation
