Add AVX2 Vector4Octet.Pack implementation #1402

Merged 5 commits on Oct 24, 2020

Member

@JimBobSquarePants JimBobSquarePants commented Oct 23, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This should add some performance boost until #1242 and others are complete.

I struggled writing this; any performance tips are most welcome.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]          : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1

Note: No difference with SIMD disabled as we're not using it.

| Method | Job | EnvironmentVariables | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pack | AVX | Empty | 10.69 ns | 0.097 ns | 0.075 ns | 1.00 | 0.00 | - | - | - | - |
| Pack | No HwIntrinsics | COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 | 13.70 ns | 0.297 ns | 0.558 ns | 1.27 | 0.07 | - | - | - | - |
| Pack | SSE | COMPlus_EnableAVX=0 | 13.72 ns | 0.196 ns | 0.164 ns | 1.28 | 0.02 | - | - | - | - |

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-TEIGEQ : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1  IterationCount=3  LaunchCount=1
WarmupCount=2

| Method | TestImage | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)e.jpg [21] | 5.065 ms | 0.1713 ms | 0.0094 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)e.jpg [21] | 9.231 ms | 0.6878 ms | 0.0377 ms | 1.82 | 0.00 | - | - | - | 15901 B |
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)f.jpg [28] | 13.785 ms | 1.2893 ms | 0.0707 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)f.jpg [28] | 22.786 ms | 0.1398 ms | 0.0077 ms | 1.65 | 0.01 | - | - | - | 16896 B |
| 'Decode Jpeg - System.Drawing' | Jpg/i(...)e.jpg [43] | 382.942 ms | 859.0807 ms | 47.0891 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/i(...)e.jpg [43] | 244.395 ms | 9.9247 ms | 0.5440 ms | 0.64 | 0.07 | - | - | - | 36022592 B |

@JimBobSquarePants
Member Author

@saucecontrol Is there something I can do here to avoid the permutation?

@JimBobSquarePants JimBobSquarePants added this to the 1.1.0 milestone Oct 23, 2020
Member

@antonfirsov antonfirsov left a comment

LGTM, but check the second suggestion.

I also realized that we can do much better by implementing a HwIntrinsics AVX2 version of FromYCbCrSimdVector8. I suggest benchmarking the color converter as a whole rather than benchmarking Pack in isolation.

Comment on lines 201 to 205
Vector4 vo = Vector4.One;
Vector128<float> valpha = Unsafe.As<Vector4, Vector128<float>>(ref vo);

ref byte control = ref MemoryMarshal.GetReference(SimdUtils.HwIntrinsics.PermuteMaskDeinterleave8x32);
Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control);
Member

@antonfirsov antonfirsov Oct 23, 2020

By having two versions of Pack (or inlining the HwIntrinsics AVX2 version), we can move these loads outside of the for loop calling Pack.

(Not necessarily a suggestion for this PR since it extends the scope quite a lot)

Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control);

Vector256<float> r0 = Avx.InsertVector128(
Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(),
Member

If I'm not mistaken, we can skip ToVector256 in the lines dealing with r.A, g.A and b.A, since the upper 4 elements will be overwritten anyway:

Suggested change
- Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(),
+ Unsafe.As<Vector4, Vector256<float>>(ref r.A),

Member Author

Weirdly when I tried that I got access violations!

Member

Are you sure it was on .A stuff and not .B?

Member Author

I’ll double check. Was super surprised to see it as it didn’t make sense.

Member Author

Yup, crashes every time.

1);

Vector256<float> r2 = Avx.InsertVector128(
Unsafe.As<Vector4, Vector128<float>>(ref r.B).ToVector256(),
Member

With a wider refactor, it's also possible to save the conversion here.

Member Author

That'd certainly be easier if I was using pointers.

Member

What I meant is inlining the Pack stuff into AVX2 color space conversion code. That would remove a bunch of unnecessary loads/stores.

Member Author

Yeah, that's what I mean. It's difficult to inline at the moment because I need the 128-bit offset when I'm aligning at 256-bit.
If I fixed the r, g, b inputs, then I could simply load the vector from the offset.

Member Author

I'm pushing the current state.

@JimBobSquarePants
Member Author

@antonfirsov Experimenting with a separate implementation. First time I've seen sub 9ms.

I need to increase the warmup/iteration count on these benchmarks, though; there's always too much error.

| Method | TestImage | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)e.jpg [21] | 4.818 ms | 0.5476 ms | 0.0300 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)e.jpg [21] | 8.998 ms | 0.7054 ms | 0.0387 ms | 1.87 | 0.00 | - | - | - | 15906 B |
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)f.jpg [28] | 13.091 ms | 0.3217 ms | 0.0176 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)f.jpg [28] | 21.961 ms | 1.3653 ms | 0.0748 ms | 1.68 | 0.01 | - | - | - | 16896 B |
| 'Decode Jpeg - System.Drawing' | Jpg/i(...)e.jpg [43] | 326.853 ms | 85.0622 ms | 4.6625 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/i(...)e.jpg [43] | 241.632 ms | 125.2805 ms | 6.8670 ms | 0.74 | 0.02 | - | - | - | 36022512 B |

@codecov

codecov bot commented Oct 23, 2020

Codecov Report

Merging #1402 into master will decrease coverage by 0.00%.
The diff coverage is 95.55%.

@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   82.90%   82.89%   -0.01%     
==========================================
  Files         690      690              
  Lines       31008    31017       +9     
  Branches     3560     3561       +1     
==========================================
+ Hits        25706    25713       +7     
- Misses       4580     4581       +1     
- Partials      722      723       +1     

| Flag | Coverage Δ |
|---|---|
| #unittests | 82.89% <95.55%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|---|---|
| ...ents/Decoder/ColorConverters/JpegColorConverter.cs | 93.18% <ø> (ø) |
| ...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs | 93.44% <60.00%> (-1.48%) ⬇️ |
| ...Converters/JpegColorConverter.FromYCbCrSimdAvx2.cs | 92.45% <100.00%> (+0.78%) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b5975a3...3ae4b02.

Member

@antonfirsov antonfirsov left a comment

I'm fine to go with this as is, since it's definitely an improvement.

@JimBobSquarePants
Member Author

JimBobSquarePants commented Oct 23, 2020

@antonfirsov There's more!

I inlined the method. Using Vector256 works now.

| Method | TestImage | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)e.jpg [21] | 4.881 ms | 0.5078 ms | 0.0278 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)e.jpg [21] | 9.056 ms | 0.8409 ms | 0.0461 ms | 1.86 | 0.01 | - | - | - | 15889 B |
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)f.jpg [28] | 13.211 ms | 0.9451 ms | 0.0518 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)f.jpg [28] | 22.386 ms | 1.1476 ms | 0.0629 ms | 1.69 | 0.00 | - | - | - | 16896 B |
| 'Decode Jpeg - System.Drawing' | Jpg/i(...)e.jpg [43] | 329.925 ms | 41.1656 ms | 2.2564 ms | 1.00 | 0.00 | - | - | - | 216 B |
| 'Decode Jpeg - ImageSharp' | Jpg/i(...)e.jpg [43] | 239.662 ms | 92.6744 ms | 5.0798 ms | 0.73 | 0.02 | - | - | - | 36022512 B |

@saucecontrol
Contributor

> @saucecontrol Is there something I can do here to avoid the permutation?

You can't avoid them, but you can reduce the number. I have an AVX2 YCbCr->BGRX converter you can look at here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L153-L203

The main difference is I permute the values first, separating the even and odd indexes into the low and high lanes respectively. That allows the unpack operations to do all the remaining work. It also means you can get by with 3 permutes instead of 4 since your alpha vector is constant.
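
For readers following along, here is a minimal sketch of the permute-first idea described above, applied to float RGBA output. It is not the code from the linked file or from this PR: the class name `PermuteFirstPackSketch`, the method `Pack`, and its `out` parameters are hypothetical, and it assumes 8 planar values per channel with a constant alpha of 1.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class PermuteFirstPackSketch
{
    // Hypothetical helper (not ImageSharp or PhotoSauce code): interleaves 8 planar
    // R, G and B floats with a constant alpha of 1 into 8 RGBA pixels.
    public static void Pack(
        Vector256<float> r,
        Vector256<float> g,
        Vector256<float> b,
        out Vector256<float> p01,
        out Vector256<float> p23,
        out Vector256<float> p45,
        out Vector256<float> p67)
    {
        if (!Avx2.IsSupported)
        {
            throw new PlatformNotSupportedException("This sketch requires AVX2.");
        }

        // Even indexes go to the low 128-bit lane, odd indexes to the high lane.
        Vector256<int> evenOdd = Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7);

        // Alpha is constant, so it needs no permute: 3 permutes instead of 4.
        Vector256<float> a = Vector256.Create(1F);

        r = Avx2.PermuteVar8x32(r, evenOdd); // r0 r2 r4 r6 | r1 r3 r5 r7
        g = Avx2.PermuteVar8x32(g, evenOdd); // g0 g2 g4 g6 | g1 g3 g5 g7
        b = Avx2.PermuteVar8x32(b, evenOdd); // b0 b2 b4 b6 | b1 b3 b5 b7

        // Unpack operates within each 128-bit lane, interleaving element pairs.
        Vector256<float> rgLo = Avx.UnpackLow(r, g);  // r0 g0 r2 g2 | r1 g1 r3 g3
        Vector256<float> baLo = Avx.UnpackLow(b, a);  // b0 1  b2 1  | b1 1  b3 1
        Vector256<float> rgHi = Avx.UnpackHigh(r, g); // r4 g4 r6 g6 | r5 g5 r7 g7
        Vector256<float> baHi = Avx.UnpackHigh(b, a); // b4 1  b6 1  | b5 1  b7 1

        // Shuffle takes two floats from each operand per lane, completing the pixels.
        p01 = Avx.Shuffle(rgLo, baLo, 0b01_00_01_00); // r0 g0 b0 1 | r1 g1 b1 1
        p23 = Avx.Shuffle(rgLo, baLo, 0b11_10_11_10); // r2 g2 b2 1 | r3 g3 b3 1
        p45 = Avx.Shuffle(rgHi, baHi, 0b01_00_01_00); // r4 g4 b4 1 | r5 g5 b5 1
        p67 = Avx.Shuffle(rgHi, baHi, 0b11_10_11_10); // r6 g6 b6 1 | r7 g7 b7 1
    }
}
```

Because the only lane-crossing operations are the three permutes at the start, everything after them stays within 128-bit lanes, and the four results come out in pixel order ready to be stored contiguously.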

And if you make your green coefficients negative, you can use a couple more FMAs in there :)
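
A small sketch of what the negative-coefficient suggestion could look like for the green channel. This is an illustration, not the PR's code: the class `GreenChannelSketch` and method `Green` are hypothetical, the coefficients are assumed to be the standard JFIF values, and `cb` and `cr` are assumed to be already centered around zero.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class GreenChannelSketch
{
    // Assumed standard JFIF YCbCr -> RGB green coefficients, stored pre-negated
    // so that g = y + cb * CbToG + cr * CrToG uses only multiply-adds.
    private const float CbToG = -0.344136F;
    private const float CrToG = -0.714136F;

    // Hypothetical helper: cb and cr are assumed to be centered already.
    public static Vector256<float> Green(
        Vector256<float> y,
        Vector256<float> cb,
        Vector256<float> cr)
    {
        Vector256<float> cbCoeff = Vector256.Create(CbToG);
        Vector256<float> crCoeff = Vector256.Create(CrToG);

        if (Fma.IsSupported)
        {
            // Two fused multiply-adds: (cr * crCoeff) + ((cb * cbCoeff) + y)
            return Fma.MultiplyAdd(cr, crCoeff, Fma.MultiplyAdd(cb, cbCoeff, y));
        }

        if (Avx.IsSupported)
        {
            // Fallback without FMA: separate multiplies and adds.
            Vector256<float> t = Avx.Add(y, Avx.Multiply(cb, cbCoeff));
            return Avx.Add(t, Avx.Multiply(cr, crCoeff));
        }

        throw new PlatformNotSupportedException("This sketch requires AVX.");
    }
}
```

With positive coefficients the same expression needs subtractions after the multiplies; folding the sign into the constants lets each term collapse into a single fused instruction on hardware that supports FMA.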

@JimBobSquarePants
Member Author

@saucecontrol Ahah! I’ll have to give that a try! Thanks!

@JimBobSquarePants
Member Author

Very nice @saucecontrol

We're getting much closer now. Once we refactor to resolve directly to Rgb24, Rgba32 instead of Vector4 I think we'll be very close indeed!

I'll be running the latest nightly against the playground benchmarks once this is merged and built.

| Method | TestImage | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)e.jpg [21] | 4.845 ms | 0.7259 ms | 0.0398 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)e.jpg [21] | 8.724 ms | 0.5863 ms | 0.0321 ms | 1.80 | 0.01 | - | - | - | 15888 B |
| 'Decode Jpeg - System.Drawing' | Jpg/b(...)f.jpg [28] | 13.134 ms | 0.8384 ms | 0.0460 ms | 1.00 | 0.00 | - | - | - | 176 B |
| 'Decode Jpeg - ImageSharp' | Jpg/b(...)f.jpg [28] | 21.689 ms | 0.9054 ms | 0.0496 ms | 1.65 | 0.01 | - | - | - | 16896 B |
| 'Decode Jpeg - System.Drawing' | Jpg/i(...)e.jpg [43] | 327.796 ms | 19.6112 ms | 1.0750 ms | 1.00 | 0.00 | - | - | - | 216 B |
| 'Decode Jpeg - ImageSharp' | Jpg/i(...)e.jpg [43] | 231.320 ms | 100.4219 ms | 5.5045 ms | 0.71 | 0.02 | - | - | - | 36022512 B |

@JimBobSquarePants JimBobSquarePants merged commit 120080b into master Oct 24, 2020
@JimBobSquarePants JimBobSquarePants deleted the js/vector4octet-pack branch October 24, 2020 00:21
JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021
Add AVX2 Vector4Octet.Pack implementation