-
-
Notifications
You must be signed in to change notification settings - Fork 852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AVX2 Vector4Octet.Pack implementation #1402
Conversation
@saucecontrol Is there something I can do here to avoid the permutation? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but check the second suggestion.
I also realized that we can get this much better by implementing a HwIntrinsics AVX2 version of FromYCbCrSimdVector8
. I suggest to check benchmarks for the color converter instead of isolated benchmarking of Pack
.
Vector4 vo = Vector4.One; | ||
Vector128<float> valpha = Unsafe.As<Vector4, Vector128<float>>(ref vo); | ||
|
||
ref byte control = ref MemoryMarshal.GetReference(SimdUtils.HwIntrinsics.PermuteMaskDeinterleave8x32); | ||
Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By having two versions of Pack
, (or inlining the HwIntrinsics AVX2 version) we can move these loads outside of the for loop calling Pack
.
(Not necessarily a suggestion for this PR since it extends the scope quite a lot)
Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control); | ||
|
||
Vector256<float> r0 = Avx.InsertVector128( | ||
Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not getting it wrong, we can spare ToVector256
in lines dealing with r.A
, g.A
and b.A
, since the upper 4 elements will be overwritten anyways:
Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(), | |
Unsafe.As<Vector4, Vector256<float>>(ref r.A), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly when I tried that I got access violations!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure it was on .A
stuff and not .B
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll double check. Was super surprised to see it as it didn’t make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1); | ||
|
||
Vector256<float> r2 = Avx.InsertVector128( | ||
Unsafe.As<Vector4, Vector128<float>>(ref r.B).ToVector256(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a wider refactor, it's also possible to save the conversion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd certainly be easier if I was using pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is inlining the Pack
stuff into AVX2 color space conversion code. That would remove a bunch of unnecessary loads/stores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what I mean. It's difficult to inline at the moment because I need the 128bit offset when I'm aligning at 256bit.
If I fixed the r, g, b inputs then I could simply load up the vector from the offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pushing the current state.
@antonfirsov Experimenting with a separate implementation. First time I've seen sub 9ms. Need to up the warmup/run count on these benchmarks though there's always too much error.
|
Codecov Report
@@ Coverage Diff @@
## master #1402 +/- ##
==========================================
- Coverage 82.90% 82.89% -0.01%
==========================================
Files 690 690
Lines 31008 31017 +9
Branches 3560 3561 +1
==========================================
+ Hits 25706 25713 +7
- Misses 4580 4581 +1
- Partials 722 723 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine to go with this as is, since it's definitely an improvement.
@antonfirsov There's more! I inlined the method. Using Vector256 works now.
|
You can't avoid them, but you can reduce the number. I have an AVX2 YCbCr->BGRX converter you can look at here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L153-L203 The main difference is I permute the values first, separating the even and odd indexes into the low and high lanes respectively. That allows the unpack operations to do all the remaining work. It also means you can get by with 3 permutes instead of 4 since your alpha vector is constant. And if you make your green coefficients negative, you can use a couple more FMAs in there :) |
@saucecontrol Ahah! I’ll have to give that a try! Thanks! |
Very nice @saucecontrol We're getting much closer now. Once we refactor to resolve directly to I'll be running the latest nightly against the playground benchmarks once this is merged and built.
|
Add AVX2 Vector4Octet.Pack implementation
Prerequisites
Description
This should add some performance boost until #1242 and others are complete.
I struggled writing this; any performance tips are most welcome.
Note: No difference with SIMD disabled as we're not using it.