-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Porting additional SIMD Intrinsics to use SimdAsHWIntrinsic #37882
Porting additional SIMD Intrinsics to use SimdAsHWIntrinsic #37882
Conversation
4044add
to
e98213d
Compare
e98213d
to
defea13
Compare
defea13
to
4105ab7
Compare
8ac8ead
to
63ec8f0
Compare
bfb7472
to
d080102
Compare
CC. @CarolEidt, @echesakovMSFT |
Benchmarks x64
Frameworks x64
|
Working on getting perf numbers but this should resolve the following perf regression #37425 |
Most of the diffs are essentially the following: - vmovaps xmm1, xmm0
- vdpps xmm1, xmm0, 113
- vmovaps xmm0, xmm1
+ vdpps xmm0, xmm0, xmm0, 113 There are also a few similar too: vmovmskps xrax, xmm0
- mov edx, 7
- and eax, edx
+ and eax, 7
Likewise a few that go from:
```diff
- mov eax, 0xD1FFAB1E
- vmovd xmm0, eax
- vpbroadcastd ymm0, ymm0
+ vmovupd ymm0, ymmword ptr[reloc @RWD32] A more extreme example: vxorps xmm0, xmm0
vmovdqu xmmword ptr [rsp+48H], xmm0
vmovdqu xmmword ptr [rsp+58H], xmm0
lea rcx, bword ptr [rsp+48H]
call Vector`1:.ctor(Vector`1):this
mov rcx, rdi
vmovdqu xmm0, xmmword ptr [rsp+48H]
vmovdqu xmmword ptr [rsp+28H], xmm0
vmovdqu xmm0, xmmword ptr [rsp+58H]
vmovdqu xmmword ptr [rsp+38H], xmm0
lea rdx, bword ptr [rsp+28H]
mov r8, rsi
call Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1
mov rax, rdi To: vmovupd ymm0, ymmword ptr[rdx]
vbroadcastss ymm0, ymm0
vmovupd ymmword ptr[rsp+50H], ymm0
mov rcx, rsi
vmovupd ymm0, ymmword ptr[rsp+50H]
vmovupd ymmword ptr[rsp+20H], ymm0
lea rdx, bword ptr [rsp+20H]
call Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1
mov rax, rsi The |
SIMD_INTRINSIC("get_One", false, GetOne, "one", TYP_STRUCT, 0, {TYP_VOID, TYP_UNDEF, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG}) | ||
SIMD_INTRINSIC("get_Zero", false, GetZero, "zero", TYP_STRUCT, 0, {TYP_VOID, TYP_UNDEF, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG}) | ||
SIMD_INTRINSIC("get_AllOnes", false, GetAllOnes, "allOnes", TYP_STRUCT, 0, {TYP_VOID, TYP_UNDEF, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG}) | ||
|
||
// .ctor call or newobj - there are four forms. | ||
// This form takes the object plus a value of the base (element) type: | ||
SIMD_INTRINSIC(".ctor", true, Init, "init", TYP_VOID, 2, {TYP_BYREF, TYP_UNKNOWN, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fully removing SIMDIntrinsicInit
and removing gtGetSIMDZero
requires a bit more work. I logged #37043 as the more general issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that, beyond this PR, any other improvements should likely hold off for .NET 6.
CC. @dotnet/jit-contrib This should be ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor comment stuff.
} | ||
} | ||
|
||
if (isSupported && (intrinsic == NI_Vector256_ToScalar)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would make more sense to just split out this case (as it was previously), even with a separate check for the long types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the confusing part here is that the AVX
check isn't checking for some instruction support, it's just a "we support Vector256<T>
" check.
The instruction emitted is the same for 128 or 256-bit, it's just that we support Vector256<T>
if AVX is supported, so the code really is identical (we even emit the register access as a 128-bit access)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that (AFAICT) the previous calls to compExactlyDependsOn
are unnecessary for the NI_Vector256_ToScalar
case. So perhaps just checking for that first would make more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's generally true except for the SSE2_X64
case, as we won't have AVX_X64
until #38460 goes in (although you also can't disable SSE2.X64
, you can only disable SSE2
itself).
If duplicated, the compExactlyDependsOn(SSE2)
checks would become compExactlyDependsOn(AVX)
and the compExactlyDependsOn(SSE2_X64)
check would become compExactlyDependsOn(AVX) && compExactlyDependsOn(SSE2_X64)
. Once #38460 is merged, it could be simplified to just compExactlyDependsOn(AVX_X64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is it correct that only the SSE2_X64
check is non-redundant for the AVX
case? It still seems confusing to me, and a lot of unnecessary checks for that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, only the SSE2_X64 case is non redundant. I just pushed a fix that breaks it apart and which should make that distinction clearer.
…AllOnes, and SIMDIntrinsicGetZero
…UBLE and fixing VectorT128_get_One
Co-authored-by: Carol Eidt <[email protected]>
6db4ec3
to
3ac2b4a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for the comments and code shufflings!
This ports additional intrinsics such as
SIMDIntrinsicInit
,SIMDIntrinsicGetOne
, andSIMDIntrinsicDot