Implement Narrow and Widen using SIMDAsHWIntrinsic #60094
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details: This continues the work on #49397 which started with #53450. In particular, this moves Narrow and Widen to be implemented using the general SIMDAsHWIntrinsic logic and then has the new APIs in Vector64/128/256 use the same shared entry points. There will be a few more PRs after this one covering:
Will run outerloop jobs. Also plan on collecting PMI diffs before marking this ready-for-review.
Force-pushed from 366dcdd to c0b823c.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
Force-pushed from c0b823c to 68b3e42.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
No diff for benchmarks.
==============================
Frameworks has the following improvements:
Noting that the Vector:Widen improvements are due to the code being "vectorized" now, since it forwards to the shared intrinsic entry points. For example, in Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long):
vmovupd ymm0, ymmword ptr[rcx+rax]
- vpermq ymm2, ymm0, -44
- vxorps ymm1, ymm1
- vpunpcklbw ymm2, ymm1
- vpermq ymm0, ymm0, -24
- vxorps ymm1, ymm1
- vpunpckhbw ymm0, ymm1
+ vmovaps ymm1, ymm0
+ vpmovzxbw ymm1, ymm1
+ vextractf128 xmm0, xmm0, 1
+ vpmovzxbw ymm0, ymm0
(No diff shown against the old software impl; just showing a vectorized example.)
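As a rough illustration, the kind of pattern that now vectorizes looks like the sketch below (a hypothetical loop and name; this is not the actual Latin1Utility code):

using System;
using System.Numerics;

// Hypothetical byte -> ushort widening loop. With this change the
// Vector.Widen call compiles down to vpmovzxbw on AVX2 instead of
// remaining a software fallback.
static void WidenBytes(ReadOnlySpan<byte> source, Span<ushort> destination)
{
    int i = 0;

    for (; i <= source.Length - Vector<byte>.Count; i += Vector<byte>.Count)
    {
        Vector.Widen(new Vector<byte>(source.Slice(i)), out Vector<ushort> lower, out Vector<ushort> upper);
        lower.CopyTo(destination.Slice(i));
        upper.CopyTo(destination.Slice(i + Vector<ushort>.Count));
    }

    for (; i < source.Length; i++)
    {
        destination[i] = source[i]; // scalar tail
    }
}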
==============================
Tests has the following improvements:
Noting that most of the improvements come from the Vector128/256:Narrow/Widen calls being intrinsified now, where they weren't before. The interesting diffs are for JIT\SIMD, where:
- vmovupd ymm6, ymmword ptr[rsi]
- vmovupd ymm0, ymmword ptr[rdi]
- vcvtpd2ps ymm6, ymm6
- vcvtpd2ps ymm1, ymm0
- vinsertf128 xmm6, xmm1, 1
- vcvtps2pd ymm7, ymm6
- vextractf128 xmm8, xmm6, 1
- vcvtps2pd ymm8, ymm8
+ vcvtpd2ps ymm0, ymmword ptr[rsi]
+ vcvtpd2ps ymm1, ymmword ptr[rdi]
+ vinsertf128 xmm6, xmm0, xmm1, 1
+ vmovaps ymm0, ymm6
+ vcvtps2pd ymm7, ymm0
+ vextractf128 xmm0, xmm6, 1
+ vcvtps2pd ymm8, ymm0
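For reference, the shape of code behind a diff like this is roughly the following (a sketch using the System.Numerics APIs; the actual JIT\SIMD test differs):

using System.Numerics;

// Narrow two double vectors into one float vector, then widen it back.
// After this change the JIT emits vcvtpd2ps directly from memory operands
// and keeps the halves in registers, as in the diff above.
static (Vector<double> Lower, Vector<double> Upper) NarrowThenWiden(Vector<double> a, Vector<double> b)
{
    Vector<float> narrowed = Vector.Narrow(a, b);
    Vector.Widen(narrowed, out Vector<double> lower, out Vector<double> upper);
    return (lower, upper);
}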
Force-pushed from f5d21da to 3094f4c.
This should be ready for review now.
Force-pushed from 3094f4c to c5fb319.
Rebased onto dotnet/main. This is still ready for review.
CC @echesakovMSFT PTAL.
public static unsafe (Vector64<ushort> Lower, Vector64<ushort> Upper) Widen(Vector64<byte> source)
    => (WidenLower(source), WidenUpper(source));
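For illustration, a caller consumes the tuple-returning shape like this (a usage sketch for the new API):

// Usage sketch: widen eight bytes into two Vector64<ushort> halves.
Vector64<byte> bytes = Vector64.Create((byte)0x7F);
(Vector64<ushort> lower, Vector64<ushort> upper) = Vector64.Widen(bytes);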
@tannergooding Have you looked at the code produced for such a function call? Do we pass the return values properly in two registers from the callee to the caller, without copying to/from memory?
The codegen looked good when I checked; let me rebuild and get a disasm example to share, however.
Thanks, I remember seeing some strange artifacts with multi-reg returns while I was working on #52424.
It's worth noting this isn't a multi-reg return; rather, it's a managed wrapper method that calls two helper intrinsics, since no platform implements this as a single instruction with a multi-reg result.
Sure, it's not a multi-reg intrinsic, but it's still a multi-reg call returning a value in the D0 and D1 registers, so I was wondering whether the codegen for such calls (incl. the method) is optimal.
@echesakovMSFT, looks like yes, there are still some oddities there: it doesn't preference q0/q1 as the target registers, nor does it directly do a mov; instead it does a store/load.
4EA01C10 mov v16.16b, v0.16b
0E617A10 fcvtl v16.2d, v16.2s
4E617811 fcvtl2 v17.2d, v0.4s
3D8007B0 str q16, [fp,#16]
3D800BB1 str q17, [fp,#32]
3DC007A0 ldr q0, [fp,#16]
3DC00BA1 ldr q1, [fp,#32]
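For reference, a minimal repro producing this codegen would be roughly the following (a sketch; the exact test shape is an assumption):

using System.Runtime.Intrinsics;

// On arm64, Widen lowers to fcvtl/fcvtl2; ideally the two halves would be
// returned directly in q0/q1 rather than spilled through the stack as above.
static (Vector128<double> Lower, Vector128<double> Upper) WidenFloats(Vector128<float> value)
    => Vector128.Widen(value);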
It looks like this gets imported initially as:
[000014] I-CXG------- * CALL void ValueTuple`2..ctor (exactContextHnd=0x00007FFA49A115C9)
[000013] ------------ this in x0 +--* ADDR byref
[000012] -------N---- | \--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1
[000018] ---XG------- arg1 +--* OBJ simd16<Vector128`1>
[000017] ------------ | \--* ADDR byref
[000004] -------N---- | \--* HWINTRINSIC simd16 float ConvertToDouble
[000003] ------------ | \--* HWINTRINSIC simd8 float GetLower
[000002] n----------- | \--* OBJ simd16<Vector128`1>
[000001] ------------ | \--* ADDR byref
[000000] -------N---- | \--* LCL_VAR simd16<Vector128`1> V00 arg0
[000016] ---XG------- arg2 \--* OBJ simd16<Vector128`1>
[000015] ------------ \--* ADDR byref
[000008] -------N---- \--* HWINTRINSIC simd16 float ConvertToDoubleUpper
[000007] n----------- \--* OBJ simd16<Vector128`1>
[000006] ------------ \--* ADDR byref
[000005] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0
However, there are locals introduced and indirections kept that never truly go away, even in the face of HVA/HFA. So before rationalization we still have:
***** BB01
STMT00000 (IL 0x000...0x016)
N003 ( 5, 5) [000011] IA------R--- * ASG struct (init) $VN.Void
N002 ( 3, 2) [000009] D------N---- +--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1 d:1
N001 ( 1, 2) [000010] ------------ \--* CNS_INT int 0 $40
***** BB01
STMT00005 (IL ???... ???)
N007 ( 9, 8) [000037] -A--G---R--- * ASG simd16 (copy) $200
N006 ( 1, 1) [000035] D------N---- +--* LCL_VAR simd16<Vector128`1> V03 tmp2 d:1 $200
N005 ( 9, 8) [000004] -------N---- \--* HWINTRINSIC simd16 float ConvertToDouble $200
N004 ( 8, 7) [000003] ------------ \--* HWINTRINSIC simd8 float GetLower $1c0
N003 ( 7, 6) [000002] n----------- \--* OBJ simd16<Vector128`1> $81
N002 ( 1, 2) [000001] ------------ \--* ADDR byref $140
N001 ( 1, 1) [000000] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0 u:1 $80
***** BB01
STMT00006 (IL ???... ???)
N006 ( 8, 7) [000040] -A--G---R--- * ASG simd16 (copy) $84
N005 ( 1, 1) [000038] D------N---- +--* LCL_VAR simd16<Vector128`1> V04 tmp3 d:1 $84
N004 ( 8, 7) [000008] -------N---- \--* HWINTRINSIC simd16 float ConvertToDoubleUpper $84
N003 ( 7, 6) [000007] n----------- \--* OBJ simd16<Vector128`1> $83
N002 ( 1, 2) [000006] ------------ \--* ADDR byref $140
N001 ( 1, 1) [000005] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0 u:1 (last use) $80
***** BB01
STMT00003 (IL ???... ???)
N003 ( 5, 6) [000027] -A------R--- * ASG simd16 (copy) $200
N002 ( 3, 4) [000022] U------N---- +--* LCL_FLD simd16 V02 tmp1 ud:1->2[+0] Fseq[Item1] $300
N001 ( 1, 1) [000023] ------------ \--* LCL_VAR simd16<Vector128`1> V03 tmp2 u:1 (last use) $200
***** BB01
STMT00004 (IL ???... ???)
N003 ( 5, 6) [000034] -A------R--- * ASG simd16 (copy) $84
N002 ( 3, 4) [000029] U------N---- +--* LCL_FLD simd16 V02 tmp1 ud:2->3[+16] Fseq[Item2] $301
N001 ( 1, 1) [000030] ------------ \--* LCL_VAR simd16<Vector128`1> V04 tmp3 u:1 (last use) $84
***** BB01
STMT00002 (IL 0x016... ???)
N002 ( 4, 3) [000020] ------------ * RETURN struct $101
N001 ( 3, 2) [000019] -------N---- \--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1 u:3 (last use) $301
Noting this is without inlining. With inlining, the JIT sometimes does the right thing and other times does not.
Okay, thanks for checking - this is what I suspected would happen - we should work on the issue in .NET 7.
Changes LGTM.
Ubuntu-x64 improvements: dotnet/perf-autofiling-issues#2339