Move Vector4 to be implemented using HWIntrinsics on x86 #27483
Conversation
It's not quite perfect yet, namely because HWIntrinsic vectors don't support method constants but SIMD vectors do (this is tracked by https://github.com/dotnet/coreclr/issues/17225 and isn't too hard to add).
The HWIntrinsics are fully VEX aware, while other parts of the JIT (SIMD intrinsics, scalar floating-point ops, etc.) are not. This means that the allocated registers and the emitted disassembly are better (we should fix this regardless and make the register allocator and codegen fully VEX aware for any SIMD instruction):

Before:
    vxorps xmm0, xmm0
    vmovupd xmmword ptr [rcx], xmm0
    mov rax, rcx

After:
    vxorps xmm0, xmm0, xmm0
    vmovupd xmmword ptr [rcx], xmm0
    mov rax, rcx
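The PR doesn't show the source for this disassembly; a minimal sketch of the kind of method that would produce a zero-initialized Vector4 returned through a hidden return buffer (my assumption, purely for illustration):

```csharp
using System.Numerics;

static class ZeroRepro
{
    // The Vector4 result comes back via a hidden return buffer (rcx), which matches
    // the vmovupd xmmword ptr [rcx] / mov rax, rcx pattern in the disassembly above.
    static Vector4 GetZero() => new Vector4(0f, 0f, 0f, 0f);
}
```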
HWIntrinsics currently handle duplicated inputs better:

Example 1

Before:
    vmovupd xmm0, xmmword ptr [rcx]
    vmovupd xmm1, xmmword ptr [rcx]
    vdpps xmm0, xmm1, -15
    vsqrtss xmm0, xmm0

After:
    vmovupd xmm0, xmmword ptr [rcx]
    vmovaps xmm1, xmm0
    vdpps xmm0, xmm1, xmm0, -1
    vsqrtss xmm0, xmm0

Example 2

Before:
    vmovupd xmm0, xmmword ptr [rdx]
    vmovupd xmm1, xmmword ptr [rdx]
    vdpps xmm0, xmm1, -15
    vsqrtss xmm0, xmm0
    vmovupd xmm1, xmmword ptr [rdx]
    vbroadcastss xmm0, xmm0
    vdivps xmm1, xmm0
    vmovupd xmmword ptr [rcx], xmm1
    mov rax, rcx

After:
    vmovupd xmm0, xmmword ptr [rdx]
    vmovaps xmm1, xmm0
    vmovaps xmm2, xmm0
    vdpps xmm1, xmm1, xmm2, -1
    vsqrtss xmm1, xmm1
    vbroadcastss xmm1, xmm1
    vdivps xmm0, xmm0, xmm1
    vmovupd xmmword ptr [rcx], xmm0
    mov rax, rcx
The built-in codegen for constructing a vector from four scalars is also worse:

Before:
    vxorps xmm4, xmm4
    vmovss xmm4, xmm4, xmm3
    vpslldq xmm4, 4
    vmovss xmm4, xmm4, xmm2
    vpslldq xmm4, 4
    vmovss xmm4, xmm4, xmm1
    vpslldq xmm4, 4
    vmovss xmm4, xmm4, xmm0
    vmovaps xmm0, xmm4

After:
    vinsertps xmm0, xmm0, xmm1, 16
    vinsertps xmm0, xmm0, xmm2, 32
    vinsertps xmm0, xmm0, xmm3, 48
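A rough sketch of how the improved sequence can be expressed with the intrinsics directly, assuming SSE4.1 is available (the helper name is hypothetical; the PR itself just routes the existing Vector4 constructor through equivalent codegen):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class InsertRepro
{
    // Builds a Vector128<float> from four scalars using vinsertps, matching the
    // immediates 16 (0x10), 32 (0x20), and 48 (0x30) in the "After" disassembly.
    static Vector128<float> FromScalars(float x, float y, float z, float w)
    {
        Vector128<float> result = Vector128.CreateScalarUnsafe(x);
        result = Sse41.Insert(result, Vector128.CreateScalarUnsafe(y), 0x10);
        result = Sse41.Insert(result, Vector128.CreateScalarUnsafe(z), 0x20);
        result = Sse41.Insert(result, Vector128.CreateScalarUnsafe(w), 0x30);
        return result;
    }
}
```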
For indirect invocation cases, SIMDIntrinsics were always executing the software fallback code:

    G_M45532_IG02:
           vmovss xmm0, dword ptr [rcx]
           vucomiss xmm0, dword ptr [rdx]
           jp SHORT G_M45532_IG06
           jne SHORT G_M45532_IG06
    G_M45532_IG03:        ;; bbWeight=0.50
           vmovss xmm0, dword ptr [rcx+4]
           vucomiss xmm0, dword ptr [rdx+4]
           jp SHORT G_M45532_IG06
           jne SHORT G_M45532_IG06
           vmovss xmm0, dword ptr [rcx+8]
           vucomiss xmm0, dword ptr [rdx+8]
           jp SHORT G_M45532_IG06
           jne SHORT G_M45532_IG06
           vmovss xmm0, dword ptr [rcx+12]
           vucomiss xmm0, dword ptr [rdx+12]
           setnp al
           jp SHORT G_M45532_IG04
           sete al
    G_M45532_IG04:        ;; bbWeight=0.50
           movzx rax, al
    G_M45532_IG05:        ;; bbWeight=0.50
           ret
    G_M45532_IG06:        ;; bbWeight=0.50
           xor eax, eax
    G_M45532_IG07:        ;; bbWeight=0.50
           ret

With HWIntrinsics, they can be faster (which should also improve Tier 0 speed):

    vmovupd xmm0, xmmword ptr [rcx]
    vcmpps xmm0, xmm0, xmmword ptr [rdx], 0
    vmovmskps xrax, xmm0
    cmp eax, 15
    sete al
    movzx rax, al
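The source pattern behind that comparison is visible later in the diff for operator !=; the equality direction presumably looks like this (a sketch based on that snippet, not copied verbatim from the PR):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class EqualityRepro
{
    // Compare all four lanes, collapse to a 4-bit mask, and check that every bit is set.
    // This is what lowers to the vcmpps / vmovmskps / cmp eax, 15 sequence above.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static bool Equals(Vector4 left, Vector4 right)
    {
        if (Sse.IsSupported)
        {
            return Sse.MoveMask(Sse.CompareEqual(left.AsVector128(), right.AsVector128())) == 0b1111;
        }

        // Software fallback: element-wise comparison.
        return left.X == right.X && left.Y == right.Y && left.Z == right.Z && left.W == right.W;
    }
}
```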
CC. @danmosemsft, @CarolEidt, @jkotas. I've gotten an initial (rough) prototype done and it looks very promising. I'd like to get some more thoughts/opinions first and, if this is something we'd like to push forward, this can be the first step.
Also CC. @benaadams, @saucecontrol, @EgorBo, who I believe have expressed prior interest in this.
    {
        Vector128<float> tmp2 = Sse.Shuffle(vector2.AsVector128(), tmp, 0x40);
        tmp2 = Sse.Add(tmp2, tmp);
        tmp = Sse.Shuffle(tmp, tmp2, 0x30);
In the absence of the MM_SHUFFLE macro, I find binary literals are almost as easy to read, and certainly easier than hex. e.g. 0b_00_11_00_00 here
We should really just add a helper intrinsic for this; it helps readability immensely and is going to be used in most native code people might try to port.
I don't think we have an API proposal yet, however.
I've been waiting for the _MM_SHUFFLE API since the S.R.I beginning 🙂 I guess it should somehow only allow constant values as an input.
I guess it should somehow only allow constant values as an input.
There is currently no C# language way to enforce it; the only option today is analyzers. There is a C# proposal for runtime const unmanaged, but despite a preliminary OK from the language team it seems that interest has faded away (see dotnet/csharplang#744).
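For illustration, a minimal sketch of what such a helper could look like today, without any const enforcement (the name MM_SHUFFLE and where it lives are hypothetical; it just mirrors the C macro):

```csharp
using System.Runtime.CompilerServices;

static class ShuffleHelper
{
    // _MM_SHUFFLE(z, y, x, w) packs four 2-bit lane selectors into the immediate byte.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static byte MM_SHUFFLE(byte z, byte y, byte x, byte w)
        => (byte)((z << 6) | (y << 4) | (x << 2) | w);
}

// Usage: Sse.Shuffle(tmp, tmp2, ShuffleHelper.MM_SHUFFLE(0, 3, 0, 0)) instead of 0x30.
```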
This will generate a lot more complex trees. How likely is it to hit JIT complexity limits with this implementation and regress real-world code?
The trees (at least for Tier 1) are actually not that much more complex, and I think Tier 0 could be improved greatly by (if nothing else) just allowing dead code path elimination for the IsSupported checks.

Conversion between Vector4/Vector128 is a nop and is done entirely in importation, so it doesn't generate any nodes.

All but a single code-path are also dropped, which helps keep the trees smaller. Many operations were already 1-to-1, so with the conversion being elided they still stay as a single node with the same 1 or 2 child nodes as before.

There are some notable exceptions, such as the equality operators, which now import as a small sequence (compare, move-mask, integer compare) where it used to be a single SIMD intrinsic node.

But that should also allow other optimizations to start applying (where they couldn't easily have applied before).
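As a concrete example of a 1-to-1 operation (my sketch, not a verbatim excerpt from the PR diff), addition stays a single node once the AsVector128()/AsVector4() conversions and the dead IsSupported path are elided:

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AdditionSketch
{
    // The Sse path imports as a single Sse.Add node over the two (reinterpreted) operands;
    // the scalar path is dropped entirely when Sse.IsSupported is true at JIT time.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static Vector4 Add(Vector4 left, Vector4 right)
    {
        if (Sse.IsSupported)
        {
            return Sse.Add(left.AsVector128(), right.AsVector128()).AsVector4();
        }

        return new Vector4(left.X + right.X, left.Y + right.Y, left.Z + right.Z, left.W + right.W);
    }
}
```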
I'll get some JIT dumps to help share examples of how the trees changed.
Really nice to see how easy it is to add ARM implementations side by side. That's a good motivating factor for getting this done. Side note: is there an issue tracking containment not working with constant floating-point values? I see extra movs with Vector128.Create(1.0f) and the like.
Most of the issues are currently tracked here: https://github.com/dotnet/coreclr/projects/7. We have a few issues tracking areas where codegen is not optimal (including some around the constructors). I'm not sure if we have one specifically for broadcast.
For reference, here are the JITDumps for probably the simplest example: JitDump_base.txt

The most notable difference is the shape of the imported tree under the new implementation, where-as the base imports the whole operation as a single SIMD intrinsic node. There are some other notable differences visible in the dumps as well.
From the first glimpse, it looks like Vector3 could be handled that way quite easily, provided any transfers from/to memory are handled as the SIMD12 type. Essentially it would abstract Vector3 into Vector128 while in registers and return to the SIMD12 type once back in memory (the stack for the x64 target could be handled as a Vector128 due to 16-byte alignment, I guess). However, what is even more important, this approach immediately opens up the possibility to implement all double VectorX variants with Vector128/Vector256 HW intrinsics, which has been a big ask from the community. Furthermore, there is a clear path to some excellent further "vectorization" opportunities, particularly for matrix math with AVX2 and AVX512 intrinsics. Very encouraging work, indeed.
Right. There are conversion operators for Vector2/3/4 to and from Vector128/256 now. This PR does the JIT hookup for Vector4 and showcases it working (and I'll be doing the other hookups, without rewriting things, separately to help get things in). It should work very similarly for Vector2/3 and provide an easy way to perform specialization on existing System.Numerics code without needing to rewrite everything to HWIntrinsics (instead you only need to specialize where System.Numerics doesn't support the functionality today).
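To make the Vector3 idea concrete, a hedged sketch of what such an operator might look like, assuming AsVector128()/AsVector3() conversions analogous to the Vector4 ones and leaving the fourth lane to the JIT's SIMD12 handling:

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector3Sketch
{
    // In registers the value is treated as a full Vector128<float>; loads and stores
    // would still move only 12 bytes via the TYP_SIMD12 handling described above.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static Vector3 Add(Vector3 left, Vector3 right)
    {
        if (Sse.IsSupported)
        {
            return Sse.Add(left.AsVector128(), right.AsVector128()).AsVector3();
        }

        return new Vector3(left.X + right.X, left.Y + right.Y, left.Z + right.Z);
    }
}
```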
Let me see if I understand this correctly. You are saying that everything is fine and dandy, and the simple fact that the JIT diff dump is more than 50% larger than the base doesn't sound any alarms. Basically it goes from simply recognizing the intrinsic call and producing a single SIMD node to importing and expanding all of the IL behind it.

The only reason this isn't an unmitigated disaster is that the importer manages to ignore the dead IL. Though it still does extra work, such as creating a bunch more useless basic blocks. And this tree that you claim to be smaller looks cute in isolation, but it's a freakin' forest in reality: a whole sequence of statements and temps instead of the single tree the current implementation produces.

And the current JIT implementation is able to maintain some more complex vector expressions as a single tree. I don't know what this will generate with the new implementation; if the simplest example generates a forest, this must be generating a jungle or something. So yeah, it would be good to somehow implement SIMD intrinsics on top of HWIntrinsics, but this approach is likely to come with a few elephants, tigers and monkeys in the baggage.
It's a good point to keep checks on what the JIT is generating at each phase; however, since none of this is nature's creation but rather a developer's creation, it should be manageable. I would rather argue that the current implementation of all the Vector SIMD functionality is a bit odd, given that it can be abstracted much better based on HW intrinsics.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator !=(Vector4 left, Vector4 right)
    {
        if (Sse.IsSupported)
        {
            return Sse.MoveMask(Sse.CompareNotEqual(left.AsVector128(), right.AsVector128())) != 0;
When you just call !(left == right) (as in the !Sse.IsSupported path), will the codegen be affected?
It shouldn't be. The == operator is also aggressively inlined and the Sse.IsSupported path will be dropped from there as well, so it should be identical to the code it was previously generating.
Yes, and I had taken a look at the diff. A very large chunk of the size difference is the inliner spewing information about each of the intrinsic methods it considers.

This is a general problem with intrinsics and inlining. Having the IsSupported checks dropped during importation at least keeps the dead paths out of the trees themselves.

The extra statements and temps are, I believe, again a general problem with intrinsics and inlining. Due to their being explicit method calls, the JIT initially tracks the two input arguments as "address taken" and requiring a copy. The return value is also explicitly tracked separately.
Right. There will likely be a few gotchas and some improvements needed in the JIT (but this was already true for code using intrinsics, whether SIMD or HW). What I'm trying to solve is that we currently have some more general TYP_SIMD logic in the JIT and then the S.Numerics and S.R.Intrinsics support. The support for the latter two is very similar, but it is different. Now, we could go and try to share some of this infrastructure in the JIT instead, but I think there would still be some disparity there, which would make this less than ideal.
This is not quite true. As seen in #24912, even though these checks are elided in the importer, the additional basic blocks aren't coalesced until later in the JIT. This means that local assertion prop doesn't operate across the blocks, and some optimizations are missed, since global assertion prop is not identical. I attempted two different approaches to fixing this (https://github.com/CarolEidt/coreclr/tree/CopyProp1 and https://github.com/CarolEidt/coreclr/tree/CopyProp2), but both had mixed results. Regarding the discussion above wrt the "general problem" of inlining intrinsic methods, this problem remains to be solved (or at least sufficiently mitigated) before we accept it as a given and make changes that incur these costs where they previously were less of an issue. I would reiterate what I've said before: I think this is the right direction long-term, but we need to be driven by specific requirements and priorities, and I think it's important that we don't go down this path prematurely.
There might be point fixes we can make, but in general the importer doesn't have enough context to safely do this. I don't see annotations or similar as the right solution. I think a better plan involves some sort of post-inline cleanup pass to coalesce blocks and rebuild trees, and a more detailed assessment of how the optimizer scales with large numbers of temps, so we can relax limits or back off opts gradually rather than the current all-or-nothing approach. In most cases where we see HW intrinsics or vectors used, we can safely assume the method is important for performance. We already have a bit of this hinting in the inliner, but it's pretty simplistic.
Will go ahead and close this for now. It was mostly just a prototype to validate that this was indeed possible and produced good codegen 😄. There are some obvious problems that still exist and it might be good to ensure that explicit bugs exist to track those, but we can also revisit this after some more fixes go through.
This prototypes: https://github.com/dotnet/corefx/issues/31425

There are a few benefits to switching these to being implemented using HWIntrinsics:

The simdcodegenxarch.cpp and simd.cpp logic could be removed. The only thing really needed is the better codegen for accessing the public fields and the JIT support for mapping these types to the TYP_SIMD8/12/16 types in the JIT (so conversions are done appropriately).
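A sketch of what the Vector4/Vector128 conversions described above amount to in practice: pure reinterpretations that the JIT can drop entirely once Vector4 maps onto TYP_SIMD16. The Unsafe.As implementation below is my illustration of the intended behavior, not the actual corefx code, and the method names are hypothetical:

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class VectorReinterpret
{
    // Both directions are bitwise reinterpretations of a 16-byte value, so once the
    // JIT types Vector4 as TYP_SIMD16 they can be imported as no-ops (no nodes at all).
    public static Vector128<float> ToVector128(in Vector4 value)
        => Unsafe.As<Vector4, Vector128<float>>(ref Unsafe.AsRef(in value));

    public static Vector4 ToVector4(in Vector128<float> value)
        => Unsafe.As<Vector128<float>, Vector4>(ref Unsafe.AsRef(in value));
}
```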