Investigate moving the types in System.Numerics.Vectors to use Hardware Intrinsics #956
I had done some basic/preliminary investigation some time back and the numbers looked promising: https://github.com/dotnet/corefx/issues/25386#issuecomment-361137982
I am marking this up for grabs in case someone from the community wants to work on this. If you want to take a stab, feel free to say so and I can assign the issue out. The "Microsoft DirectX Math" library is open-source, MIT-licensed, and has existing HWIntrinsic-accelerated code for many of these APIs (but for Native code): https://github.com/microsoft/directxmath. This would likely be a good starting point as the algorithms have been around for a long time, have been in active use in production code, and are known to generally be correct and performant.
FYI. @eerhardt, @danmosemsft, @CarolEidt, @fiigii
Also CC. @benaadams, who I know has looked at vectorizing/improving some of these types previously
@hughbe in case he is interested in a meaty task.
@tannergooding Thanks for opening this issue. I agree that porting `System.Numerics.Vectors` to use hardware intrinsics is worth doing.
@tannergooding Would you like to propose a "VectorMath" library for CoreFX?
IMHO we should take the opportunity, while rewriting `System.Numerics.Vectors`, to improve the implementation more broadly.
cc @CarolEidt @danmosemsft @eerhardt @fiigii @tannergooding
I don't believe this is needed. You should be able to load a `Vector128<float>` directly from a `Vector2`/`Vector3`/`Vector4`. Users will also have their own vector/numeric types, so we should focus on making sure that getting a `Vector128<T>` to and from such types is efficient.
I'm not sure why you think this is an issue. Could you elaborate?
There will likely need to be some minimal helper library that operates solely on `Vector128<T>` and `Vector256<T>`. It would need to be internal at first, and discussions on making it public could happen later.
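For illustration, here is a minimal sketch of what such an internal helper might look like. This is an assumption-laden example: the `VectorMath` name, the `MultiplyAdd` method, and its shape are made up for the sketch, not the actual design.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical internal helper that operates only on Vector128<float>.
internal static class VectorMath
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c)
    {
        if (Fma.IsSupported)
        {
            return Fma.MultiplyAdd(a, b, c);       // single vfmadd213ps
        }
        if (Sse.IsSupported)
        {
            return Sse.Add(Sse.Multiply(a, b), c); // mulps + addps
        }
        // A real helper would provide a software fallback here.
        throw new PlatformNotSupportedException();
    }
}
```

Methods like `Vector4.operator +` could then be thin wrappers that convert to/from `Vector128<float>` and call into the helper, keeping the ISA-specific logic in one place.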
Any port should be a two/three step process:

1. Port the existing code with no change in behavior.
2. Review and improve the algorithms where we can do better.
3. Add ARM/ARM64 support.

Doing them together makes it harder to review the code and track the changes. ARM/ARM64 support should definitely happen, but only after the APIs have been reviewed/approved/implemented.
That would be great if the runtime can eliminate the round trip. If not, the memory loads/stores may have considerable overhead when we have long chains of small operations.
Agree! Related to https://github.com/dotnet/coreclr/issues/19116
I would certainly hope that we can. From my view, anything that would prevent us from moving the `System.Numerics` types over to hardware intrinsics is a JIT limitation we should fix.
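As a rough illustration of the conversion under discussion, here is one way the bitcast could be written today. The helper name is made up for the example; whether the JIT keeps this in a register or bounces it through the stack is exactly the open question above.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class VectorConvert // illustrative only, not a real API
{
    // Widen a Vector2 (8 bytes) to a Vector4 (16 bytes), then reinterpret
    // those 16 bytes as a Vector128<float>. Ideally this compiles down to a
    // register move or two; today it may round-trip through memory.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> AsVector128(Vector2 value)
    {
        Vector4 wide = new Vector4(value, 0f, 0f);
        return Unsafe.As<Vector4, Vector128<float>>(ref wide);
    }
}
```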
This seems like a good enough excuse for me to take this on if nobody else is interested 😉

I'm at a bit of a loss how to start, though. Even something as trivial as `operator +` is currently recognized by the JIT as an intrinsic, so I'm not sure what the managed replacement should look like. The constructors are even more confusing to me, since they may be a setup for further SIMD operations or the fields may be immediately accessed. The JIT seems to know what to do about that now, but I can't imagine how managed code could get that right.

As far as the porting process, would the goal be to replace all of the existing `[Intrinsic]`-backed implementations with managed code that uses the new hardware intrinsics?
@saucecontrol, my initial experimentation can be found here: https://github.com/tannergooding/corefx/commit/f8625409692591fc7bc2ad5ac372f7bdbd724576
This is a case where, if the JIT isn't smart enough to do this, it probably should be. Most ABIs have some concept of passing small vector-like structs directly in SIMD registers, so the round trip through memory ought to be avoidable.
Yes, I think the initial goal should be to replace the existing `[Intrinsic]`-accelerated methods while keeping performance at least on par. I would expect that you will want some kind of mini "helper" library that only operates on `Vector128<T>` and `Vector256<T>`.
I think this is probably desirable, as it would keep the per-type code simple and centralize the intrinsics logic in one place.
dotnet/corefx#19116 deals with vectors wrapped in a class, which is different from wrapping them in a struct. The former involves escape analysis or stack allocation, while the latter involves appropriate struct promotion and copy elimination. The JIT has room for improvement in all of those areas.
Yes, I made a similar point in that issue. @CarolEidt Do you think this is a good time to improve struct promotion for struct-wrapped Vector128/256?
Thanks, @tannergooding. I found my way over to your experimental commit, and that gives me a good idea of the intended direction.
Wouldn't it be better to replace the non-intrinsic methods first as that's actually a gain? Or is it to test that the current intrinsic methods match up when using the new intrinsics?
@benaadams, right. I think we want to first make sure there is no loss in existing performance and then work on improving general performance on the rest of the functions after that.
I think it would be great to do that; I don't think it rises to the top of my priority list, but I would be supportive of others working on it.
@CarolEidt Thanks! Let me give it a try next week (after I finish my current work).
@saucecontrol, thanks! Let me know if you do want to pick this up "more officially" and we can add you as a contributor and assign the issue out 😄.

Note: Being assigned the issue wouldn't mean that you have any obligation to actually complete the work or any kind of deadline for it. It is mostly to communicate that someone is "actively" working on it. If you decide to drop it at a later time, that is completely fine, and we will unassign the issue so that someone else can pick it up. I would be on point for assisting if you have any questions/concerns or if you just need help (the same goes for anyone else if they decide they want to pick this up instead).
@tannergooding, sounds good to me! If you don't mind answering the occasional stupid question on Gitter, I should be able to get this done. Let's go ahead and make it official.
Feel free to ask as many questions as needed 😄 You should have gotten an invitation and the issue has been assigned to you.
I started coding on this tonight, and I ran into problems pretty quickly. Take the following test for example:

```csharp
public Vector2 AddemUp()
{
    var sum = Vector2.Zero;
    var vecs = ArrayOfVectors;
    for (int i = 0; i < vecs.Length; i++)
        sum += vecs[i];
    return sum;
}
```

With the SIMD intrinsics enabled, that loop compiles to:

```asm
G_M32824_IG03:
4C63C2 movsxd r8, edx
C4A17B1044C010 vmovsd xmm0, qword ptr [rax+8*r8+16]
C4E17B104C2428 vmovsd xmm1, qword ptr [rsp+28H]
C4E17058C8 vaddps xmm1, xmm0
C4E17B114C2428 vmovsd qword ptr [rsp+28H], xmm1
FFC2 inc edx
3BCA cmp ecx, edx
7FDD jg SHORT G_M32824_IG03
```

Which is not too bad. It would be better if it didn't spill `sum` to the stack on each iteration, though. Then I started my prototype simply, with only `operator +` implemented:

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector2 operator +(Vector2 left, Vector2 right)
{
#if HAS_INTRINSICS
    if (Sse2.IsSupported)
    {
        Vector2 vres = default;
        // Load the 8 bytes of each Vector2 as a scalar double, reinterpret
        // as 4 floats, add packed, then store the low 8 bytes back out.
        var vleft = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left.X));
        var vright = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right.X));
        Sse2.StoreScalar((double*)&vres.X, Sse.StaticCast<float, double>(Sse.Add(vleft, vright)));
        return vres;
    }
#endif
    return new Vector2(left.X + right.X, left.Y + right.Y);
}
```

And that produces the following for the same loop:

```asm
G_M32825_IG03:
C4E17A10442438 vmovss xmm0, dword ptr [rsp+38H]
C4E17A11442428 vmovss dword ptr [rsp+28H], xmm0
C4E17A1044243C vmovss xmm0, dword ptr [rsp+3CH]
C4E17A1144242C vmovss dword ptr [rsp+2CH], xmm0
4C63C2 movsxd r8, edx
4E8D44C010 lea r8, bword ptr [rax+8*r8+16]
C4C17A1000 vmovss xmm0, dword ptr [r8]
C4E17A11442420 vmovss dword ptr [rsp+20H], xmm0
C4C17A104004 vmovss xmm0, dword ptr [r8+4]
C4E17A11442424 vmovss dword ptr [rsp+24H], xmm0
C4E17857C0 vxorps xmm0, xmm0
C4E17A11442430 vmovss dword ptr [rsp+30H], xmm0
C4E17A11442434 vmovss dword ptr [rsp+34H], xmm0
C4E17A11442430 vmovss dword ptr [rsp+30H], xmm0
C4E17A11442434 vmovss dword ptr [rsp+34H], xmm0
4C8D442428 lea r8, bword ptr [rsp+28H]
C4C17B1000 vmovsd xmm0, xmmword ptr [r8]
4C8D442420 lea r8, bword ptr [rsp+20H]
C4C17B1008 vmovsd xmm1, xmmword ptr [r8]
4C8D442430 lea r8, bword ptr [rsp+30H]
C4E17858C1 vaddps xmm0, xmm0, xmm1
C4C17B1100 vmovsd xmmword ptr [r8], xmm0
4C8B442430 mov r8, qword ptr [rsp+30H]
4C89442438 mov qword ptr [rsp+38H], r8
FFC2 inc edx
3BCA cmp ecx, edx
0F8F6BFFFFFF jg G_M32825_IG03
```

My version is dramatically worse, and I'm wondering whether I've done something wrong. In order to disable the built-in SIMD support, I commented out the JIT code that recognizes the `Vector2` type. Would that have side effects beyond disabling the specific intrinsic methods I'm replacing? Also, I'm wondering what will have to be done to get the JIT to recognize the equivalent of `Vector2.Zero`, so it isn't repeatedly re-stored to the stack.
@saucecontrol - by commenting out the code that recognizes the Vector2 type, the JIT will treat them as regular structs. A better way would be to remove the intrinsics that you are implementing from hwintrinsiclistxarch.h. You will run into additional issues with members the JIT still recognizes as intrinsic, such as `Vector2.Zero`.
@CarolEidt, in this case I think it would need to be cut from simdintrinsiclist.h instead, since these are the older SIMD intrinsics rather than the new hardware intrinsics.

@saucecontrol, you will also get slightly better codegen if you don't initialize `vres`:

```csharp
Vector2 vres;
var vleft = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left.X));
var vright = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right.X));
Sse2.StoreScalar((double*)&vres, Sse.StaticCast<float, double>(Sse.Add(vleft, vright)));
return vres; which produces: G_M6507_IG03:
vmovsd xmm0, qword ptr [rsp+20H]
vmovsd qword ptr [rsp+18H], xmm0
movsxd r8, eax
lea r8, bword ptr [rdx+8*r8+16]
vmovss xmm0, dword ptr [r8]
vmovss dword ptr [rsp+10H], xmm0
vmovss xmm0, dword ptr [r8+4]
vmovss dword ptr [rsp+14H], xmm0
vxorps xmm0, xmm0
vmovss dword ptr [rsp+08H], xmm0
vmovss dword ptr [rsp+0CH], xmm0
lea r8, bword ptr [rsp+18H]
vmovsd xmm0, xmmword ptr [r8]
lea r8, bword ptr [rsp+10H]
vmovsd xmm1, xmmword ptr [r8]
vaddps xmm0, xmm0, xmm1
lea r8, bword ptr [rsp+08H]
vmovsd xmmword ptr [r8], xmm0
vmovsd xmm0, qword ptr [rsp+08H]
vmovsd qword ptr [rsp+20H], xmm0
inc eax
cmp ecx, eax
jg SHORT G_M6507_IG03
```
I still would like to say that the current approach is very inefficient (dotnet/corefx#31779 (comment)). Each operation on the wrapper types ends up loading its operands from memory and storing the result back. For example, the code below is generated from the PacketTracer benchmark, which uses hardware intrinsics directly. You can imagine the memory accesses that would be inserted around (almost) each SIMD instruction if we "rewrote" it in terms of the wrapper types:

```asm
vaddps ymm2, ymm2, ymmword ptr [r8+0x8]
vmulps ymm2, ymm2, ymm0
vaddps ymm2, ymm2, ymmword ptr [rax]
vmulps ymm2, ymm2, ymm3
vaddps ymm0, ymm2, ymm0
vaddps ymm0, ymm0, ymmword ptr [rcx]
vpaddd ymm1, ymm1, ymmword ptr [rdx]
vpslld ymm1, ymm1, 0x17
vmulps ymm0, ymm0, ymm1
vmulps ymm1, ymm0, ymmword ptr [rsp+0x600]
vmulps ymm2, ymm0, ymmword ptr [rsp+0x5e0]
vmulps ymm0, ymm0, ymmword ptr [rsp+0x5c0]
vmovupd ymmword ptr [rsp+0x260], ymm1 ;;; results are written back only after as many SIMD computations as possible have finished
vmovupd ymmword ptr [rsp+0x240], ymm2
vmovupd ymmword ptr [rsp+0x220], ymm0
```
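To make the contrast concrete, here is a small sketch (an assumed example, not taken from the benchmark): when the intermediate values stay typed as `Vector128<float>`, the JIT can chain the operations entirely in registers, whereas converting to a wrapper struct after each operator forces the kind of loads/stores shown above.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class LerpExample // illustrative only
{
    // a + (b - a) * t, computed entirely in the Vector128 domain:
    // subps + mulps + addps with no intermediate spills to memory.
    public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
        => Sse.Add(a, Sse.Multiply(Sse.Subtract(b, a), t));
}
```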
@fiigii, these are likely things that the JIT will have to fix. Without such fixes, only code in System.Private.CoreLib will be able to remain efficient, and any user-defined structs will be hit with these inefficiencies. CC @CarolEidt
AFAIK, that would be really difficult.
Because not everyone uses the types in System.Private.CoreLib; user-defined vector structs need to be efficient too.
There's an issue for it: https://github.com/dotnet/coreclr/issues/18542
First, without a full analysis of where the inefficiency is coming from in this case, it is hard to say how difficult this would be. On the other hand, the messiness of a conversion intrinsic would not only be a non-trivial amount of work, but would be an unfortunate way to gloss over a fundamental inefficiency. I think we need to analyze what's the real source of these inefficiencies (there are a number of candidates) and assess the cost of fixing them.
Importing @tannergooding's comments from dotnet/coreclr#21518 (comment):
@tannergooding Which methods are hitting this?
@fiigii, not sure yet. I hit the above as part of trying to remove the existing intrinsic support so the managed implementation would actually be used.
Ah, could you try to remove those ones from simdintrinsiclist.h?
Yes, that works. However, it requires modifying CoreCLR on every iteration (a longer inner loop) and impacts more than just the singular method a person may be working on (for example, a single entry in the list can cover several methods at once). I'd rather get the types actually respecting the `[Intrinsic]` attribute, so that removing the attribute from a method in managed code is all that's needed.
Agree.
I had done an initial investigation of this in dotnet/coreclr#27483, but it was closed due to generating more complex trees and therefore missing some optimizations. This is not something that could be readily handled today.

@CarolEidt, @jkotas, @AndyAyersMS. Unless you have any objections, I plan on closing this as unactionable until the blocking issues are addressed. Do we have issues currently tracking the needed improvements?

As an aside, with us looking at introducing more AOT/R2R options, the ISA-specific paths codified in the JIT are becoming more problematic and we are needing to block more methods due to this (e.g. #33090). This also means consumers can't see the benefits of the ISA-specific perf improvements when they are available. On the other hand, ISA-specific paths codified in managed code don't have this problem and we are able to support both paths relatively trivially. We also have better codegen support in general (especially around containment), and so addressing some of the currently blocking issues may become more necessary as we move forward.
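The "ISA-specific paths codified in managed code" pattern referred to here looks roughly like the following sketch (the `Square` method is a made-up example): each `IsSupported` property is a JIT-time constant, so only the taken branch survives in the generated code, and the same source covers every target.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class IsaDispatchExample // illustrative only
{
    public static Vector128<float> Square(Vector128<float> value)
    {
        if (Sse.IsSupported)
        {
            return Sse.Multiply(value, value); // dead-coded away on non-SSE targets
        }

        // Portable software fallback, element by element.
        Vector128<float> result = default;
        for (int i = 0; i < Vector128<float>.Count; i++)
        {
            float e = value.GetElement(i);
            result = result.WithElement(i, e * e);
        }
        return result;
    }
}
```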
I think that this can be kept open to keep track of the progress. It is something we want to do, it just cannot be done today. Agree with the rest.
Going to close this as most of the SIMD intrinsics have already been ported to use HWIntrinsics internally during importation.
Original issue description:

Today, hardware acceleration of certain types in the System.Numerics.Vectors project is achieved via `[Intrinsic]` attributes and corresponding runtime support. There are a few downsides to this approach.

In `netcoreapp30`, the new Hardware Intrinsics feature is supposed to ship. This feature also allows hardware acceleration, but at a much more fine-grained level (APIs typically have a 1-to-1 mapping with the underlying instructions).

We should investigate moving the types in System.Numerics.Vectors to use hardware intrinsics, as this has multiple potential benefits, including extending acceleration to members and types that are not currently accelerated (such as `Matrix4x4`).
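For a sense of how fine-grained the new APIs are, each call typically maps to a single instruction, as in this hypothetical snippet:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MappingExample // illustrative only
{
    public static Vector128<float> Example(Vector128<float> a, Vector128<float> b)
    {
        Vector128<float> sum  = Sse.Add(a, b);      // addps
        Vector128<float> prod = Sse.Multiply(a, b); // mulps
        return Sse.Max(sum, prod);                  // maxps
    }
}
```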