Optimize some Matrix4x4 operations with SSE #31779
Conversation
Possibly outside the scope of this PR, but it would be nice if we had some perf tests for these operations.
CC. @eerhardt, @ViktorHofer. Also CC. @benaadams, who has looked at some of this before.
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
var m1Row = Sse.LoadVector128(&matrix1.M11);
var m2Row = Sse.LoadVector128(&matrix2.M11);
Sse.Store(&matrix1.M11, Sse.Add(m1Row, Sse.Multiply(Sse.Subtract(m2Row, m1Row), amountVec)));
I mentioned this somewhat here, but it would be nice if we had some of these functions extracted to a mini-helper library.
These lines are basically a VectorLerp function, which would allow something like:
Sse.Store(&matrix1.M11, VectorMath.Lerp(&matrix1.M11, &matrix2.M11, amountVec));
Sse.Store(&matrix1.M21, VectorMath.Lerp(&matrix1.M21, &matrix2.M21, amountVec));
Sse.Store(&matrix1.M31, VectorMath.Lerp(&matrix1.M31, &matrix2.M31, amountVec));
Sse.Store(&matrix1.M41, VectorMath.Lerp(&matrix1.M41, &matrix2.M41, amountVec));
or something similar (and would also allow reuse elsewhere, such as in Vector4.Lerp).
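A minimal sketch of what such a helper might look like, assuming the 2018-era System.Runtime.Intrinsics.X86 surface and pointer parameters matching the call sites above (the class and signature here are illustrative, not the PR's final code):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class VectorMath
{
    // per-lane lerp: a + (b - a) * t
    public static Vector128<float> Lerp(float* a, float* b, Vector128<float> t)
    {
        Vector128<float> va = Sse.LoadVector128(a);
        Vector128<float> vb = Sse.LoadVector128(b);
        return Sse.Add(va, Sse.Multiply(Sse.Subtract(vb, va), t));
    }
}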
Same for all the operations below.
result.M44 = -value.M44;
if (Sse.IsSupported)
{
    var zero = Sse.SetAllVector128(0f);
Please use SetZeroVector128 instead.
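i.e., roughly the following, assuming SetZeroVector128 here returns a zeroed Vector128<float> (the API surface of that era):

var zero = Sse.SetZeroVector128(); // the xorps-style zero idiom, rather than broadcasting a 0f constant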
result.M12 = matrix1.M12 + (matrix2.M12 - matrix1.M12) * amount;
result.M13 = matrix1.M13 + (matrix2.M13 - matrix1.M13) * amount;
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
var m1Row = Sse.LoadVector128(&matrix1.M11);
Do we need to have [StructLayout(LayoutKind.Sequential)] on the Matrix4x4 type? That way we are assured the members will be laid out in the assumed order.
The Roslyn compiler emits StructLayout=Sequential by default for struct types. However, last I checked there wasn't anything enforcing this behavior in the language spec. It would probably be a good idea to explicitly mark the numeric types as Sequential (as we do with types like Single and Int32).
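A sketch of the explicit annotation being suggested (the field list mirrors the shape of the type rather than the full source file):

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Matrix4x4
{
    public float M11, M12, M13, M14; // row 1
    public float M21, M22, M23, M24; // row 2
    public float M31, M32, M33, M34; // row 3
    public float M41, M42, M43, M44; // row 4
}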
Matrix4x4 m = default;
Sse.Store(&m.M11,
    Sse.Add(Sse.Add(Sse.Multiply(Sse.SetAllVector128(value1.M11), Sse.LoadVector128(&value2.M11)),
Rather than calling SetAllVector128 multiple times, it is probably better to do a single row load and permute (broadcast or permute in AVX, shuffle in SSE).
btw, it'd be nice to expose the _MM_SHUFFLE macro to C# for Shuffle,
e.g. static byte _MM_SHUFFLE(byte fp3, byte fp2, byte fp1, byte fp0) => (byte)(((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)));
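With that helper, the load-and-broadcast pattern suggested above might look like this sketch (unsafe context assumed, value1 as in the surrounding diff):

var row = Sse.LoadVector128(&value1.M11);                 // load the whole row once
var m11 = Sse.Shuffle(row, row, _MM_SHUFFLE(0, 0, 0, 0)); // broadcast lane 0 (M11) to all lanes
var m12 = Sse.Shuffle(row, row, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane 1 (M12)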
I've raised it before. It should go to API review soon
Agreed. @tannergooding and I had the same conclusion during dotnet/machinelearning#562. However, I didn't see an issue open for this in either corefx or coreclr. @tannergooding did you ever open one?
Here is what we proposed at the time:
`Sse.Shuffle(op1, op2, Sse.GetShuffleControl(0, 3, 1, 2))`
Something like that. `GetShuffleControl`, `BuildShuffleControl`, or something
I'll take a look at the assembly and see if I can spot what is going wrong....
Logged a bug to track this here: https://github.com/dotnet/corefx/issues/31825
@tannergooding any update on the _MM_SHUFFLE API? Could not find an issue for it.
m.M44 = -value.M44;
return m;
if (Sse.IsSupported)
Why is this not just calling the Negate method (or vice-versa)?
Same for the operators below
will Negate be inlined?
I believe the JIT will inline it so that the inner call is called directly, rather than trying to inline the inner call into the outer call.
If it doesn't, then this sounds like a good JIT bug to log.
Also CC. @CarolEidt
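For reference, the thin-wrapper shape being discussed is roughly the following sketch (assuming the actual SSE/software implementation lives in the operator):

public static Matrix4x4 Negate(Matrix4x4 value) => -value; // compiles to a direct call to op_UnaryNegation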
value1.M31 != value2.M31 || value1.M32 != value2.M32 || value1.M33 != value2.M33 || value1.M34 != value2.M34 ||
value1.M41 != value2.M41 || value1.M42 != value2.M42 || value1.M43 != value2.M43 || value1.M44 != value2.M44);
}
public static bool operator !=(Matrix4x4 value1, Matrix4x4 value2) => !value1.Equals(value2);
I am not sure about this one, will check the jit output
}
/// <summary>
/// Returns a new matrix with the negated elements of the given matrix.
/// </summary>
/// <param name="value">The source matrix.</param>
/// <returns>The negated matrix.</returns>
public static Matrix4x4 Negate(Matrix4x4 value)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
I'm not sure about this. The code is probably small enough to be inlined in the hardware accelerated case, but not necessarily in the software case.
Same goes for the other new AggressiveInlining attributes.
yeah, but without that it basically adds a call to the software Negate operator. Are you suggesting duplicating the software implementation in both the - operator and Negate, i.e. something like if (Sse.IsSupported) { return -value; } else { /* duplicate software impl */ }?
What is the codegen you are seeing here?
@tannergooding Negate is not inlined 🙁 https://gist.github.com/EgorBo/a6cc52d5523d7a5fcf6fb1adfa5dfbc0
Negate_new is a benchmark method that does return Matrix4x4X.Negate(m1x);
Ok, using jitutils to get PMI diffs, with public static unsafe Matrix4x4 Negate(Matrix4x4 value) => -value; and public static unsafe Matrix4x4 operator -(Matrix4x4 value) containing the actual implementation....
The following two tests (aside from not being "correct"):
[Fact]
public void Matrix4x4NegateTest()
{
Matrix4x4 m = GenerateMatrixNumberFrom1To16();
var actual = Matrix4x4.Negate(m);
Assert.Equal(m, actual);
}
[Fact]
public void Matrix4x4op_NegateTest()
{
Matrix4x4 m = GenerateMatrixNumberFrom1To16();
var actual = -m;
Assert.Equal(m, actual);
}
Both generate a direct call to Matrix4x4:op_UnaryNegation(struct):struct:
; Assembly listing for method Matrix4x4Tests:Matrix4x4NegateTest():this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd
; V01 loc0 [V01 ] ( 3, 3 ) struct (64) [rsp+0x1A8] do-not-enreg[XSB] must-init addr-exposed
; V02 loc1 [V02 ] ( 2, 2 ) struct (64) [rsp+0x168] do-not-enreg[XSB] must-init addr-exposed
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
; V04 tmp1 [V04,T01] ( 2, 4 ) struct (64) [rsp+0x128] do-not-enreg[SB]
; V05 tmp2 [V05,T02] ( 2, 4 ) struct (64) [rsp+0xE8] do-not-enreg[SB]
; V06 tmp3 [V06,T03] ( 2, 4 ) struct (64) [rsp+0xA8] do-not-enreg[SB]
; V07 tmp4 [V07,T00] ( 3, 6 ) ref -> rsi class-hnd exact
; V08 tmp5 [V08 ] ( 4, 8 ) struct (64) [rsp+0x68] do-not-enreg[XSB] addr-exposed
; V09 tmp6 [V09,T04] ( 2, 4 ) long -> rcx
; V10 tmp7 [V10 ] ( 2, 4 ) struct (64) [rsp+0x28] do-not-enreg[XSB] addr-exposed
;
; Lcl frame size = 488
G_M53037_IG01:
push rdi
push rsi
sub rsp, 488
vzeroupper
mov rsi, rcx
lea rdi, [rsp+168H]
mov ecx, 32
xor rax, rax
rep stosd
mov rcx, rsi
G_M53037_IG02:
lea rcx, bword ptr [rsp+1A8H]
call Matrix4x4Tests:GenerateMatrixNumberFrom1To16():struct
vmovdqu xmm0, qword ptr [rsp+1A8H]
vmovdqu qword ptr [rsp+128H], xmm0
vmovdqu xmm0, qword ptr [rsp+1B8H]
vmovdqu qword ptr [rsp+138H], xmm0
vmovdqu xmm0, qword ptr [rsp+1C8H]
vmovdqu qword ptr [rsp+148H], xmm0
vmovdqu xmm0, qword ptr [rsp+1D8H]
vmovdqu qword ptr [rsp+158H], xmm0
lea rcx, bword ptr [rsp+168H]
vmovdqu xmm0, qword ptr [rsp+128H]
vmovdqu qword ptr [rsp+68H], xmm0
vmovdqu xmm0, qword ptr [rsp+138H]
vmovdqu qword ptr [rsp+78H], xmm0
vmovdqu xmm0, qword ptr [rsp+148H]
vmovdqu qword ptr [rsp+88H], xmm0
vmovdqu xmm0, qword ptr [rsp+158H]
vmovdqu qword ptr [rsp+98H], xmm0
lea rdx, bword ptr [rsp+68H]
call Matrix4x4:op_UnaryNegation(struct):struct
vmovdqu xmm0, qword ptr [rsp+1A8H]
vmovdqu qword ptr [rsp+E8H], xmm0
vmovdqu xmm0, qword ptr [rsp+1B8H]
vmovdqu qword ptr [rsp+F8H], xmm0
vmovdqu xmm0, qword ptr [rsp+1C8H]
vmovdqu qword ptr [rsp+108H], xmm0
vmovdqu xmm0, qword ptr [rsp+1D8H]
vmovdqu qword ptr [rsp+118H], xmm0
vmovdqu xmm0, qword ptr [rsp+168H]
vmovdqu qword ptr [rsp+A8H], xmm0
vmovdqu xmm0, qword ptr [rsp+178H]
vmovdqu qword ptr [rsp+B8H], xmm0
vmovdqu xmm0, qword ptr [rsp+188H]
vmovdqu qword ptr [rsp+C8H], xmm0
vmovdqu xmm0, qword ptr [rsp+198H]
vmovdqu qword ptr [rsp+D8H], xmm0
mov rcx, 0xD1FFAB1E
call CORINFO_HELP_NEWSFAST
mov rsi, rax
mov rcx, rsi
xor rdx, rdx
call AssertEqualityComparer`1:.ctor(ref):this
vmovdqu xmm0, qword ptr [rsp+E8H]
vmovdqu qword ptr [rsp+68H], xmm0
vmovdqu xmm0, qword ptr [rsp+F8H]
vmovdqu qword ptr [rsp+78H], xmm0
vmovdqu xmm0, qword ptr [rsp+108H]
vmovdqu qword ptr [rsp+88H], xmm0
vmovdqu xmm0, qword ptr [rsp+118H]
vmovdqu qword ptr [rsp+98H], xmm0
vmovdqu xmm0, qword ptr [rsp+A8H]
vmovdqu qword ptr [rsp+28H], xmm0
vmovdqu xmm0, qword ptr [rsp+B8H]
vmovdqu qword ptr [rsp+38H], xmm0
vmovdqu xmm0, qword ptr [rsp+C8H]
vmovdqu qword ptr [rsp+48H], xmm0
vmovdqu xmm0, qword ptr [rsp+D8H]
vmovdqu qword ptr [rsp+58H], xmm0
G_M53037_IG03:
lea rcx, bword ptr [rsp+68H]
lea rdx, bword ptr [rsp+28H]
mov r8, rsi
call Assert:Equal(struct,struct,ref)
nop
G_M53037_IG04:
add rsp, 488
pop rsi
pop rdi
ret
; Total bytes of code 579, prolog size 35 for method Matrix4x4Tests:Matrix4x4NegateTest():this
; ============================================================
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0xd1ffab1e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x09
CountOfUnwindCodes: 4
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x09 UnwindOp: UWOP_ALLOC_LARGE (1) OpInfo: 0 - Scaled small
Size: 61 * 8 = 488 = 0x001E8
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
and
; Assembly listing for method Matrix4x4Tests:Matrix4x4op_NegateTest():this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd
; V01 loc0 [V01 ] ( 3, 3 ) struct (64) [rsp+0x168] do-not-enreg[XSB] must-init addr-exposed
; V02 loc1 [V02 ] ( 2, 2 ) struct (64) [rsp+0x128] do-not-enreg[XSB] must-init addr-exposed
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
; V04 tmp1 [V04,T01] ( 2, 4 ) struct (64) [rsp+0xE8] do-not-enreg[SB]
; V05 tmp2 [V05,T02] ( 2, 4 ) struct (64) [rsp+0xA8] do-not-enreg[SB]
; V06 tmp3 [V06,T00] ( 3, 6 ) ref -> rsi class-hnd exact
; V07 tmp4 [V07 ] ( 4, 8 ) struct (64) [rsp+0x68] do-not-enreg[XSB] addr-exposed
; V08 tmp5 [V08,T03] ( 2, 4 ) long -> rcx
; V09 tmp6 [V09 ] ( 2, 4 ) struct (64) [rsp+0x28] do-not-enreg[XSB] addr-exposed
;
; Lcl frame size = 424
G_M27564_IG01:
push rdi
push rsi
sub rsp, 424
vzeroupper
mov rsi, rcx
lea rdi, [rsp+128H]
mov ecx, 32
xor rax, rax
rep stosd
mov rcx, rsi
G_M27564_IG02:
lea rcx, bword ptr [rsp+168H]
call Matrix4x4Tests:GenerateMatrixNumberFrom1To16():struct
lea rcx, bword ptr [rsp+128H]
vmovdqu xmm0, qword ptr [rsp+168H]
vmovdqu qword ptr [rsp+68H], xmm0
vmovdqu xmm0, qword ptr [rsp+178H]
vmovdqu qword ptr [rsp+78H], xmm0
vmovdqu xmm0, qword ptr [rsp+188H]
vmovdqu qword ptr [rsp+88H], xmm0
vmovdqu xmm0, qword ptr [rsp+198H]
vmovdqu qword ptr [rsp+98H], xmm0
lea rdx, bword ptr [rsp+68H]
call Matrix4x4:op_UnaryNegation(struct):struct
vmovdqu xmm0, qword ptr [rsp+168H]
vmovdqu qword ptr [rsp+E8H], xmm0
vmovdqu xmm0, qword ptr [rsp+178H]
vmovdqu qword ptr [rsp+F8H], xmm0
vmovdqu xmm0, qword ptr [rsp+188H]
vmovdqu qword ptr [rsp+108H], xmm0
vmovdqu xmm0, qword ptr [rsp+198H]
vmovdqu qword ptr [rsp+118H], xmm0
vmovdqu xmm0, qword ptr [rsp+128H]
vmovdqu qword ptr [rsp+A8H], xmm0
vmovdqu xmm0, qword ptr [rsp+138H]
vmovdqu qword ptr [rsp+B8H], xmm0
vmovdqu xmm0, qword ptr [rsp+148H]
vmovdqu qword ptr [rsp+C8H], xmm0
vmovdqu xmm0, qword ptr [rsp+158H]
vmovdqu qword ptr [rsp+D8H], xmm0
mov rcx, 0xD1FFAB1E
call CORINFO_HELP_NEWSFAST
mov rsi, rax
mov rcx, rsi
xor rdx, rdx
call AssertEqualityComparer`1:.ctor(ref):this
vmovdqu xmm0, qword ptr [rsp+E8H]
vmovdqu qword ptr [rsp+68H], xmm0
vmovdqu xmm0, qword ptr [rsp+F8H]
vmovdqu qword ptr [rsp+78H], xmm0
vmovdqu xmm0, qword ptr [rsp+108H]
vmovdqu qword ptr [rsp+88H], xmm0
vmovdqu xmm0, qword ptr [rsp+118H]
vmovdqu qword ptr [rsp+98H], xmm0
vmovdqu xmm0, qword ptr [rsp+A8H]
vmovdqu qword ptr [rsp+28H], xmm0
vmovdqu xmm0, qword ptr [rsp+B8H]
vmovdqu qword ptr [rsp+38H], xmm0
vmovdqu xmm0, qword ptr [rsp+C8H]
vmovdqu qword ptr [rsp+48H], xmm0
vmovdqu xmm0, qword ptr [rsp+D8H]
vmovdqu qword ptr [rsp+58H], xmm0
lea rcx, bword ptr [rsp+68H]
lea rdx, bword ptr [rsp+28H]
mov r8, rsi
call Assert:Equal(struct,struct,ref)
nop
G_M27564_IG03:
add rsp, 424
pop rsi
pop rdi
ret
; Total bytes of code 499, prolog size 35 for method Matrix4x4Tests:Matrix4x4op_NegateTest():this
; ============================================================
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0xd1ffab1e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x09
CountOfUnwindCodes: 4
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x09 UnwindOp: UWOP_ALLOC_LARGE (1) OpInfo: 0 - Scaled small
Size: 53 * 8 = 424 = 0x001A8
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
Which is exactly what I expected.
@EgorBo, could you fix them up accordingly, since the JIT already does the right thing.
@tannergooding um.. so how should I fix it without regressing the software fallback (will it be inlined into Negate if SSE is not available)? Isn't the current implementation what you meant?
corefx/src/System.Numerics.Vectors/src/System/Numerics/Matrix4x4.cs
Lines 1855 to 1881 in 21ca6ac
if (Sse.IsSupported)
{
    return -value;
}
else
{
    Matrix4x4 result;
    result.M11 = -value.M11;
    result.M12 = -value.M12;
    result.M13 = -value.M13;
    result.M14 = -value.M14;
    result.M21 = -value.M21;
    result.M22 = -value.M22;
    result.M23 = -value.M23;
    result.M24 = -value.M24;
    result.M31 = -value.M31;
    result.M32 = -value.M32;
    result.M33 = -value.M33;
    result.M34 = -value.M34;
    result.M41 = -value.M41;
    result.M42 = -value.M42;
    result.M43 = -value.M43;
    result.M44 = -value.M44;
    return result;
}
I'm not sure what you mean; in both cases public static unsafe Matrix4x4 Negate(Matrix4x4 value) => -value; will result in a call directly to op_Negate. The JIT will then determine if op_Negate is appropriate to inline or not (currently it does not). It will result in code that is simpler to maintain and produces the same codegen as is being produced today (for Release builds).
return m;
}
public static Matrix4x4 operator -(Matrix4x4 value) => Negate(value);
Nit: I think the normal case would be to have the implementation in the operator and have Negate call it (more people likely use the operator than call the "friendly name").
Makes sense!
remove [MethodImpl(MethodImplOptions.AggressiveInlining)], call operators inside static methods if SSE is enabled (operators should be small enough to be inlined)
result.M12 = matrix1.M12 + (matrix2.M12 - matrix1.M12) * amount;
result.M13 = matrix1.M13 + (matrix2.M13 - matrix1.M13) * amount;
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
if (Sse.IsSupported)
General question, more for @tannergooding. What is our test strategy for the software fallback code? I guess on x86/x64 Sse.IsSupported is always true for us, so the test strategy is: tests run on ARM?
If we later have a codepath for ARM intrinsics also, we will need a new strategy.
Perhaps there's a way to force the runtime to lie and return false for this.
btw, initially I wanted to do if (Sse.IsSupported) {} else if (Arm.Simd.IsSupported) {} else {} 🙂
Setting COMPlus_FeatureSIMD=0 should cover the Sse-Avx2 HWIntrinsics. I don't think we have a switch to cover all HWIntrinsics, however.
CC. @CarolEidt
The ARM intrinsics still need to be reviewed and implemented, before they can be used.
if (Sse.IsSupported)
{
    return
        Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) != 0xF ||
MoveMask(CompareNotEqual()) != 0 is probably more efficient.
just tested, you are right: it's 30% more efficient. Also, I tried to add some fast-out paths (compare the first field or first row as a simple m1.M11 == m2.M11 && ...) but it gave a minor improvement in some cases and a major regression in others.
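A sketch of the per-row test with that change (unsafe context assumed):

// CompareNotEqual sets all bits in a lane that differs, so MoveMask is non-zero iff any lane differs
bool rowDiffers = Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11),
                                                   Sse.LoadVector128(&value2.M11))) != 0;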
@@ -1824,26 +1852,33 @@ public static Matrix4x4 Lerp(Matrix4x4 matrix1, Matrix4x4 matrix2, float amount)
/// <returns>The negated matrix.</returns>
public static Matrix4x4 Negate(Matrix4x4 value)
Not sure if you saw, since the thread is now "hidden" by GitHub, but I commented here (with disassembly using the current head of CoreCLR) that the JIT does the right thing for inlining public static Matrix4x4 Negate(Matrix4x4 value) => -value.
yeah it's hard to follow when github collapses them 🙂
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) != 0xF ||
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) != 0xF ||
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) != 0xF;
Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) == 0 ||
Pretty sure this one should be != 0. CompareNotEqual does (a != b) ? 0xFFFFFFFF : 0, so it will be non-zero if one of the rows in value1 != the same row in value2.
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) == 0xF &&
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) == 0xF &&
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) == 0xF;
Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) != 0 &&
Pretty sure this one should be == 0. CompareNotEqual does (a != b) ? 0xFFFFFFFF : 0, so it will be zero if all of the rows in value1 == the same rows in value2.
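Putting both corrections together, the equality check would read as follows (a sketch, unsafe context assumed):

return
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) == 0;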
@@ -2346,9 +2181,9 @@ public bool Equals(Matrix4x4 other)
/// <returns>True if the Object is equal to this matrix; False otherwise.</returns>
public override bool Equals(object obj)
{
if (obj is Matrix4x4)
if (obj is Matrix4x4 m)
Nit: You could do this in one line with (obj is Matrix4x4 other) && (this == other)
Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34));
Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12));
return matrix;
We also end up with a "sub-optimal" return block:
vmovdqu xmm0, qword ptr [rdx]
vmovdqu qword ptr [rcx], xmm0
vmovdqu xmm0, qword ptr [rdx+16]
vmovdqu qword ptr [rcx+16], xmm0
vmovdqu xmm0, qword ptr [rdx+32]
vmovdqu qword ptr [rcx+32], xmm0
vmovdqu xmm0, qword ptr [rdx+48]
vmovdqu qword ptr [rcx+48], xmm0
I would hope we could be smart enough to have the above stores go directly to the return buffer...
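For context, the classic SSE 4x4 transpose that these stores come from looks roughly like this (a sketch assuming row-major storage and an unsafe context; lane contents are noted in the comments):

var row1 = Sse.LoadVector128(&matrix.M11);
var row2 = Sse.LoadVector128(&matrix.M21);
var row3 = Sse.LoadVector128(&matrix.M31);
var row4 = Sse.LoadVector128(&matrix.M41);

var l12 = Sse.UnpackLow(row1, row2);  // M11 M21 M12 M22
var l34 = Sse.UnpackLow(row3, row4);  // M31 M41 M32 M42
var h12 = Sse.UnpackHigh(row1, row2); // M13 M23 M14 M24
var h34 = Sse.UnpackHigh(row3, row4); // M33 M43 M34 M44

Sse.Store(&matrix.M11, Sse.MoveLowToHigh(l12, l34)); // M11 M21 M31 M41
Sse.Store(&matrix.M21, Sse.MoveHighToLow(l34, l12)); // M12 M22 M32 M42
Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34)); // M13 M23 M33 M43
Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12)); // M14 M24 M34 M44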
{
    public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
    {
        if (Sse.IsSupported)
It's probably sufficient to just Debug.Assert(Sse.IsSupported), since this is internal and we have already checked the code higher up.
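i.e., something like this sketch (using System.Diagnostics for the assert):

public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
{
    Debug.Assert(Sse.IsSupported); // callers have already branched on Sse.IsSupported
    return Sse.Add(a, Sse.Multiply(Sse.Subtract(b, a), t));
}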
IMO, the current implementation has a very inefficient code pattern that loads the fields of Matrix4x4 on every operation. So, the more efficient implementation would be defining the rows as Vector128<float> fields.
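As I read it, the suggestion amounts to a hypothetical layout like the following (not the shipped type; the name Matrix4x4Rows is made up for illustration):

using System.Runtime.Intrinsics;

internal struct Matrix4x4Rows
{
    // each row lives in a single vector field, so whole-row loads/stores come for free
    public Vector128<float> Row1, Row2, Row3, Row4;
}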
@fiigii how would a single field access look then (e.g. in code like this)?
var m4x4prj = *(Matrix4x4*)(void*)&appleMatrix;
m4x4prj.M43 /= 2f;
m4x4prj.M33 = far / (far - near);
m4x4prj.M34 *= -1;
m4x4prj = Matrix4x4.Transpose(m4x4prj);
Camera.SetProjection(&m4x4prj.M11);
@fiigii, I don't believe we can take such a change because the various fields (M11 through M44) are publicly exposed; we cannot change the public field layout without breaking existing consumers.
This also prevents other changes that likely would have been generally good for these types, such as marking the structs as readonly.
I would hope that, instead, we could make the JIT smarter for these types of things. CC. @CarolEidt
Overall LGTM.
The perf improvements look great and show just how useful adding HWIntrinsic code paths can be (and how simple it can be, as compared to adding direct JIT support instead).
There were a couple of codegen issues this exposed that we can hopefully get resolved longer term.
@EgorBo, are there any other changes you were looking to make to this PR?
@tannergooding let me switch back to the Shuffle-based impl for Multiply 🙂
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
#if HAS_INTRINSICS
Instead of completely #if'ing this whole file, we should only include it when it is used. So, we can remove this #if, and in the .csproj add the file to <Compile> when $(TargetsNetCoreApp).
@eerhardt yeah, I was thinking about it, but I guess I have to declare a new property like $(IntrinsicsSupport) in the csproj, since we rely on the HAS_INTRINSICS symbol.
{
    internal static class VectorMath
    {
        public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
I've seen guidance on other PRs that putting [MethodImplAttribute(MethodImplOptions.AggressiveInlining)] on functions that take/return Vector128 is a good idea.
See briancylui/machinelearning#3 (comment) where we were seeing perf degradation when we refactored some "helper" vector methods.
It can help, but looking at the existing codegen it looks good/correct as is.
(that is, the method is small enough to be automatically inlined)
Yeah, I've just checked both outputs: both Lerp and Equals/NotEquals are inlined.
OK, thanks for checking. It's good to know that this refactoring didn't hurt perf.
Looks good @EgorBo. I just had 2 relatively minor comments. The rest looks great.
I've updated the benchmark results in the description with the latest changes (e.g. Shuffle made operator* 20% faster), the latest runtime, and added macOS.
New changes LGTM as well.
@eerhardt, I'll let you merge since I'll be OOF shortly
@EgorBo - thanks! This is great to see.
* Optimize some Matrix4x4 operations with SSE
* fix typo
* use SetZeroVector128 instead of SetAllVector128(0.0f)
* [StructLayout(LayoutKind.Sequential)] on top of Matrix4x4
* collapse operators (call corresponding methods)
* mark operators with [MethodImpl(MethodImplOptions.AggressiveInlining)]
* remove [MethodImpl(MethodImplOptions.AggressiveInlining)], call operators inside static methods if SSE is enabled (operators should be small enough to be inlined)
* overwrite value1 in Multiply instead of new instance
* Optimize == and !=
* prefer CompareNotEqual than CompareEqual
* fix typo in == and !=
* clean up methods-operators
* optimize Transpose
* simplify Equals
* improve Transpose
* Lerp as a separate method
* remove SSE from != as it fails some NaN-related tests
* remove unsafe from !=
* wrap intrinsics with #if netcoreapp
* forgot Lerp method and usings
* define netcoreapp in csproj
* rename netcoreapp symbol to HAS_INTRINSICS
* Move Equal and Lerp to VectorMath.cs internal helper
* Implement != operator
* Debug.Assert(Sse.IsSupported) in VectorMath
* replace SetAllVector with Shuffle in Multiply
* remove #if HAS_INTRINSICS from VectorMath
* fix indention in System.Numerics.Vectors.csproj

Commit migrated from dotnet/corefx@5ab0d37
This PR optimizes some Matrix4x4 operations with SSE (see https://github.com/dotnet/corefx/issues/31425). Some of the operations could also be optimized with AVX, but for some reason on my PC AVX performs worse than SSE (VZEROUPPER? or maybe the CPU decreases AVX frequency due to rapid benchmarking?).
Environment:
.NET Core SDK=3.0.100-alpha1-20180720-2
Windows 10: Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
macOS 10.13: Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
Matrix4x4.Add (Matrix4x4, Matrix4x4) and Subtract
Matrix4x4 result = matrix1 + matrix2;
Windows (Coffee Lake):
macOS (Haswell):
Matrix4x4.Lerp (Matrix4x4, Matrix4x4, float)
Matrix4x4 result = Matrix4x4.Lerp(matrix1, matrix2, amount);
Windows (Coffee Lake):
macOS (Haswell):
Matrix4x4.Multiply (Matrix4x4, Matrix4x4)
Matrix4x4 result = matrix1 * matrix2;
Windows (Coffee Lake):
macOS (Haswell):
Matrix4x4.Multiply (Matrix4x4, float)
Matrix4x4 result = matrix1 * scalar;
Windows (Coffee Lake):
macOS (Haswell):
Matrix4x4.Negate (Matrix4x4)
Matrix4x4 result = -matrix1;
Windows (Coffee Lake):
macOS (Haswell):
Matrix4x4.Equals (Matrix4x4)
bool result = matrix1 == matrix2;
Windows (Coffee Lake):
macOS (Haswell):
Matrix4x4.Transpose (Matrix4x4)
Matrix4x4 result = Matrix4x4.Transpose(matrix1);
Windows (Coffee Lake):
macOS (Haswell):
Benchmarks: https://gist.github.com/EgorBo/c80a25517245374c8dcdca2af4536ffe