dotnet/corefx (public archive). This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Optimize some Matrix4x4 operations with SSE #31779

Merged: 28 commits merged into dotnet:master on Aug 17, 2018

Conversation

@EgorBo (Member) commented Aug 15, 2018

This PR optimizes some Matrix4x4 operations with SSE (see https://github.com/dotnet/corefx/issues/31425). Some of the operations could also be optimized with AVX, but on my machine the AVX versions performed worse than the SSE ones for some reason (VZEROUPPER overhead? or maybe the CPU lowering its AVX frequency under sustained benchmarking?).

Environment:

.NET Core: .NET Core SDK=3.0.100-alpha1-20180720-2

Windows 10:
   Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

macOS 10.13:
   Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores

Matrix4x4.Add (Matrix4x4, Matrix4x4) and Subtract

Matrix4x4 result = matrix1 + matrix2;

Windows (Coffee Lake):

Method     Mean        Scaled
Add_old    13.353 ns   1.00
Add_new     4.486 ns   0.34

macOS (Haswell):

Method     Mean        Scaled
Add_old    15.347 ns   1.00
Add_new     7.473 ns   0.49

Matrix4x4.Lerp (Matrix4x4, Matrix4x4, float)

Matrix4x4 result = Matrix4x4.Lerp(matrix1, matrix2, amount);

Windows (Coffee Lake):

Method      Mean        Scaled
Lerp_old    15.286 ns   1.00
Lerp_new     5.365 ns   0.35

macOS (Haswell):

Method      Mean        Scaled
Lerp_old    17.047 ns   1.00
Lerp_new     7.657 ns   0.45

Matrix4x4.Multiply (Matrix4x4, Matrix4x4)

Matrix4x4 result = matrix1 * matrix2;

Windows (Coffee Lake):

Method          Mean        Scaled
Multiply_old    27.146 ns   1.00
Multiply_new     7.461 ns   0.27

macOS (Haswell):

Method          Mean        Scaled
Multiply_old    32.05 ns    1.00
Multiply_new    11.24 ns    0.35

Matrix4x4.Multiply (Matrix4x4, float)

Matrix4x4 result = matrix1 * scalar;

Windows (Coffee Lake):

Method                  Mean        Scaled
MultiplyByScalar_old    12.927 ns   1.00
MultiplyByScalar_new     3.284 ns   0.25

macOS (Haswell):

Method                  Mean        Scaled
MultiplyByScalar_old    14.334 ns   1.00
MultiplyByScalar_new     5.086 ns   0.35

Matrix4x4.Negate (Matrix4x4)

Matrix4x4 result = -matrix1;

Windows (Coffee Lake):

Method        Mean        Scaled
Negate_old    12.932 ns   1.00
Negate_new     3.187 ns   0.25

macOS (Haswell):

Method        Mean        Scaled
Negate_old    14.877 ns   1.00
Negate_new     5.201 ns   0.35

Matrix4x4.Equals (Matrix4x4)

bool result = matrix1 == matrix2;

Windows (Coffee Lake):

Method                 Mean
Equals_NotEqual_old    1.742 ns
Equals_NotEqual_new    1.581 ns
Equals_Equal_old       7.081 ns
Equals_Equal_new       2.960 ns

macOS (Haswell):

Method                 Mean
Equals_NotEqual_old    3.172 ns
Equals_NotEqual_new    3.022 ns
Equals_Equal_old       8.180 ns
Equals_Equal_new       4.618 ns

Matrix4x4.Transpose (Matrix4x4)

Matrix4x4 result = Matrix4x4.Transpose(matrix1);

Windows (Coffee Lake):

Method           Mean        Scaled
Transpose_old    12.720 ns   1.00
Transpose_new     3.156 ns   0.25

macOS (Haswell):

Method           Mean        Scaled
Transpose_old    14.297 ns   1.00
Transpose_new     5.548 ns   0.39

Benchmarks: https://gist.github.com/EgorBo/c80a25517245374c8dcdca2af4536ffe

@tannergooding (Member):
Possibly outside the scope of this PR, but it would be nice if we had some perf tests for Matrix4x4 (and the other types) in: https://github.com/dotnet/corefx/tree/master/src/System.Numerics.Vectors/tests/Performance

@tannergooding (Member):
CC. @eerhardt, @ViktorHofer

Also CC. @benaadams, who has looked at some of this before.

result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
var m1Row = Sse.LoadVector128(&matrix1.M11);
var m2Row = Sse.LoadVector128(&matrix2.M11);
Sse.Store(&matrix1.M11, Sse.Add(m1Row, Sse.Multiply(Sse.Subtract(m2Row, m1Row), amountVec)));
Member:

I mentioned this somewhat here, but it would be nice if we had some of these functions extracted to a mini-helper library.

These lines are basically a VectorLerp function, which would allow something like:

Sse.Store(&matrix1.M11, VectorMath.Lerp(&matrix1.M11, &matrix2.M11, amountVec));
Sse.Store(&matrix1.M21, VectorMath.Lerp(&matrix1.M21, &matrix2.M21, amountVec));
Sse.Store(&matrix1.M31, VectorMath.Lerp(&matrix1.M31, &matrix2.M31, amountVec));
Sse.Store(&matrix1.M41, VectorMath.Lerp(&matrix1.M41, &matrix2.M41, amountVec));

or something similar (and would also allow reuse elsewhere, such as in Vector4.Lerp)
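
A minimal sketch of such a helper, assuming the preview System.Runtime.Intrinsics.X86.Sse surface used elsewhere in this PR (the helper name and shape mirror the suggestion; the code actually added later in the PR may differ):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class VectorMath
{
    // Element-wise a + (b - a) * t across the four floats of a row.
    public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
        => Sse.Add(a, Sse.Multiply(Sse.Subtract(b, a), t));
}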

Member:

Same for all the operations below.

result.M44 = -value.M44;
if (Sse.IsSupported)
{
var zero = Sse.SetAllVector128(0f);
Member:

Please use SetZeroVector128 instead

result.M12 = matrix1.M12 + (matrix2.M12 - matrix1.M12) * amount;
result.M13 = matrix1.M13 + (matrix2.M13 - matrix1.M13) * amount;
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
var m1Row = Sse.LoadVector128(&matrix1.M11);
Member:

Do we need to have [StructLayout(LayoutKind.Sequential)] on the Matrix4x4 type? That way we are assured the members will be laid out in the assumed order?

Member:

The Roslyn Compiler emits StructLayout=Sequential by default for struct types. However, last I checked there wasn't anything enforcing this behavior in the language spec.

It would probably be a good idea to explicitly mark the numeric types as Sequential (as we do with types like Single and Int32).
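
For illustration, a sketch of what explicitly marking the layout would look like (only the fields are shown; the real Matrix4x4 declaration also has constructors, operators, and methods):

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Matrix4x4
{
    public float M11, M12, M13, M14;
    public float M21, M22, M23, M24;
    public float M31, M32, M33, M34;
    public float M41, M42, M43, M44;

    // ... constructors, operators, and methods omitted ...
}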

Matrix4x4 m = default;

Sse.Store(&m.M11,
Sse.Add(Sse.Add(Sse.Multiply(Sse.SetAllVector128(value1.M11), Sse.LoadVector128(&value2.M11)),
@tannergooding (Member), Aug 15, 2018:

Rather than calling SetAllVector128 multiple times, it is probably better to do a load row and permute (broadcast or permute in AVX, shuffle in SSE)
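
A sketch of that suggestion for the first output row (assuming an unsafe context where value1 and value2 are Matrix4x4 parameters and result is a local; Sse.Shuffle takes an immediate control byte, so 0x00 replicates element 0, 0x55 element 1, 0xAA element 2, and 0xFF element 3):

// Load row 1 of value1 once, then broadcast each of its elements to all four lanes.
Vector128<float> row = Sse.LoadVector128(&value1.M11);
Vector128<float> b11 = Sse.Shuffle(row, row, 0x00); // { M11, M11, M11, M11 }
Vector128<float> b12 = Sse.Shuffle(row, row, 0x55); // { M12, M12, M12, M12 }
Vector128<float> b13 = Sse.Shuffle(row, row, 0xAA); // { M13, M13, M13, M13 }
Vector128<float> b14 = Sse.Shuffle(row, row, 0xFF); // { M14, M14, M14, M14 }

// result.row1 = M11 * value2.row1 + M12 * value2.row2 + M13 * value2.row3 + M14 * value2.row4
Sse.Store(&result.M11,
    Sse.Add(
        Sse.Add(Sse.Multiply(b11, Sse.LoadVector128(&value2.M11)),
                Sse.Multiply(b12, Sse.LoadVector128(&value2.M21))),
        Sse.Add(Sse.Multiply(b13, Sse.LoadVector128(&value2.M31)),
                Sse.Multiply(b14, Sse.LoadVector128(&value2.M41)))));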

Member Author:

btw, it'd be nice to expose _MM_SHUFFLE macros to C# for Shuffle.
e.g. static byte _MM_SHUFFLE(byte fp3, byte fp2, byte fp1, byte fp0) => (byte)(((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)));

Member:

I've raised it before. It should go to API review soon

Member:

Agreed. @tannergooding and I had the same conclusion during dotnet/machinelearning#562. However, I didn't see an issue open for this in either corefx or coreclr. @tannergooding did you ever open one?

Here is what we proposed at the time:

 `Sse.Shuffle(op1, op2, Sse.GetShuffleControl(0, 3, 1, 2))` 
Something like that. `GetShuffleControl`, `BuildShuffleControl`, or something

Member Author:

So I've tried to use shuffle here but it became slower, maybe I did something wrong?
(screenshot of benchmark results omitted)

Member:

I'll take a look at the assembly and see if I can spot what is going wrong....


Member Author:

@tannergooding any update on _MM_SHUFFLE API? Could not find an issue for it

m.M44 = -value.M44;

return m;
if (Sse.IsSupported)
Member:

Why is this not just calling the Negate method (or vice-versa)?

Member:

Same for the operators below

@EgorBo (Member Author), Aug 15, 2018:

will Negate be inlined?

Member:

I believe the JIT will inline it so that the inner call is called directly, rather than trying to inline the inner call into the outer call.
If it doesn't, then this sounds like a good JIT bug to log.

@tannergooding (Member):

Also CC. @CarolEidt

value1.M31 != value2.M31 || value1.M32 != value2.M32 || value1.M33 != value2.M33 || value1.M34 != value2.M34 ||
value1.M41 != value2.M41 || value1.M42 != value2.M42 || value1.M43 != value2.M43 || value1.M44 != value2.M44);
}
public static bool operator !=(Matrix4x4 value1, Matrix4x4 value2) => !value1.Equals(value2);
Member Author:

I am not sure about this one, will check the jit output

}

/// <summary>
/// Returns a new matrix with the negated elements of the given matrix.
/// </summary>
/// <param name="value">The source matrix.</param>
/// <returns>The negated matrix.</returns>
public static Matrix4x4 Negate(Matrix4x4 value)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member:

I'm not sure about this. The code is probably small enough to be inlined in the hardware accelerated case, but not necessarily in the software case.

Member:

Same goes for other new AggressiveInlining attributes

Member Author:

yeah, but without that it basically adds a call to the software Negate operator. Are you suggesting duplicating the software implementation in both the - operator and Negate, i.e. something like if (Sse.IsSupported) { return -value; } else { /* duplicate software impl */ }?

Member:

What is the codegen you are seeing here?

Member Author:

@tannergooding Negate is not inlined 🙁 https://gist.github.com/EgorBo/a6cc52d5523d7a5fcf6fb1adfa5dfbc0
Negate_new is a benchmark method that does return Matrix4x4X.Negate(m1x);

Member Author:

um.. I expected Negate_new() to just call op_UnaryNegation directly
(screenshot omitted)

Member:

OK, I used jitutils to get PMI diffs, with public static unsafe Matrix4x4 Negate(Matrix4x4 value) => -value; and public static unsafe Matrix4x4 operator -(Matrix4x4 value) containing the actual implementation...

The following two tests (aside from not being "correct"):

[Fact]
public void Matrix4x4NegateTest()
{
    Matrix4x4 m = GenerateMatrixNumberFrom1To16();
    var actual = Matrix4x4.Negate(m);
    Assert.Equal(m, actual);
}

[Fact]
public void Matrix4x4op_NegateTest()
{
    Matrix4x4 m = GenerateMatrixNumberFrom1To16();
    var actual = -m;
    Assert.Equal(m, actual);
}

Both generate a direct call to Matrix4x4:op_UnaryNegation(struct):struct:

; Assembly listing for method Matrix4x4Tests:Matrix4x4NegateTest():this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;  V01 loc0         [V01    ] (  3,  3   )  struct (64) [rsp+0x1A8]   do-not-enreg[XSB] must-init addr-exposed
;  V02 loc1         [V02    ] (  2,  2   )  struct (64) [rsp+0x168]   do-not-enreg[XSB] must-init addr-exposed
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;  V04 tmp1         [V04,T01] (  2,  4   )  struct (64) [rsp+0x128]   do-not-enreg[SB]
;  V05 tmp2         [V05,T02] (  2,  4   )  struct (64) [rsp+0xE8]   do-not-enreg[SB]
;  V06 tmp3         [V06,T03] (  2,  4   )  struct (64) [rsp+0xA8]   do-not-enreg[SB]
;  V07 tmp4         [V07,T00] (  3,  6   )     ref  ->  rsi         class-hnd exact
;  V08 tmp5         [V08    ] (  4,  8   )  struct (64) [rsp+0x68]   do-not-enreg[XSB] addr-exposed
;  V09 tmp6         [V09,T04] (  2,  4   )    long  ->  rcx        
;  V10 tmp7         [V10    ] (  2,  4   )  struct (64) [rsp+0x28]   do-not-enreg[XSB] addr-exposed
;
; Lcl frame size = 488
G_M53037_IG01:
       push     rdi
       push     rsi
       sub      rsp, 488
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+168H]
       mov      ecx, 32
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
G_M53037_IG02:
       lea      rcx, bword ptr [rsp+1A8H]
       call     Matrix4x4Tests:GenerateMatrixNumberFrom1To16():struct
       vmovdqu  xmm0, qword ptr [rsp+1A8H]
       vmovdqu  qword ptr [rsp+128H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1B8H]
       vmovdqu  qword ptr [rsp+138H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1C8H]
       vmovdqu  qword ptr [rsp+148H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1D8H]
       vmovdqu  qword ptr [rsp+158H], xmm0
       lea      rcx, bword ptr [rsp+168H]
       vmovdqu  xmm0, qword ptr [rsp+128H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+138H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+148H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+158H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       lea      rdx, bword ptr [rsp+68H]
       call     Matrix4x4:op_UnaryNegation(struct):struct
       vmovdqu  xmm0, qword ptr [rsp+1A8H]
       vmovdqu  qword ptr [rsp+E8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1B8H]
       vmovdqu  qword ptr [rsp+F8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1C8H]
       vmovdqu  qword ptr [rsp+108H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1D8H]
       vmovdqu  qword ptr [rsp+118H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+168H]
       vmovdqu  qword ptr [rsp+A8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+178H]
       vmovdqu  qword ptr [rsp+B8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+188H]
       vmovdqu  qword ptr [rsp+C8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+198H]
       vmovdqu  qword ptr [rsp+D8H], xmm0
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      rcx, rsi
       xor      rdx, rdx
       call     AssertEqualityComparer`1:.ctor(ref):this
       vmovdqu  xmm0, qword ptr [rsp+E8H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+F8H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+108H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+118H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+A8H]
       vmovdqu  qword ptr [rsp+28H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+B8H]
       vmovdqu  qword ptr [rsp+38H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+C8H]
       vmovdqu  qword ptr [rsp+48H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+D8H]
       vmovdqu  qword ptr [rsp+58H], xmm0
G_M53037_IG03:
       lea      rcx, bword ptr [rsp+68H]
       lea      rdx, bword ptr [rsp+28H]
       mov      r8, rsi
       call     Assert:Equal(struct,struct,ref)
       nop      
G_M53037_IG04:
       add      rsp, 488
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 579, prolog size 35 for method Matrix4x4Tests:Matrix4x4NegateTest():this
; ============================================================
Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x09
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x09 UnwindOp: UWOP_ALLOC_LARGE (1)     OpInfo: 0 - Scaled small  
      Size: 61 * 8 = 488 = 0x001E8
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)

and

; Assembly listing for method Matrix4x4Tests:Matrix4x4op_NegateTest():this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;  V01 loc0         [V01    ] (  3,  3   )  struct (64) [rsp+0x168]   do-not-enreg[XSB] must-init addr-exposed
;  V02 loc1         [V02    ] (  2,  2   )  struct (64) [rsp+0x128]   do-not-enreg[XSB] must-init addr-exposed
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;  V04 tmp1         [V04,T01] (  2,  4   )  struct (64) [rsp+0xE8]   do-not-enreg[SB]
;  V05 tmp2         [V05,T02] (  2,  4   )  struct (64) [rsp+0xA8]   do-not-enreg[SB]
;  V06 tmp3         [V06,T00] (  3,  6   )     ref  ->  rsi         class-hnd exact
;  V07 tmp4         [V07    ] (  4,  8   )  struct (64) [rsp+0x68]   do-not-enreg[XSB] addr-exposed
;  V08 tmp5         [V08,T03] (  2,  4   )    long  ->  rcx        
;  V09 tmp6         [V09    ] (  2,  4   )  struct (64) [rsp+0x28]   do-not-enreg[XSB] addr-exposed
;
; Lcl frame size = 424
G_M27564_IG01:
       push     rdi
       push     rsi
       sub      rsp, 424
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+128H]
       mov      ecx, 32
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
G_M27564_IG02:
       lea      rcx, bword ptr [rsp+168H]
       call     Matrix4x4Tests:GenerateMatrixNumberFrom1To16():struct
       lea      rcx, bword ptr [rsp+128H]
       vmovdqu  xmm0, qword ptr [rsp+168H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+178H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+188H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+198H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       lea      rdx, bword ptr [rsp+68H]
       call     Matrix4x4:op_UnaryNegation(struct):struct
       vmovdqu  xmm0, qword ptr [rsp+168H]
       vmovdqu  qword ptr [rsp+E8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+178H]
       vmovdqu  qword ptr [rsp+F8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+188H]
       vmovdqu  qword ptr [rsp+108H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+198H]
       vmovdqu  qword ptr [rsp+118H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+128H]
       vmovdqu  qword ptr [rsp+A8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+138H]
       vmovdqu  qword ptr [rsp+B8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+148H]
       vmovdqu  qword ptr [rsp+C8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+158H]
       vmovdqu  qword ptr [rsp+D8H], xmm0
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      rcx, rsi
       xor      rdx, rdx
       call     AssertEqualityComparer`1:.ctor(ref):this
       vmovdqu  xmm0, qword ptr [rsp+E8H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+F8H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+108H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+118H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+A8H]
       vmovdqu  qword ptr [rsp+28H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+B8H]
       vmovdqu  qword ptr [rsp+38H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+C8H]
       vmovdqu  qword ptr [rsp+48H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+D8H]
       vmovdqu  qword ptr [rsp+58H], xmm0
       lea      rcx, bword ptr [rsp+68H]
       lea      rdx, bword ptr [rsp+28H]
       mov      r8, rsi
       call     Assert:Equal(struct,struct,ref)
       nop      
G_M27564_IG03:
       add      rsp, 424
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 499, prolog size 35 for method Matrix4x4Tests:Matrix4x4op_NegateTest():this
; ============================================================
Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x09
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x09 UnwindOp: UWOP_ALLOC_LARGE (1)     OpInfo: 0 - Scaled small  
      Size: 53 * 8 = 424 = 0x001A8
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)

Member:

Which is exactly what I expected.

@EgorBo, could you fix them up accordingly, since the JIT already does the right thing.

Member Author:

@tannergooding um.. so how should I fix it without regressing the software fallback (will it be inlined into Negate if SSE is not available?)? Isn't the current implementation what you meant:

if (Sse.IsSupported)
{
return -value;
}
else
{
Matrix4x4 result;
result.M11 = -value.M11;
result.M12 = -value.M12;
result.M13 = -value.M13;
result.M14 = -value.M14;
result.M21 = -value.M21;
result.M22 = -value.M22;
result.M23 = -value.M23;
result.M24 = -value.M24;
result.M31 = -value.M31;
result.M32 = -value.M32;
result.M33 = -value.M33;
result.M34 = -value.M34;
result.M41 = -value.M41;
result.M42 = -value.M42;
result.M43 = -value.M43;
result.M44 = -value.M44;
return result;
}

Member:

I'm not sure what you mean, in both cases public static unsafe Matrix4x4 Negate(Matrix4x4 value) => -value; will result in a call directly to op_Negate.

The JIT will then determine if op_Negate is appropriate to inline or not (currently it does not). It will result in code that is simpler to maintain and produces the same codegen as is being produced today (for Release builds)


return m;
}
public static Matrix4x4 operator -(Matrix4x4 value) => Negate(value);
Member:

Nit: I think the normal case would be to have the implementation in the operator and have Negate call it (more people likely use the operator than call the "friendly name")

Member Author:

Makes sense!
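
A sketch of that arrangement for Negate, assuming the preview SetZeroVector128 API requested earlier in this thread (the scalar fallback is written as a pointer loop here purely for brevity; the real code negates the sixteen fields explicitly):

public static unsafe Matrix4x4 operator -(Matrix4x4 value)
{
    if (Sse.IsSupported)
    {
        // Negate each row by subtracting it from zero, writing back over the by-value copy.
        Vector128<float> zero = Sse.SetZeroVector128();
        Sse.Store(&value.M11, Sse.Subtract(zero, Sse.LoadVector128(&value.M11)));
        Sse.Store(&value.M21, Sse.Subtract(zero, Sse.LoadVector128(&value.M21)));
        Sse.Store(&value.M31, Sse.Subtract(zero, Sse.LoadVector128(&value.M31)));
        Sse.Store(&value.M41, Sse.Subtract(zero, Sse.LoadVector128(&value.M41)));
        return value;
    }

    // Software fallback: negate all sixteen sequential float fields.
    float* p = &value.M11;
    for (int i = 0; i < 16; i++)
    {
        p[i] = -p[i];
    }
    return value;
}

// The friendly-named method then simply forwards to the operator:
public static Matrix4x4 Negate(Matrix4x4 value) => -value;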

result.M12 = matrix1.M12 + (matrix2.M12 - matrix1.M12) * amount;
result.M13 = matrix1.M13 + (matrix2.M13 - matrix1.M13) * amount;
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
if (Sse.IsSupported)
@danmoseley (Member), Aug 15, 2018:

General question, more for @tannergooding: what is our test strategy for the software fallback code? I guess Sse.IsSupported is always true for us on x86/x64, so the test strategy is that the fallback is exercised when the tests run on ARM?

If we later have a codepath for ARM intrinsics also, we will need a new strategy.

Perhaps there's a way to force the runtime to lie and return false for this.

Member Author:

btw, initially I wanted to do if (Sse.IsSupported) {} else if (Arm.Simd.IsSupported) {} else {} 🙂

Member:

Setting COMPlus_FeatureSIMD=0 should cover the Sse-Avx2 HWIntrinsics. I don't think we have a switch to cover all HWIntrinsics, however.

CC. @CarolEidt

Member:

The ARM intrinsics still need to be reviewed and implemented, before they can be used.

if (Sse.IsSupported)
{
return
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) != 0xF ||
Member:

MoveMask(CompareNotEqual()) != 0 is probably more efficient

Member Author:

Just tested, you are right, it's 30% more efficient. Also, I tried adding some fast-out paths (comparing the first field or the first row with a simple m1.M11 == m2.M11 && ...), but it gave a minor improvement in some cases and a major regression in others.

@@ -1824,26 +1852,33 @@ public static Matrix4x4 Lerp(Matrix4x4 matrix1, Matrix4x4 matrix2, float amount)
/// <returns>The negated matrix.</returns>
public static Matrix4x4 Negate(Matrix4x4 value)
Member:

Not sure if you saw, since the thread is now "hidden" by GitHub, but I commented here (with disassembly using current head of CoreCLR) that the JIT does the right thing for inlining public static Matrix4x4 Negate(Matrix4x4 value) => -value

Member Author:

yeah it's hard to follow when github collapses them 🙂

Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) != 0xF ||
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) != 0xF ||
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) != 0xF;
Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) == 0 ||
Member:

Pretty sure this one should be != 0. CompareNotEqual does (a != b) ? 0xFFFFFFFF : 0, so it will be non-zero if one of the rows in value1 != the corresponding row in value2.

Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) == 0xF &&
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) == 0xF &&
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) == 0xF;
Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) != 0 &&
Member:

Pretty sure this one should be == 0. CompareNotEqual does (a != b) ? 0xFFFFFFFF : 0, so it will be zero if all of the rows in value1 == the corresponding rows in value2.
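
Putting those two corrections together, the hardware path of the equality check would look roughly like this (a sketch, inside an unsafe method; every row must produce an all-zero CompareNotEqual mask for the matrices to be equal):

bool equal =
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) == 0;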

@@ -2346,9 +2181,9 @@ public bool Equals(Matrix4x4 other)
/// <returns>True if the Object is equal to this matrix; False otherwise.</returns>
public override bool Equals(object obj)
{
if (obj is Matrix4x4)
if (obj is Matrix4x4 m)
Member:

Nit: You could do this in one line with (obj is Matrix4x4 other) && (this == other)
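
i.e., a sketch of the suggested one-liner:

public override bool Equals(object obj) => (obj is Matrix4x4 other) && (this == other);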

Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34));
Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12));

return matrix;
Member:

We also end up with a "sub-optimal" return block:

       vmovdqu  xmm0, qword ptr [rdx]
       vmovdqu  qword ptr [rcx], xmm0
       vmovdqu  xmm0, qword ptr [rdx+16]
       vmovdqu  qword ptr [rcx+16], xmm0
       vmovdqu  xmm0, qword ptr [rdx+32]
       vmovdqu  qword ptr [rcx+32], xmm0
       vmovdqu  xmm0, qword ptr [rdx+48]
       vmovdqu  qword ptr [rcx+48], xmm0

I would hope we could be smart enough to have the above stores go directly to the return buffer...
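
(For context, the h12/h34 values in the quoted diff come from the standard SSE 4x4 float transpose built on UnpackLow/UnpackHigh plus MoveLowToHigh/MoveHighToLow; a sketch of the whole kernel, assuming an unsafe in-place transpose over a local matrix and not necessarily the exact PR code:)

Vector128<float> row1 = Sse.LoadVector128(&matrix.M11);
Vector128<float> row2 = Sse.LoadVector128(&matrix.M21);
Vector128<float> row3 = Sse.LoadVector128(&matrix.M31);
Vector128<float> row4 = Sse.LoadVector128(&matrix.M41);

Vector128<float> l12 = Sse.UnpackLow(row1, row2);   // { M11, M21, M12, M22 }
Vector128<float> l34 = Sse.UnpackLow(row3, row4);   // { M31, M41, M32, M42 }
Vector128<float> h12 = Sse.UnpackHigh(row1, row2);  // { M13, M23, M14, M24 }
Vector128<float> h34 = Sse.UnpackHigh(row3, row4);  // { M33, M43, M34, M44 }

Sse.Store(&matrix.M11, Sse.MoveLowToHigh(l12, l34)); // { M11, M21, M31, M41 }
Sse.Store(&matrix.M21, Sse.MoveHighToLow(l34, l12)); // { M12, M22, M32, M42 }
Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34)); // { M13, M23, M33, M43 }
Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12)); // { M14, M24, M34, M44 }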


{
public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
{
if (Sse.IsSupported)
Member:

It's probably sufficient to just Debug.Assert(Sse.IsSupported), since this is internal and we have already checked the code higher up.
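
Applied to the helper sketched earlier, that suggestion would look roughly like this (a sketch; Debug.Assert documents the invariant in Debug builds and compiles away in Release):

using System.Diagnostics;

public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
{
    // Callers are expected to have already taken the Sse.IsSupported path.
    Debug.Assert(Sse.IsSupported);
    return Sse.Add(a, Sse.Multiply(Sse.Subtract(b, a), t));
}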

@fiigii (Contributor) commented Aug 17, 2018:

IMO, the current implementation has a very inefficient code pattern: it loads the fields of Matrix4x4 at the beginning of a function and stores the results back at the end. Each operation can generate many memory accesses, so scenarios that heavily use Matrix4x4 will possibly become memory-bound.

So, the more efficient implementation would be to define Matrix4x4 with 4 Vector128<float> fields, and have the operations read the Vector128<float> fields directly and then return a new Matrix4x4(...). Because of RyuJIT's struct promotion optimization and inlining, all the loads of a local struct's fields and the "new" of the result would be eliminated (replaced by direct SIMD register operations). In my experience, that makes some scenarios much faster and more CPU-bound.

cc @tannergooding @CarolEidt
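
A sketch of the layout being described, to make the suggestion concrete (hypothetical; as discussed below, this change could not actually be taken because the sixteen float fields are public API):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical alternative: four row vectors instead of sixteen float fields,
// so whole rows can stay in SIMD registers via struct promotion and inlining.
public struct Matrix4x4
{
    private Vector128<float> _row1, _row2, _row3, _row4;

    public static Matrix4x4 Add(Matrix4x4 a, Matrix4x4 b)
    {
        Matrix4x4 result = default;
        result._row1 = Sse.Add(a._row1, b._row1);
        result._row2 = Sse.Add(a._row2, b._row2);
        result._row3 = Sse.Add(a._row3, b._row3);
        result._row4 = Sse.Add(a._row4, b._row4);
        return result;
    }
}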

@EgorBo (Member Author) commented Aug 17, 2018:

@fiigii how would access to a single field (e.g. M11) look?
UPD: a small prototype: https://gist.github.com/EgorBo/658bf41aad52af45866272037d1973af
Will it still be interop-friendly?
E.g. in my pet project I have:

var m4x4prj = *(Matrix4x4*)(void*)&appleMatrix;
m4x4prj.M43 /= 2f;
m4x4prj.M33 = far / (far - near);
m4x4prj.M34 *= -1;
m4x4prj = Matrix4x4.Transpose(m4x4prj);

Camera.SetProjection(&m4x4prj.M11);

@tannergooding (Member):

@fiigii, I don't believe we can take such a change because the various System.Numerics.Vector types were shipped with their fields publicly exposed.

We cannot:

  • Remove the existing fields, as it is a major binary breaking change
  • Add new fields with explicit layout, as it would
    • Break any existing code using these as interop types (generics can't be marshaled today)
    • Likely regress perf in other areas, due to JIT limitations with explicit layout structs

This also prevents other changes that likely would have been generally good for these types, such as marking the structs as readonly, but it is the state of the world and we will need to work with it 😄

@tannergooding (Member) commented Aug 17, 2018:

I would hope that, instead, we could make the JIT smarter for these types of things. Having __vectorcall support, for example, would allow the values to be passed around in register and would elide many of the memory concerns (https://github.com/dotnet/coreclr/issues/12120).

CC. @CarolEidt

@tannergooding (Member) left a review comment:

Overall LGTM.

The perf improvements look great and show just how useful adding HWIntrinsic code paths can be (and how simple it can be, as compared to adding direct JIT support instead).

There were a couple codegen issues this exposed, that we can hopefully get resolved longer term.

@tannergooding (Member):

@EgorBo, are there any other changes you were looking to make to this PR?

@EgorBo (Member Author) commented Aug 17, 2018:

@tannergooding let me switch back to Shuffle-based impl for Multiply 🙂

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
#if HAS_INTRINSICS
Member:

Instead of completely #if on this whole file, we should only include it when it is used.

So, we can remove this #if, and in the .csproj add it to <Compile> when $(TargetsNetCoreApp).

Member Author:

@eerhardt yeah, I was thinking about it, but I guess I have to declare a new property (something like $(IntrinsicsSupport)) in the csproj, since we rely on the HAS_INTRINSICS symbol.

{
internal static class VectorMath
{
public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
Member:

I've seen guidance on other PRs that putting [MethodImplAttribute(MethodImplOptions.AggressiveInlining)] on functions that take/return Vector128 is a good idea.

See briancylui/machinelearning#3 (comment) where we were seeing perf degradation when we refactored some "helper" vector methods.

Member:

It can help, but looking at the existing codegen it looks good/correct as is.

Member:

(that is, the method is small enough to be automatically inlined)

Member Author:

Yeah, I've just checked both outputs: both Lerp and Equals/NotEquals are inlined.

Member:

OK, thanks for checking. It's good to know that this refactoring didn't hurt perf.

@eerhardt (Member) left a review comment:

Looks good @EgorBo. I just had 2 relatively minor comments. The rest looks great.

@EgorBo (Member Author) commented Aug 17, 2018:

I've updated the benchmark results in the description with the latest changes (e.g. Shuffle made operator* 20% faster) and the latest runtime, and added macOS numbers.
The macOS numbers are a little worse because most of the intrinsics used have better throughput on Coffee Lake (e.g. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_ps&expand=127).

@tannergooding (Member) left a review comment:

New changes LGTM as well.

@eerhardt, I'll let you merge since I'll be OOF shortly

@eerhardt eerhardt merged commit 5ab0d37 into dotnet:master Aug 17, 2018
@CarolEidt:

@EgorBo - thanks! This is great to see.
@tannergooding @eerhardt - thanks for getting this reviewed and merged. I've commented on a couple of the JIT issues that were raised. Let me know if there are others.

@karelz karelz added this to the 3.0 milestone Aug 21, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Optimize some Matrix4x4 operations with SSE

* fix typo

* use SetZeroVector128 instead of SetAllVector128(0.0f)

* [StructLayout(LayoutKind.Sequential)] on top of Matrix4x4

* collapse operators (call corresponding methods)

* mark operators with [MethodImpl(MethodImplOptions.AggressiveInlining)]

* remove [MethodImpl(MethodImplOptions.AggressiveInlining)], call operators inside static methods if SSE is enabled (operators should be small enough to be inlined)

* overwrite value1 in Multiply instead of new instance

* Optimize == and !=

* prefer CompareNotEqual than CompareEqual

* fix typo in == and !=

* clean up methods-operators

* optimize Transpose

* simplify Equals

* improve Transpose

* Lerp as a separate method

* remove SSE from != as it fails some NaN-related tests

* remove unsafe from !=

* wrap intrinsics with #if netcoreapp

* forgot Letp method and usings

* define netcoreapp in csproj

* rename netcoreapp symbol to HAS_INTRINSICS

* Move Equal and Lerp to VectorMath.cs internal helper

* Implement != operator

* Debug.Assert(Sse.IsSupported) in VectorMath

* replace SetAllVector with Shuffle in Multiply

* remove #if HAS_INTRINSICS from VectorMath

* fix indention in System.Numerics.Vectors.csproj


Commit migrated from dotnet/corefx@5ab0d37