dotnet/corefx (public archive). This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Optimize some Matrix4x4 operations with SSE #31779

Merged: 28 commits merged into dotnet:master on Aug 17, 2018

Conversation

@EgorBo (Member) commented Aug 15, 2018

This PR optimizes some Matrix4x4 operations with SSE (see https://github.com/dotnet/corefx/issues/31425). Some of the operations could also be optimized with AVX, but on my machine the AVX versions performed worse than the SSE ones for some reason (VZEROUPPER overhead? or maybe the CPU lowering its AVX frequency under sustained benchmarking?).

Environment:

.NET Core: .NET Core SDK=3.0.100-alpha1-20180720-2

Windows 10:
   Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

macOS 10.13:
   Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores

Matrix4x4.Add (Matrix4x4, Matrix4x4) and Subtract

Matrix4x4 result = matrix1 + matrix2;

Windows (Coffee Lake):

Method     Mean        Scaled
Add_old    13.353 ns   1.00
Add_new     4.486 ns   0.34

macOS (Haswell):

Method     Mean        Scaled
Add_old    15.347 ns   1.00
Add_new     7.473 ns   0.49

Matrix4x4.Lerp (Matrix4x4, Matrix4x4, float)

Matrix4x4 result = Matrix4x4.Lerp(matrix1, matrix2, amount);

Windows (Coffee Lake):

Method      Mean        Scaled
Lerp_old    15.286 ns   1.00
Lerp_new     5.365 ns   0.35

macOS (Haswell):

Method      Mean        Scaled
Lerp_old    17.047 ns   1.00
Lerp_new     7.657 ns   0.45

Matrix4x4.Multiply (Matrix4x4, Matrix4x4)

Matrix4x4 result = matrix1 * matrix2;

Windows (Coffee Lake):

Method          Mean        Scaled
Multiply_old    27.146 ns   1.00
Multiply_new     7.461 ns   0.27

macOS (Haswell):

Method          Mean        Scaled
Multiply_old    32.05 ns    1.00
Multiply_new    11.24 ns    0.35

Matrix4x4.Multiply (Matrix4x4, float)

Matrix4x4 result = matrix1 * scalar;

Windows (Coffee Lake):

Method                  Mean        Scaled
MultiplyByScalar_old    12.927 ns   1.00
MultiplyByScalar_new     3.284 ns   0.25

macOS (Haswell):

Method                  Mean        Scaled
MultiplyByScalar_old    14.334 ns   1.00
MultiplyByScalar_new     5.086 ns   0.35

Matrix4x4.Negate (Matrix4x4)

Matrix4x4 result = -matrix1;

Windows (Coffee Lake):

Method        Mean        Scaled
Negate_old    12.932 ns   1.00
Negate_new     3.187 ns   0.25

macOS (Haswell):

Method        Mean        Scaled
Negate_old    14.877 ns   1.00
Negate_new     5.201 ns   0.35

Matrix4x4.Equals (Matrix4x4)

bool result = matrix1 == matrix2;

Windows (Coffee Lake):

Method                 Mean
Equals_NotEqual_old    1.742 ns
Equals_NotEqual_new    1.581 ns
Equals_Equal_old       7.081 ns
Equals_Equal_new       2.960 ns

macOS (Haswell):

Method                 Mean
Equals_NotEqual_old    3.172 ns
Equals_NotEqual_new    3.022 ns
Equals_Equal_old       8.180 ns
Equals_Equal_new       4.618 ns

Matrix4x4.Transpose (Matrix4x4)

Matrix4x4 result = Matrix4x4.Transpose(matrix1);

Windows (Coffee Lake):

Method           Mean        Scaled
Transpose_old    12.720 ns   1.00
Transpose_new     3.156 ns   0.25

macOS (Haswell):

Method           Mean        Scaled
Transpose_old    14.297 ns   1.00
Transpose_new     5.548 ns   0.39

Benchmarks: https://gist.github.com/EgorBo/c80a25517245374c8dcdca2af4536ffe

@tannergooding (Member):
Possibly outside the scope of this PR, but it would be nice if we had some perf tests for Matrix4x4 (and the other types) in: https://github.com/dotnet/corefx/tree/master/src/System.Numerics.Vectors/tests/Performance

@tannergooding (Member):
CC. @eerhardt, @ViktorHofer

Also CC. @benaadams, who has looked at some of this before.

result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
var m1Row = Sse.LoadVector128(&matrix1.M11);
var m2Row = Sse.LoadVector128(&matrix2.M11);
Sse.Store(&matrix1.M11, Sse.Add(m1Row, Sse.Multiply(Sse.Subtract(m2Row, m1Row), amountVec)));
Member:

I mentioned this somewhat here, but it would be nice if we had some of these functions extracted to a mini-helper library.

These lines are basically a VectorLerp function, which would allow something like:

Sse.Store(&matrix1.M11, VectorMath.Lerp(&matrix1.M11, &matrix2.M11, amountVec));
Sse.Store(&matrix1.M21, VectorMath.Lerp(&matrix1.M21, &matrix2.M21, amountVec));
Sse.Store(&matrix1.M31, VectorMath.Lerp(&matrix1.M31, &matrix2.M31, amountVec));
Sse.Store(&matrix1.M41, VectorMath.Lerp(&matrix1.M41, &matrix2.M41, amountVec));

or something similar (and would also allow reuse elsewhere, such as in Vector4.Lerp)
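
A minimal sketch of such a helper, assuming the preview System.Runtime.Intrinsics.X86.Sse surface used elsewhere in this PR (the helper name and shape mirror the suggestion; the code actually added later in the PR may differ):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class VectorMath
{
    // Element-wise a + (b - a) * t across the four floats of a row.
    public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
        => Sse.Add(a, Sse.Multiply(Sse.Subtract(b, a), t));
}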

Member:

Same for all the operations below.

result.M44 = -value.M44;
if (Sse.IsSupported)
{
var zero = Sse.SetAllVector128(0f);
Member:

Please use SetZeroVector128 instead

result.M12 = matrix1.M12 + (matrix2.M12 - matrix1.M12) * amount;
result.M13 = matrix1.M13 + (matrix2.M13 - matrix1.M13) * amount;
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
var m1Row = Sse.LoadVector128(&matrix1.M11);
Member:

Do we need to have [StructLayout(LayoutKind.Sequential)] on the Matrix4x4 type? That way we are assured the members will be laid out in the assumed order?

Member:

The Roslyn Compiler emits StructLayout=Sequential by default for struct types. However, last I checked there wasn't anything enforcing this behavior in the language spec.

It would probably be a good idea to explicitly mark the numeric types as Sequential (as we do with types like Single and Int32).
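
For illustration, a sketch of what explicitly marking the layout would look like (only the fields are shown; the real Matrix4x4 declaration also has constructors, operators, and methods):

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Matrix4x4
{
    public float M11, M12, M13, M14;
    public float M21, M22, M23, M24;
    public float M31, M32, M33, M34;
    public float M41, M42, M43, M44;

    // ... constructors, operators, and methods omitted ...
}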

Matrix4x4 m = default;

Sse.Store(&m.M11,
Sse.Add(Sse.Add(Sse.Multiply(Sse.SetAllVector128(value1.M11), Sse.LoadVector128(&value2.M11)),
@tannergooding (Member), Aug 15, 2018:

Rather than calling SetAllVector128 multiple times, it is probably better to do a load row and permute (broadcast or permute in AVX, shuffle in SSE)
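
A sketch of that suggestion for the first output row (assuming an unsafe context where value1 and value2 are Matrix4x4 parameters and result is a local; Sse.Shuffle takes an immediate control byte, so 0x00 replicates element 0, 0x55 element 1, 0xAA element 2, and 0xFF element 3):

// Load row 1 of value1 once, then broadcast each of its elements to all four lanes.
Vector128<float> row = Sse.LoadVector128(&value1.M11);
Vector128<float> b11 = Sse.Shuffle(row, row, 0x00); // { M11, M11, M11, M11 }
Vector128<float> b12 = Sse.Shuffle(row, row, 0x55); // { M12, M12, M12, M12 }
Vector128<float> b13 = Sse.Shuffle(row, row, 0xAA); // { M13, M13, M13, M13 }
Vector128<float> b14 = Sse.Shuffle(row, row, 0xFF); // { M14, M14, M14, M14 }

// result.row1 = M11 * value2.row1 + M12 * value2.row2 + M13 * value2.row3 + M14 * value2.row4
Sse.Store(&result.M11,
    Sse.Add(
        Sse.Add(Sse.Multiply(b11, Sse.LoadVector128(&value2.M11)),
                Sse.Multiply(b12, Sse.LoadVector128(&value2.M21))),
        Sse.Add(Sse.Multiply(b13, Sse.LoadVector128(&value2.M31)),
                Sse.Multiply(b14, Sse.LoadVector128(&value2.M41)))));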

Member Author:

btw, it'd be nice to expose _MM_SHUFFLE macros to C# for Shuffle.
e.g. static byte _MM_SHUFFLE(byte fp3, byte fp2, byte fp1, byte fp0) => (byte)(((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)));

Member:

I've raised it before. It should go to API review soon

Member:

Agreed. @tannergooding and I had the same conclusion during dotnet/machinelearning#562. However, I didn't see an issue open for this in either corefx or coreclr. @tannergooding did you ever open one?

Here is what we proposed at the time:

 `Sse.Shuffle(op1, op2, Sse.GetShuffleControl(0, 3, 1, 2))` 
Something like that. `GetShuffleControl`, `BuildShuffleControl`, or something

Member Author:

So I've tried to use shuffle here but it became slower, maybe I did something wrong?
(screenshot of benchmark results omitted)

Member:

I'll take a look at the assembly and see if I can spot what is going wrong....


Member Author:

@tannergooding any update on _MM_SHUFFLE API? Could not find an issue for it

m.M44 = -value.M44;

return m;
if (Sse.IsSupported)
Member:

Why is this not just calling the Negate method (or vice-versa)?

Member:

Same for the operators below

@EgorBo (Member Author), Aug 15, 2018:

will Negate be inlined?

Member:

I believe the JIT will inline it so that the inner call is called directly, rather than trying to inline the inner call into the outer call.
If it doesn't, then this sounds like a good JIT bug to log.

@tannergooding (Member):

Also CC. @CarolEidt

value1.M31 != value2.M31 || value1.M32 != value2.M32 || value1.M33 != value2.M33 || value1.M34 != value2.M34 ||
value1.M41 != value2.M41 || value1.M42 != value2.M42 || value1.M43 != value2.M43 || value1.M44 != value2.M44);
}
public static bool operator !=(Matrix4x4 value1, Matrix4x4 value2) => !value1.Equals(value2);
Member Author:

I am not sure about this one, will check the jit output

}

/// <summary>
/// Returns a new matrix with the negated elements of the given matrix.
/// </summary>
/// <param name="value">The source matrix.</param>
/// <returns>The negated matrix.</returns>
public static Matrix4x4 Negate(Matrix4x4 value)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member:

I'm not sure about this. The code is probably small enough to be inlined in the hardware accelerated case, but not necessarily in the software case.

Member:

Same goes for other new AggressiveInlining attributes

Member Author:

yeah, but without that it basically adds a call to the software Negate operator. Are you suggesting duplicating the software implementation in both the - operator and Negate, i.e. something like if (Sse.IsSupported) { return -value; } else { /* duplicate software impl */ }?

Member:

What is the codegen you are seeing here?

Member Author:

@tannergooding Negate is not inlined 🙁 https://gist.github.com/EgorBo/a6cc52d5523d7a5fcf6fb1adfa5dfbc0
Negate_new is a benchmark method that does return Matrix4x4X.Negate(m1x);

Member Author:

um.. I expected Negate_new() to just call op_UnaryNegation directly
(screenshot omitted)

Member:

OK, I used jitutils to get PMI diffs, with public static unsafe Matrix4x4 Negate(Matrix4x4 value) => -value; and public static unsafe Matrix4x4 operator -(Matrix4x4 value) containing the actual implementation...

The following two tests (aside from not being "correct"):

[Fact]
public void Matrix4x4NegateTest()
{
    Matrix4x4 m = GenerateMatrixNumberFrom1To16();
    var actual = Matrix4x4.Negate(m);
    Assert.Equal(m, actual);
}

[Fact]
public void Matrix4x4op_NegateTest()
{
    Matrix4x4 m = GenerateMatrixNumberFrom1To16();
    var actual = -m;
    Assert.Equal(m, actual);
}

Both generate a direct call to Matrix4x4:op_UnaryNegation(struct):struct:

; Assembly listing for method Matrix4x4Tests:Matrix4x4NegateTest():this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;  V01 loc0         [V01    ] (  3,  3   )  struct (64) [rsp+0x1A8]   do-not-enreg[XSB] must-init addr-exposed
;  V02 loc1         [V02    ] (  2,  2   )  struct (64) [rsp+0x168]   do-not-enreg[XSB] must-init addr-exposed
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;  V04 tmp1         [V04,T01] (  2,  4   )  struct (64) [rsp+0x128]   do-not-enreg[SB]
;  V05 tmp2         [V05,T02] (  2,  4   )  struct (64) [rsp+0xE8]   do-not-enreg[SB]
;  V06 tmp3         [V06,T03] (  2,  4   )  struct (64) [rsp+0xA8]   do-not-enreg[SB]
;  V07 tmp4         [V07,T00] (  3,  6   )     ref  ->  rsi         class-hnd exact
;  V08 tmp5         [V08    ] (  4,  8   )  struct (64) [rsp+0x68]   do-not-enreg[XSB] addr-exposed
;  V09 tmp6         [V09,T04] (  2,  4   )    long  ->  rcx        
;  V10 tmp7         [V10    ] (  2,  4   )  struct (64) [rsp+0x28]   do-not-enreg[XSB] addr-exposed
;
; Lcl frame size = 488
G_M53037_IG01:
       push     rdi
       push     rsi
       sub      rsp, 488
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+168H]
       mov      ecx, 32
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
G_M53037_IG02:
       lea      rcx, bword ptr [rsp+1A8H]
       call     Matrix4x4Tests:GenerateMatrixNumberFrom1To16():struct
       vmovdqu  xmm0, qword ptr [rsp+1A8H]
       vmovdqu  qword ptr [rsp+128H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1B8H]
       vmovdqu  qword ptr [rsp+138H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1C8H]
       vmovdqu  qword ptr [rsp+148H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1D8H]
       vmovdqu  qword ptr [rsp+158H], xmm0
       lea      rcx, bword ptr [rsp+168H]
       vmovdqu  xmm0, qword ptr [rsp+128H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+138H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+148H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+158H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       lea      rdx, bword ptr [rsp+68H]
       call     Matrix4x4:op_UnaryNegation(struct):struct
       vmovdqu  xmm0, qword ptr [rsp+1A8H]
       vmovdqu  qword ptr [rsp+E8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1B8H]
       vmovdqu  qword ptr [rsp+F8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1C8H]
       vmovdqu  qword ptr [rsp+108H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+1D8H]
       vmovdqu  qword ptr [rsp+118H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+168H]
       vmovdqu  qword ptr [rsp+A8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+178H]
       vmovdqu  qword ptr [rsp+B8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+188H]
       vmovdqu  qword ptr [rsp+C8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+198H]
       vmovdqu  qword ptr [rsp+D8H], xmm0
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      rcx, rsi
       xor      rdx, rdx
       call     AssertEqualityComparer`1:.ctor(ref):this
       vmovdqu  xmm0, qword ptr [rsp+E8H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+F8H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+108H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+118H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+A8H]
       vmovdqu  qword ptr [rsp+28H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+B8H]
       vmovdqu  qword ptr [rsp+38H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+C8H]
       vmovdqu  qword ptr [rsp+48H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+D8H]
       vmovdqu  qword ptr [rsp+58H], xmm0
G_M53037_IG03:
       lea      rcx, bword ptr [rsp+68H]
       lea      rdx, bword ptr [rsp+28H]
       mov      r8, rsi
       call     Assert:Equal(struct,struct,ref)
       nop      
G_M53037_IG04:
       add      rsp, 488
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 579, prolog size 35 for method Matrix4x4Tests:Matrix4x4NegateTest():this
; ============================================================
Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x09
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x09 UnwindOp: UWOP_ALLOC_LARGE (1)     OpInfo: 0 - Scaled small  
      Size: 61 * 8 = 488 = 0x001E8
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)

and

; Assembly listing for method Matrix4x4Tests:Matrix4x4op_NegateTest():this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;  V01 loc0         [V01    ] (  3,  3   )  struct (64) [rsp+0x168]   do-not-enreg[XSB] must-init addr-exposed
;  V02 loc1         [V02    ] (  2,  2   )  struct (64) [rsp+0x128]   do-not-enreg[XSB] must-init addr-exposed
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;  V04 tmp1         [V04,T01] (  2,  4   )  struct (64) [rsp+0xE8]   do-not-enreg[SB]
;  V05 tmp2         [V05,T02] (  2,  4   )  struct (64) [rsp+0xA8]   do-not-enreg[SB]
;  V06 tmp3         [V06,T00] (  3,  6   )     ref  ->  rsi         class-hnd exact
;  V07 tmp4         [V07    ] (  4,  8   )  struct (64) [rsp+0x68]   do-not-enreg[XSB] addr-exposed
;  V08 tmp5         [V08,T03] (  2,  4   )    long  ->  rcx        
;  V09 tmp6         [V09    ] (  2,  4   )  struct (64) [rsp+0x28]   do-not-enreg[XSB] addr-exposed
;
; Lcl frame size = 424
G_M27564_IG01:
       push     rdi
       push     rsi
       sub      rsp, 424
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+128H]
       mov      ecx, 32
       xor      rax, rax
       rep stosd 
       mov      rcx, rsi
G_M27564_IG02:
       lea      rcx, bword ptr [rsp+168H]
       call     Matrix4x4Tests:GenerateMatrixNumberFrom1To16():struct
       lea      rcx, bword ptr [rsp+128H]
       vmovdqu  xmm0, qword ptr [rsp+168H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+178H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+188H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+198H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       lea      rdx, bword ptr [rsp+68H]
       call     Matrix4x4:op_UnaryNegation(struct):struct
       vmovdqu  xmm0, qword ptr [rsp+168H]
       vmovdqu  qword ptr [rsp+E8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+178H]
       vmovdqu  qword ptr [rsp+F8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+188H]
       vmovdqu  qword ptr [rsp+108H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+198H]
       vmovdqu  qword ptr [rsp+118H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+128H]
       vmovdqu  qword ptr [rsp+A8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+138H]
       vmovdqu  qword ptr [rsp+B8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+148H]
       vmovdqu  qword ptr [rsp+C8H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+158H]
       vmovdqu  qword ptr [rsp+D8H], xmm0
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      rcx, rsi
       xor      rdx, rdx
       call     AssertEqualityComparer`1:.ctor(ref):this
       vmovdqu  xmm0, qword ptr [rsp+E8H]
       vmovdqu  qword ptr [rsp+68H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+F8H]
       vmovdqu  qword ptr [rsp+78H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+108H]
       vmovdqu  qword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+118H]
       vmovdqu  qword ptr [rsp+98H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+A8H]
       vmovdqu  qword ptr [rsp+28H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+B8H]
       vmovdqu  qword ptr [rsp+38H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+C8H]
       vmovdqu  qword ptr [rsp+48H], xmm0
       vmovdqu  xmm0, qword ptr [rsp+D8H]
       vmovdqu  qword ptr [rsp+58H], xmm0
       lea      rcx, bword ptr [rsp+68H]
       lea      rdx, bword ptr [rsp+28H]
       mov      r8, rsi
       call     Assert:Equal(struct,struct,ref)
       nop      
G_M27564_IG03:
       add      rsp, 424
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code 499, prolog size 35 for method Matrix4x4Tests:Matrix4x4op_NegateTest():this
; ============================================================
Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x09
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x09 UnwindOp: UWOP_ALLOC_LARGE (1)     OpInfo: 0 - Scaled small  
      Size: 53 * 8 = 424 = 0x001A8
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)

Member:

Which is exactly what I expected.

@EgorBo, could you fix them up accordingly, since the JIT already does the right thing.

Member Author:

@tannergooding um.. so how should I fix it without regressing the software fallback (will it be inlined into Negate if SSE is not available?)? Isn't the current implementation what you meant:

if (Sse.IsSupported)
{
return -value;
}
else
{
Matrix4x4 result;
result.M11 = -value.M11;
result.M12 = -value.M12;
result.M13 = -value.M13;
result.M14 = -value.M14;
result.M21 = -value.M21;
result.M22 = -value.M22;
result.M23 = -value.M23;
result.M24 = -value.M24;
result.M31 = -value.M31;
result.M32 = -value.M32;
result.M33 = -value.M33;
result.M34 = -value.M34;
result.M41 = -value.M41;
result.M42 = -value.M42;
result.M43 = -value.M43;
result.M44 = -value.M44;
return result;
}

Member:

I'm not sure what you mean, in both cases public static unsafe Matrix4x4 Negate(Matrix4x4 value) => -value; will result in a call directly to op_Negate.

The JIT will then determine if op_Negate is appropriate to inline or not (currently it does not). It will result in code that is simpler to maintain and produces the same codegen as is being produced today (for Release builds)


return m;
}
public static Matrix4x4 operator -(Matrix4x4 value) => Negate(value);
Member:

Nit: I think the normal case would be to have the implementation in the operator and have Negate call it (more people likely use the operator than call the "friendly name")

Member Author:

Makes sense!
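
A sketch of that arrangement for Negate, assuming the preview SetZeroVector128 API requested earlier in this thread (the scalar fallback is written as a pointer loop here purely for brevity; the real code negates the sixteen fields explicitly):

public static unsafe Matrix4x4 operator -(Matrix4x4 value)
{
    if (Sse.IsSupported)
    {
        // Negate each row by subtracting it from zero, writing back over the by-value copy.
        Vector128<float> zero = Sse.SetZeroVector128();
        Sse.Store(&value.M11, Sse.Subtract(zero, Sse.LoadVector128(&value.M11)));
        Sse.Store(&value.M21, Sse.Subtract(zero, Sse.LoadVector128(&value.M21)));
        Sse.Store(&value.M31, Sse.Subtract(zero, Sse.LoadVector128(&value.M31)));
        Sse.Store(&value.M41, Sse.Subtract(zero, Sse.LoadVector128(&value.M41)));
        return value;
    }

    // Software fallback: negate all sixteen sequential float fields.
    float* p = &value.M11;
    for (int i = 0; i < 16; i++)
    {
        p[i] = -p[i];
    }
    return value;
}

// The friendly-named method then simply forwards to the operator:
public static Matrix4x4 Negate(Matrix4x4 value) => -value;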

result.M12 = matrix1.M12 + (matrix2.M12 - matrix1.M12) * amount;
result.M13 = matrix1.M13 + (matrix2.M13 - matrix1.M13) * amount;
result.M14 = matrix1.M14 + (matrix2.M14 - matrix1.M14) * amount;
if (Sse.IsSupported)
@danmoseley (Member), Aug 15, 2018:

General question, more for @tannergooding: what is our test strategy for the software fallback code? I guess Sse.IsSupported is always true for us on x86/x64, so the test strategy is that the fallback is exercised when the tests run on ARM?

If we later have a codepath for ARM intrinsics also, we will need a new strategy.

Perhaps there's a way to force the runtime to lie and return false for this.

Member Author:

btw, initially I wanted to do if (Sse.IsSupported) {} else if (Arm.Simd.IsSupported) {} else {} 🙂

Member:

Setting COMPlus_FeatureSIMD=0 should cover the Sse-Avx2 HWIntrinsics. I don't think we have a switch to cover all HWIntrinsics, however.

CC. @CarolEidt

Member:

The ARM intrinsics still need to be reviewed and implemented, before they can be used.

if (Sse.IsSupported)
{
return
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) != 0xF ||
Member:

MoveMask(CompareNotEqual()) != 0 is probably more efficient

Member Author:

Just tested, you are right, it's 30% more efficient. Also, I tried adding some fast-out paths (comparing the first field or the first row with a simple m1.M11 == m2.M11 && ...), but it gave a minor improvement in some cases and a major regression in others.

@@ -1824,26 +1852,33 @@ public static Matrix4x4 Lerp(Matrix4x4 matrix1, Matrix4x4 matrix2, float amount)
/// <returns>The negated matrix.</returns>
public static Matrix4x4 Negate(Matrix4x4 value)
Member:

Not sure if you saw, since the thread is now "hidden" by GitHub, but I commented here (with disassembly using current head of CoreCLR) that the JIT does the right thing for inlining public static Matrix4x4 Negate(Matrix4x4 value) => -value

Member Author:

yeah it's hard to follow when github collapses them 🙂

Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) != 0xF ||
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) != 0xF ||
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) != 0xF;
Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) == 0 ||
Member:

Pretty sure this one should be != 0. CompareNotEqual does (a != b) ? 0xFFFFFFFF : 0, so it will be non-zero if one of the rows in value1 != the corresponding row in value2.

Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) == 0xF &&
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) == 0xF &&
Sse.MoveMask(Sse.CompareEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) == 0xF;
Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) != 0 &&
Member:

Pretty sure this one should be == 0. CompareNotEqual does (a != b) ? 0xFFFFFFFF : 0, so it will be zero if all of the rows in value1 == the corresponding rows in value2.
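
Putting those two corrections together, the hardware path of the equality check would look roughly like this (a sketch, inside an unsafe method; every row must produce an all-zero CompareNotEqual mask for the matrices to be equal):

bool equal =
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M11), Sse.LoadVector128(&value2.M11))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M21), Sse.LoadVector128(&value2.M21))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M31), Sse.LoadVector128(&value2.M31))) == 0 &&
    Sse.MoveMask(Sse.CompareNotEqual(Sse.LoadVector128(&value1.M41), Sse.LoadVector128(&value2.M41))) == 0;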

@@ -2346,9 +2181,9 @@ public bool Equals(Matrix4x4 other)
/// <returns>True if the Object is equal to this matrix; False otherwise.</returns>
public override bool Equals(object obj)
{
if (obj is Matrix4x4)
if (obj is Matrix4x4 m)
Member:

Nit: You could do this in one line with (obj is Matrix4x4 other) && (this == other)
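
i.e., a sketch of the suggested one-liner:

public override bool Equals(object obj) => (obj is Matrix4x4 other) && (this == other);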

Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34));
Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12));

return matrix;
Member:

We also end up with a "sub-optimal" return block:

       vmovdqu  xmm0, qword ptr [rdx]
       vmovdqu  qword ptr [rcx], xmm0
       vmovdqu  xmm0, qword ptr [rdx+16]
       vmovdqu  qword ptr [rcx+16], xmm0
       vmovdqu  xmm0, qword ptr [rdx+32]
       vmovdqu  qword ptr [rcx+32], xmm0
       vmovdqu  xmm0, qword ptr [rdx+48]
       vmovdqu  qword ptr [rcx+48], xmm0

I would hope we could be smart enough to have the above stores go directly to the return buffer...
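
(For context, the h12/h34 values in the quoted diff come from the standard SSE 4x4 float transpose built on UnpackLow/UnpackHigh plus MoveLowToHigh/MoveHighToLow; a sketch of the whole kernel, assuming an unsafe in-place transpose over a local matrix and not necessarily the exact PR code:)

Vector128<float> row1 = Sse.LoadVector128(&matrix.M11);
Vector128<float> row2 = Sse.LoadVector128(&matrix.M21);
Vector128<float> row3 = Sse.LoadVector128(&matrix.M31);
Vector128<float> row4 = Sse.LoadVector128(&matrix.M41);

Vector128<float> l12 = Sse.UnpackLow(row1, row2);   // { M11, M21, M12, M22 }
Vector128<float> l34 = Sse.UnpackLow(row3, row4);   // { M31, M41, M32, M42 }
Vector128<float> h12 = Sse.UnpackHigh(row1, row2);  // { M13, M23, M14, M24 }
Vector128<float> h34 = Sse.UnpackHigh(row3, row4);  // { M33, M43, M34, M44 }

Sse.Store(&matrix.M11, Sse.MoveLowToHigh(l12, l34)); // { M11, M21, M31, M41 }
Sse.Store(&matrix.M21, Sse.MoveHighToLow(l34, l12)); // { M12, M22, M32, M42 }
Sse.Store(&matrix.M31, Sse.MoveLowToHigh(h12, h34)); // { M13, M23, M33, M43 }
Sse.Store(&matrix.M41, Sse.MoveHighToLow(h34, h12)); // { M14, M24, M34, M44 }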


{
public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
{
if (Sse.IsSupported)
Member:

It's probably sufficient to just Debug.Assert(Sse.IsSupported), since this is internal and we have already checked the code higher up.
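
Applied to the helper sketched earlier, that suggestion would look roughly like this (a sketch; Debug.Assert documents the invariant in Debug builds and compiles away in Release):

using System.Diagnostics;

public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
{
    // Callers are expected to have already taken the Sse.IsSupported path.
    Debug.Assert(Sse.IsSupported);
    return Sse.Add(a, Sse.Multiply(Sse.Subtract(b, a), t));
}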

@fiigii (Contributor) commented Aug 17, 2018:

IMO, the current implementation has a very inefficient code pattern: it loads the fields of Matrix4x4 at the beginning of a function and stores the results back at the end. Each operation can generate many memory accesses, so scenarios that heavily use Matrix4x4 will possibly become memory-bound.

So, the more efficient implementation would be to define Matrix4x4 with 4 Vector128<float> fields, and have the operations read the Vector128<float> fields directly and then return a new Matrix4x4(...). Because of RyuJIT's struct promotion optimization and inlining, all the loads of a local struct's fields and the "new" of the result would be eliminated (replaced by direct SIMD register operations). In my experience, that makes some scenarios much faster and more CPU-bound.

cc @tannergooding @CarolEidt
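
A sketch of the layout being described, to make the suggestion concrete (hypothetical; as discussed below, this change could not actually be taken because the sixteen float fields are public API):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical alternative: four row vectors instead of sixteen float fields,
// so whole rows can stay in SIMD registers via struct promotion and inlining.
public struct Matrix4x4
{
    private Vector128<float> _row1, _row2, _row3, _row4;

    public static Matrix4x4 Add(Matrix4x4 a, Matrix4x4 b)
    {
        Matrix4x4 result = default;
        result._row1 = Sse.Add(a._row1, b._row1);
        result._row2 = Sse.Add(a._row2, b._row2);
        result._row3 = Sse.Add(a._row3, b._row3);
        result._row4 = Sse.Add(a._row4, b._row4);
        return result;
    }
}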

@EgorBo (Member Author) commented Aug 17, 2018:

@fiigii how would access to a single field (e.g. M11) look?
UPD: a small prototype: https://gist.github.com/EgorBo/658bf41aad52af45866272037d1973af
Will it still be interop-friendly?
E.g. in my pet project I have:

var m4x4prj = *(Matrix4x4*)(void*)&appleMatrix;
m4x4prj.M43 /= 2f;
m4x4prj.M33 = far / (far - near);
m4x4prj.M34 *= -1;
m4x4prj = Matrix4x4.Transpose(m4x4prj);

Camera.SetProjection(&m4x4prj.M11);

@tannergooding (Member):

@fiigii, I don't believe we can take such a change because the various System.Numerics.Vector types were shipped with their fields publicly exposed.

We cannot:

  • Remove the existing fields, as it is a major binary breaking change
  • Add new fields with explicit layout, as it would
    • Break any existing code using these as interop types (generics can't be marshaled today)
    • Likely regress perf in other areas, due to JIT limitations with explicit layout structs

This also prevents other changes that likely would have been generally good for these types, such as marking the structs as readonly, but it is the state of the world and we will need to work with it 😄

@tannergooding (Member) commented Aug 17, 2018:

I would hope that, instead, we could make the JIT smarter for these types of things. Having __vectorcall support, for example, would allow the values to be passed around in register and would elide many of the memory concerns (https://github.com/dotnet/coreclr/issues/12120).

CC. @CarolEidt

@tannergooding (Member) left a review comment:

Overall LGTM.

The perf improvements look great and show just how useful adding HWIntrinsic code paths can be (and how simple it can be, as compared to adding direct JIT support instead).

There were a couple codegen issues this exposed, that we can hopefully get resolved longer term.

@tannergooding (Member):

@EgorBo, are there any other changes you were looking to make to this PR?

@EgorBo (Member Author) commented Aug 17, 2018:

@tannergooding let me switch back to Shuffle-based impl for Multiply 🙂

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
#if HAS_INTRINSICS
Member:

Instead of completely #if on this whole file, we should only include it when it is used.

So, we can remove this #if, and in the .csproj add it to <Compile> when $(TargetsNetCoreApp).

Member Author:

@eerhardt yeah, I was thinking about it, but I guess I have to declare a new property (something like $(IntrinsicsSupport)) in the csproj, since we rely on the HAS_INTRINSICS symbol.

{
internal static class VectorMath
{
public static Vector128<float> Lerp(Vector128<float> a, Vector128<float> b, Vector128<float> t)
Member:

I've seen guidance on other PRs that putting [MethodImplAttribute(MethodImplOptions.AggressiveInlining)] on functions that take/return Vector128 is a good idea.

See briancylui/machinelearning#3 (comment) where we were seeing perf degradation when we refactored some "helper" vector methods.

Member:

It can help, but looking at the existing codegen it looks good/correct as is.

Member:

(that is, the method is small enough to be automatically inlined)

Member Author:

Yeah, I've just checked both outputs: both Lerp and Equals/NotEquals are inlined.

Member:

OK, thanks for checking. It's good to know that this refactoring didn't hurt perf.

@eerhardt (Member) left a review comment:

Looks good @EgorBo. I just had 2 relatively minor comments. The rest looks great.

@EgorBo (Member Author) commented Aug 17, 2018:

I've updated the benchmark results in the description with the latest changes (e.g. Shuffle made operator* 20% faster) and the latest runtime, and added macOS numbers.
The macOS numbers are a little worse because most of the intrinsics used have better throughput on Coffee Lake (e.g. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_ps&expand=127).

@tannergooding (Member) left a review comment:

New changes LGTM as well.

@eerhardt, I'll let you merge since I'll be OOF shortly

@eerhardt eerhardt merged commit 5ab0d37 into dotnet:master Aug 17, 2018
@CarolEidt:

@EgorBo - thanks! This is great to see.
@tannergooding @eerhardt - thanks for getting this reviewed and merged. I've commented on a couple of the JIT issues that were raised. Let me know if there are others.

@karelz karelz added this to the 3.0 milestone Aug 21, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Optimize some Matrix4x4 operations with SSE

* fix typo

* use SetZeroVector128 instead of SetAllVector128(0.0f)

* [StructLayout(LayoutKind.Sequential)] on top of Matrix4x4

* collapse operators (call corresponding methods)

* mark operators with [MethodImpl(MethodImplOptions.AggressiveInlining)]

* remove [MethodImpl(MethodImplOptions.AggressiveInlining)], call operators inside static methods if SSE is enabled (operators should be small enough to be inlined)

* overwrite value1 in Multiply instead of new instance

* Optimize == and !=

* prefer CompareNotEqual than CompareEqual

* fix typo in == and !=

* clean up methods-operators

* optimize Transpose

* simplify Equals

* improve Transpose

* Lerp as a separate method

* remove SSE from != as it fails some NaN-related tests

* remove unsafe from !=

* wrap intrinsics with #if netcoreapp

* forgot Letp method and usings

* define netcoreapp in csproj

* rename netcoreapp symbol to HAS_INTRINSICS

* Move Equal and Lerp to VectorMath.cs internal helper

* Implement != operator

* Debug.Assert(Sse.IsSupported) in VectorMath

* replace SetAllVector with Shuffle in Multiply

* remove #if HAS_INTRINSICS from VectorMath

* fix indention in System.Numerics.Vectors.csproj


Commit migrated from dotnet/corefx@5ab0d37