Porting additional SIMD Intrinsics to use SimdAsHWIntrinsic #37882

Merged: 25 commits, Jul 3, 2020

tannergooding
Member

This ports additional intrinsics, such as SIMDIntrinsicInit, SIMDIntrinsicGetOne, and SIMDIntrinsicDot, to use SimdAsHWIntrinsic.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 15, 2020
@tannergooding tannergooding force-pushed the simd-as-hwintrinsic branch 3 times, most recently from 4044add to e98213d Compare June 17, 2020 19:16
@tannergooding tannergooding reopened this Jun 17, 2020
@tannergooding tannergooding reopened this Jun 17, 2020
@tannergooding tannergooding marked this pull request as ready for review June 19, 2020 22:08
@tannergooding
Member Author

CC. @CarolEidt, @echesakovMSFT

@tannergooding
Member Author

Benchmarks x64

Total bytes of diff: -146 (-0.029% of base)
    diff is an improvement.

Top file improvements (bytes):
         -63 : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-0.214% of base)
         -51 : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.210% of base)
         -32 : diff\BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-1.812% of base)

3 total files with Code Size differences (3 improved, 0 regressed), 79 unchanged.

Top method improvements (bytes):
         -32 (-5.024% of base) : diff\BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:DoBench(int,int):ref
         -12 (-1.469% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
         -12 (-3.183% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
          -9 (-0.532% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
          -7 (-0.657% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
          -7 (-0.657% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
          -7 (-1.220% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
          -6 (-0.560% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
          -6 (-0.546% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
          -6 (-1.056% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this
          -6 (-1.047% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
          -4 (-1.408% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
          -4 (-2.899% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
          -4 (-9.524% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Dot(Vector,Vector):float
          -4 (-10.000% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Mag(Vector):float
          -4 (-3.509% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Norm(Vector):Vector
          -4 (-7.018% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Equals(Vector,Vector):bool
          -3 (-0.275% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
          -2 (-0.176% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
          -1 (-2.564% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:.cctor()

Top method improvements (percentages):
          -4 (-10.000% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Mag(Vector):float
          -4 (-9.524% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Dot(Vector,Vector):float
          -4 (-7.018% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Equals(Vector,Vector):bool
         -32 (-5.024% of base) : diff\BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:DoBench(int,int):ref
          -4 (-3.509% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Norm(Vector):Vector
         -12 (-3.183% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
          -4 (-2.899% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
          -1 (-2.564% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:.cctor()
          -1 (-2.564% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:.cctor()
          -1 (-2.564% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:.cctor()
         -12 (-1.469% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
          -4 (-1.408% of base) : diff\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
          -7 (-1.220% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
          -6 (-1.056% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this
          -6 (-1.047% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
          -7 (-0.657% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
          -7 (-0.657% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
          -6 (-0.560% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
          -6 (-0.546% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
          -9 (-0.532% of base) : diff\SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)

Frameworks x64

Found 266 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -616 (-0.001% of base)
    diff is an improvement.

Top file improvements (bytes):
        -547 : diff\System.Private.CoreLib.dasm (-0.012% of base)
         -63 : diff\System.Collections.dasm (-0.014% of base)
          -6 : diff\System.Text.Json.dasm (-0.001% of base)

3 total files with Code Size differences (3 improved, 0 regressed), 260 unchanged.

Top method regressions (bytes):
          43 (860.000% of base) : diff\System.Private.CoreLib.dasm - Vector:Dot(Vector`1,Vector`1):short
          29 (580.000% of base) : diff\System.Private.CoreLib.dasm - Vector:Dot(Vector`1,Vector`1):Vector`1
           2 (0.699% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Default(long,long):long
           2 (0.697% of base) : diff\System.Private.CoreLib.dasm - Latin1Utility:GetIndexOfFirstNonLatin1Char_Default(long,long):long

Top method improvements (bytes):
        -106 (-23.297% of base) : diff\System.Private.CoreLib.dasm - Vector:Multiply(Vector`1,Vector`1):Vector`1 (8 methods)
         -89 (-9.966% of base) : diff\System.Private.CoreLib.dasm - Vector`1:.cctor() (6 methods)
         -52 (-5.640% of base) : diff\System.Private.CoreLib.dasm - Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1 (8 methods)
         -47 (-24.227% of base) : diff\System.Collections.dasm - BitArray:.cctor()
         -24 (-2.067% of base) : diff\System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -24 (-1.489% of base) : diff\System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
         -20 (-18.182% of base) : diff\System.Private.CoreLib.dasm - Vector128`1:get_AllBitsSet():Vector128`1 (6 methods)
         -20 (-16.000% of base) : diff\System.Private.CoreLib.dasm - Vector256`1:get_AllBitsSet():Vector256`1 (6 methods)
         -12 (-0.787% of base) : diff\System.Collections.dasm - BitArray:CopyTo(Array,int):this
          -8 (-1.292% of base) : diff\System.Private.CoreLib.dasm - Matrix4x4:CreateBillboard(Vector3,Vector3,Vector3,Vector3):Matrix4x4
          -8 (-1.235% of base) : diff\System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
          -8 (-1.569% of base) : diff\System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
          -8 (-2.817% of base) : diff\System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
          -8 (-30.769% of base) : diff\System.Private.CoreLib.dasm - Vector2:Length():float:this
          -8 (-36.364% of base) : diff\System.Private.CoreLib.dasm - Vector2:LengthSquared():float:this
          -8 (-22.222% of base) : diff\System.Private.CoreLib.dasm - Vector2:Distance(Vector2,Vector2):float
          -8 (-25.000% of base) : diff\System.Private.CoreLib.dasm - Vector2:DistanceSquared(Vector2,Vector2):float
          -8 (-14.815% of base) : diff\System.Private.CoreLib.dasm - Vector3:Distance(Vector3,Vector3):float
          -8 (-16.000% of base) : diff\System.Private.CoreLib.dasm - Vector3:DistanceSquared(Vector3,Vector3):float
          -8 (-30.769% of base) : diff\System.Private.CoreLib.dasm - Vector4:Length():float:this

Top method regressions (percentages):
          43 (860.000% of base) : diff\System.Private.CoreLib.dasm - Vector:Dot(Vector`1,Vector`1):short
          29 (580.000% of base) : diff\System.Private.CoreLib.dasm - Vector:Dot(Vector`1,Vector`1):Vector`1
           2 (0.699% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Default(long,long):long
           2 (0.697% of base) : diff\System.Private.CoreLib.dasm - Latin1Utility:GetIndexOfFirstNonLatin1Char_Default(long,long):long

Top method improvements (percentages):
          -8 (-36.364% of base) : diff\System.Private.CoreLib.dasm - Vector2:LengthSquared():float:this
          -8 (-36.364% of base) : diff\System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this
          -8 (-30.769% of base) : diff\System.Private.CoreLib.dasm - Vector2:Length():float:this
          -8 (-30.769% of base) : diff\System.Private.CoreLib.dasm - Vector4:Length():float:this
          -8 (-30.769% of base) : diff\System.Private.CoreLib.dasm - Vector4:DistanceSquared(Vector4,Vector4):float
          -8 (-26.667% of base) : diff\System.Private.CoreLib.dasm - Vector4:Distance(Vector4,Vector4):float
          -8 (-25.000% of base) : diff\System.Private.CoreLib.dasm - Vector2:DistanceSquared(Vector2,Vector2):float
         -47 (-24.227% of base) : diff\System.Collections.dasm - BitArray:.cctor()
        -106 (-23.297% of base) : diff\System.Private.CoreLib.dasm - Vector:Multiply(Vector`1,Vector`1):Vector`1 (8 methods)
          -8 (-22.222% of base) : diff\System.Private.CoreLib.dasm - Vector2:Distance(Vector2,Vector2):float
         -20 (-18.182% of base) : diff\System.Private.CoreLib.dasm - Vector128`1:get_AllBitsSet():Vector128`1 (6 methods)
          -8 (-16.000% of base) : diff\System.Private.CoreLib.dasm - Vector3:DistanceSquared(Vector3,Vector3):float
         -20 (-16.000% of base) : diff\System.Private.CoreLib.dasm - Vector256`1:get_AllBitsSet():Vector256`1 (6 methods)
          -8 (-14.815% of base) : diff\System.Private.CoreLib.dasm - Vector3:Distance(Vector3,Vector3):float
          -4 (-14.286% of base) : diff\System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this
          -4 (-12.500% of base) : diff\System.Private.CoreLib.dasm - Vector3:Length():float:this
          -4 (-12.121% of base) : diff\System.Private.CoreLib.dasm - Vector:Dot(Vector`1,Vector`1):double
          -4 (-10.526% of base) : diff\System.Private.CoreLib.dasm - Vector2:Equals(Vector2):bool:this
          -4 (-10.526% of base) : diff\System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4
          -4 (-10.256% of base) : diff\System.Private.CoreLib.dasm - Vector2:op_Inequality(Vector2,Vector2):bool

@tannergooding
Member Author

tannergooding commented Jun 19, 2020

I'm still working on getting perf numbers, but this should resolve the following perf regression: #37425

@tannergooding
Member Author

tannergooding commented Jun 19, 2020

Most of the diffs are essentially the following:

-       vmovaps  xmm1, xmm0
-       vdpps    xmm1, xmm0, 113
-       vmovaps  xmm0, xmm1
+       vdpps    xmm0, xmm0, xmm0, 113
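
For readers unfamiliar with the immediate: 113 is 0x71. In `dpps`, the high four bits of the immediate select which lanes are multiplied and summed, and the low four bits select which result lanes receive the sum. So 0x71 sums lanes 0 through 2 and writes the total to lane 0 only, which is the shape of a three-element dot product. Below is a scalar model of that behavior, as a sketch only: the `dpps_model` name is made up for illustration, it is not JIT code, and it ignores the hardware's floating-point summation order.

```cpp
#include <array>
#include <cstdint>

// Scalar model of the 128-bit dpps instruction (illustrative, not JIT code).
// imm bits 4-7: which lanes take part in the sum of products.
// imm bits 0-3: which result lanes receive the sum (the others become 0).
std::array<float, 4> dpps_model(const std::array<float, 4>& a,
                                const std::array<float, 4>& b,
                                std::uint8_t imm)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
    {
        if (imm & (1u << (4 + i)))
        {
            sum += a[i] * b[i];
        }
    }

    std::array<float, 4> result{};
    for (int i = 0; i < 4; ++i)
    {
        result[i] = (imm & (1u << i)) ? sum : 0.0f;
    }
    return result;
}
```

With both inputs set to {1, 2, 3, 4} and an immediate of 113, the model returns {14, 0, 0, 0}: lane 3 is left out of the sum and only lane 0 is written.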

There are also a few similar to:

        vmovmskps xrax, xmm0
-       mov      edx, 7
-       and      eax, edx
+       and      eax, 7
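
The mask of 7 keeps the low three bits of the `vmovmskps` result (one sign bit per lane), which suggests these sites inspect three lanes of a comparison result. That is an assumption, since the diff does not name the methods or show the instruction that follows. A scalar sketch of the pattern, with hypothetical helper names:

```cpp
#include <array>
#include <cstdint>

// Scalar model of movmskps: gather the sign bit of each 32-bit lane
// into the low four bits of an integer (illustrative helper).
std::uint32_t movmskps_model(const std::array<std::uint32_t, 4>& lanes)
{
    std::uint32_t mask = 0;
    for (int i = 0; i < 4; ++i)
    {
        mask |= (lanes[i] >> 31) << i;
    }
    return mask;
}

// Hypothetical consumer: true when lanes 0-2 all have their sign bit set.
// Lane 3 is ignored because of the "& 7".
bool first_three_lanes_set(const std::array<std::uint32_t, 4>& lanes)
{
    return (movmskps_model(lanes) & 7u) == 7u;
}
```

The improvement in the diff itself is only that the constant 7 is now encoded directly in the `and` instruction instead of being loaded into `edx` first.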

Likewise a few that go from:
```diff
-       mov      eax, 0xD1FFAB1E
-       vmovd    xmm0, eax
-       vpbroadcastd ymm0, ymm0
+       vmovupd  ymm0, ymmword ptr[reloc @RWD32]
```

A more extreme example:

       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rsp+48H], xmm0
       vmovdqu  xmmword ptr [rsp+58H], xmm0
       lea      rcx, bword ptr [rsp+48H]
       call     Vector`1:.ctor(Vector`1):this
       mov      rcx, rdi
       vmovdqu  xmm0, xmmword ptr [rsp+48H]
       vmovdqu  xmmword ptr [rsp+28H], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+58H]
       vmovdqu  xmmword ptr [rsp+38H], xmm0
       lea      rdx, bword ptr [rsp+28H]
       mov      r8, rsi
       call     Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1
       mov      rax, rdi

To:

       vmovupd  ymm0, ymmword ptr[rdx]
       vbroadcastss ymm0, ymm0
       vmovupd  ymmword ptr[rsp+50H], ymm0
       mov      rcx, rsi
       vmovupd  ymm0, ymmword ptr[rsp+50H]
       vmovupd  ymmword ptr[rsp+20H], ymm0
       lea      rdx, bword ptr [rsp+20H]
       call     Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1
       mov      rax, rsi

The Vector:Dot regressions are because ushort is recognized as an intrinsic on x86, where it wasn't previously.

SIMD_INTRINSIC("get_One", false, GetOne, "one", TYP_STRUCT, 0, {TYP_VOID, TYP_UNDEF, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG})
SIMD_INTRINSIC("get_Zero", false, GetZero, "zero", TYP_STRUCT, 0, {TYP_VOID, TYP_UNDEF, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG})
SIMD_INTRINSIC("get_AllOnes", false, GetAllOnes, "allOnes", TYP_STRUCT, 0, {TYP_VOID, TYP_UNDEF, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG})

// .ctor call or newobj - there are four forms.
// This form takes the object plus a value of the base (element) type:
SIMD_INTRINSIC(".ctor", true, Init, "init", TYP_VOID, 2, {TYP_BYREF, TYP_UNKNOWN, TYP_UNDEF}, {TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_LONG, TYP_USHORT, TYP_UBYTE, TYP_BYTE, TYP_SHORT, TYP_UINT, TYP_ULONG})
Member Author

Fully removing SIMDIntrinsicInit and removing gtGetSIMDZero requires a bit more work. I logged #37043 as the more general issue.

Member Author

I think that, beyond this PR, any other improvements should likely hold off for .NET 6.

@tannergooding
Member Author

CC. @dotnet/jit-contrib

This should be ready for review.

Contributor

@CarolEidt CarolEidt left a comment

Mostly minor comment stuff.

}
}

if (isSupported && (intrinsic == NI_Vector256_ToScalar))
Contributor

It seems like it would make more sense to just split out this case (as it was previously), even with a separate check for the long types.

Member Author

Maybe the confusing part here is that the AVX check isn't checking for support of a particular instruction; it's just a "we support Vector256<T>" check.

The instruction emitted is the same for 128-bit and 256-bit vectors. We support Vector256<T> whenever AVX is supported, so the code really is identical (we even emit the register access as a 128-bit access).

Contributor

My point is that (AFAICT) the previous calls to compExactlyDependsOn are unnecessary for the NI_Vector256_ToScalar case. So perhaps just checking for that first would make more sense.

Member Author

@tannergooding tannergooding Jul 2, 2020

That's generally true except for the SSE2_X64 case, as we won't have AVX_X64 until #38460 goes in (although you also can't disable SSE2.X64, you can only disable SSE2 itself).

If duplicated, the compExactlyDependsOn(SSE2) checks would become compExactlyDependsOn(AVX) and the compExactlyDependsOn(SSE2_X64) check would become compExactlyDependsOn(AVX) && compExactlyDependsOn(SSE2_X64). Once #38460 is merged, it could be simplified to just compExactlyDependsOn(AVX_X64)
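
A rough sketch of what the duplicated form amounts to, reduced to the SSE2, SSE2_X64, and AVX checks named above. Everything here is a stand-in for illustration: the `Isa` enum, `dependsOn`, and `toScalarSupported` are hypothetical names rather than the JIT's actual code, and the split between "long" and "other" element types is an assumption based on the earlier mention of the long types.

```cpp
#include <set>

// Hypothetical stand-ins for the JIT's instruction-set identifiers.
enum class Isa { SSE2, SSE2_X64, AVX };

// Stand-in for a compExactlyDependsOn-style query: here it only reports
// whether the ISA is in the enabled set.
bool dependsOn(const std::set<Isa>& enabled, Isa isa)
{
    return enabled.count(isa) != 0;
}

// ToScalar support as described above, before AVX_X64 exists.
// is256:  Vector256<T> rather than Vector128<T>.
// isLong: the element type is a 64-bit integer.
bool toScalarSupported(const std::set<Isa>& enabled, bool is256, bool isLong)
{
    if (is256)
    {
        // AVX here means "Vector256<T> is supported"; long still needs SSE2_X64.
        return isLong ? (dependsOn(enabled, Isa::AVX) && dependsOn(enabled, Isa::SSE2_X64))
                      : dependsOn(enabled, Isa::AVX);
    }
    return isLong ? dependsOn(enabled, Isa::SSE2_X64) : dependsOn(enabled, Isa::SSE2);
}
```

On a target where SSE2 and AVX are enabled but SSE2_X64 is not, this model accepts Vector256 ToScalar for non-long element types and rejects it for long, which is the one case where the extra check matters.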

Contributor

So is it correct that only the SSE2_X64 check is non-redundant for the AVX case? It still seems confusing to me, and a lot of unnecessary checks for that case.

Member Author

Right, only the SSE2_X64 case is non-redundant. I just pushed a fix that breaks it apart, which should make that distinction clearer.

@tannergooding tannergooding force-pushed the simd-as-hwintrinsic branch from 6db4ec3 to 3ac2b4a Compare July 2, 2020 20:00
Contributor

@CarolEidt CarolEidt left a comment

LGTM - thanks for the comments and code shufflings!
