[WIP] Remove tailcall limitations on unix64 and arm64 #25932

jakobbotsch · 2019-07-30T03:30:10Z

No description provided.

jakobbotsch · 2019-07-30T19:48:40Z

The basic problem is the following: PUTARG_STK will write values into the arg space area which, for fast tail calls, is the area of the incoming args. A PUTARG_STK may therefore end up overwrite (parts of) an argument that is used later in the call sequence. On Windows the ABI allows us to know that, if we are doing a PUTARG_STK for argument at index i, then this will overwrite the ith argument to the caller. Thus it is easy to introduce a temp for the ith arg if it is used later than the PUTARG_STK.

On Unix this assumption about the ABI does not hold. The argument position does not tell us anything about where the argument will go on the stack; for instance, void foo(S32 a, int b) and void foo(int b, S32 a) will both have a in the first 4 stack slots and b in a register.

This PR fixes this in a simple way: just introduce temps for any stack arg with uses before tailcalls. The harder way is to figure out if they are read after they would be overwritten by a PUTARG_STK and only introduce a temp in that situation. This is what the previous Windows implementation did based on the argument position.

Generally this might introduce copies in cases where there previously would have been none. However since we now allow tailcalls in more situations, this still turns out to be a net win space wise on unix64 (following is PMI over frameworks):

Found 38 files with textual diffs.

Summary:
(Lower is better)

Total bytes of diff: -296 (-0.001% of base)
    diff is an improvement.

Top file regressions by size (bytes):
        1547 : System.Linq.Parallel.dasm (0.089% of base)
         335 : System.Data.Common.dasm (0.022% of base)
         262 : Microsoft.CodeAnalysis.dasm (0.014% of base)
          44 : System.Numerics.Vectors.dasm (0.089% of base)
          42 : System.Threading.Tasks.Parallel.dasm (0.025% of base)

Top file improvements by size (bytes):
       -1002 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.017% of base)
        -598 : Microsoft.CodeAnalysis.CSharp.dasm (-0.013% of base)
        -516 : System.Private.CoreLib.dasm (-0.011% of base)
        -113 : System.Collections.Immutable.dasm (-0.010% of base)
         -49 : System.Reflection.Metadata.dasm (-0.011% of base)

35 total files with size differences (21 improved, 14 regressed), 94 unchanged.

Top method regressions by size (bytes):
         490 (3.163% of base) : System.Linq.Parallel.dasm - FirstQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         490 (3.163% of base) : System.Linq.Parallel.dasm - LastQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         490 (3.163% of base) : System.Linq.Parallel.dasm - TakeOrSkipQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         490 (3.163% of base) : System.Linq.Parallel.dasm - TakeOrSkipWhileQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         100 (0.345% of base) : System.Linq.Parallel.dasm - ConcatQueryOperator`1:WrapPartitionedStream(ref,ref,ref,bool,struct):this (49 methods)

Top method improvements by size (bytes):
        -442 (-1.724% of base) : System.Linq.Parallel.dasm - ILStubClass:IL_STUB_InstantiatingStub(ref,ref,bool,struct):this (221 methods)
        -205 (-50.000% of base) : System.Private.CoreLib.dasm - Marshal:AllocCoTaskMem(int):long (2 base, 1 diff methods)
        -144 (-50.000% of base) : System.Private.CoreLib.dasm - AnsiCharMarshaler:DoAnsiConversion(ref,bool,bool,byref):ref (2 base, 1 diff methods)
        -120 (-17.595% of base) : System.Collections.Immutable.dasm - ILStubClass:IL_STUB_InstantiatingStub(ref,struct):struct (6 methods)
        -106 (-50.000% of base) : System.Private.CoreLib.dasm - Buffer:Memcpy(long,int,ref,int,int) (2 base, 1 diff methods)

Top method regressions by size (percentage):
          55 (98.214% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PEParameterSymbol:Create(ref,ref,int,struct,byref):ref
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - VtblGap:Microsoft.Cci.ISignature.GetType(struct):ref:this
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - ArrayMethodParameterInfo:GetType(struct):ref:this
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - ArrayConstructor:GetType(struct):ref:this
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - ArraySet:GetType(struct):ref:this

Top method improvements by size (percentage):
        -106 (-50.000% of base) : System.Private.CoreLib.dasm - Buffer:Memcpy(long,int,ref,int,int) (2 base, 1 diff methods)
        -144 (-50.000% of base) : System.Private.CoreLib.dasm - AnsiCharMarshaler:DoAnsiConversion(ref,bool,bool,byref):ref (2 base, 1 diff methods)
        -205 (-50.000% of base) : System.Private.CoreLib.dasm - Marshal:AllocCoTaskMem(int):long (2 base, 1 diff methods)
         -21 (-35.593% of base) : Microsoft.DotNet.Cli.Utils.dasm - ILStubClass:IL_STUB_InstantiatingStub(struct):struct:this
         -23 (-27.711% of base) : System.Threading.Tasks.Dataflow.dasm - WriteOnceBlock`1:CloneItem(struct):struct:this

917 total methods with size differences (543 improved, 374 regressed), 200968 unchanged.

How this affects perf is harder to measure, however. Instead I will continue to see if I can improve this to only introduce necessary temps by computing the correct stack offsets of arguments and then using this information in the same way that the code used the argument positions before.

cc @jashook @AndyAyersMS @RussKeldorph

Set the stack offset during init of args so that we can use this info to determine whether it is necessary to introduce temps for fast tailcalls. Optimize the logic to use these to determine if PUTARG_STK nodes will override incoming args.

jakobbotsch · 2019-07-30T22:42:24Z

/azp run coreclr-outerloop

azure-pipelines · 2019-07-30T22:42:34Z

Azure Pipelines successfully started running 1 pipeline(s).

NinoFloris · 2019-07-31T11:28:57Z

@jakobbotsch could you easily add the F# compiler? It makes heavy use of explicit tail calls.

/cc: @cartermp

jakobbotsch · 2019-07-31T15:02:58Z

@NinoFloris there is some data collected over F# tests in dotnet/jitutils#213 (comment). Once this PR is in a working state I will try to collect this data for F# tests again. What would really help would be perf tests making heavy use of tail calls – does F# have any of those?

cartermp · 2019-07-31T15:42:02Z

@jakobbotsch We'll work to get a good repro.

adamsitnik · 2019-07-31T16:36:15Z

@cartermp @NinoFloris it would be great if you could add a new F# project with BenchmarkDotNet microbenchmarks to the performance repo. We are using it to continuously track the performance of the entire .NET Runtime for all hardware architectures and most commonly used OSes.

/cc @billwert

On Unix64 it is possible for us to get into cases where we need to move an argument but where we previously did not introduce a temp. This is a problem because codegen cannot handle overlapping disjoint struct copies. The case observed was (with all args on the stack): caller(S32 erStack1) -> callee(S16 eeStack1, S32 eeStack2) caller passing erStack1 as eeStack2, so it is needed to be moved 16 bytes ahead in the arg list. Additionally, the PUTARG_STK node for this argument ends up as the _first_ arg. Since there are no additional uses no temp was introduced before. Fix this by detecting the disjoint overlapping case and looking for uses of the arg from the current PUTARG_STK node's operand.

On Windows we can use the arg positions to much more simply find the args being overwritten.

This gives us access to the number of slots in PUTARG_STK

jashook · 2019-08-14T05:33:55Z

/azp run coreclr-ci

azure-pipelines · 2019-08-14T05:34:20Z

Azure Pipelines successfully started running 1 pipeline(s).

jakobbotsch · 2019-08-15T23:09:25Z

The only diffs I see on Windows x64/x86 are situations where we now insert the temps a bit earlier:

 N021 ( 22, 17) [000034] DA-XG-----L-              *  STORE_LCL_VAR ref    V06 tmp1         d:1                                                                    +               [000053] ------------                 START_NONGC void
+N001 (  3,  2) [000054] ------------        t54 =    LCL_VAR   int    V04 arg4
+                                                  /--*  t54    int
+N002 (  4,  3) [000056] DA----------              *  STORE_LCL_VAR int    V07 rat0
 N024 (  1,  1) [000014] ------------        t14 =    CNS_INT   int    0 $40
-N001 (  3,  2) [000053] ------------        t53 =    LCL_VAR   int    V04 arg4
-                                                  /--*  t53    int
-N002 (  4,  3) [000055] DA----------              *  STORE_LCL_VAR int    V07 rat0
-               [000056] ------------                 START_NONGC void
                                                   /--*  t14    int
                [000047] ------------              *  PUTARG_STK [+0x20] void
 N029 (  1,  1) [000035] ------------        t35 =    LCL_VAR   ref    V06 tmp1         u:1 (last use) $143
                                                   /--*  t35    ref
                [000048] ------------        t48 = *  PUTARG_REG ref    REG rdx
 N030 (  1,  1) [000000] ------------         t0 =    LCL_VAR   ref    V00 this         u:1 (last use) $80
                                                   /--*  t0     ref
                [000049] ------------        t49 = *  PUTARG_REG ref    REG rcx
 N031 (  1,  1) [000012] ------------        t12 =    LCL_VAR   ref    V03 arg3         u:1 (last use) $82
                                                   /--*  t12    ref
                [000050] ------------        t50 = *  PUTARG_REG ref    REG r8
 N032 (  3,  2) [000013] ------------        t13 =    LCL_VAR   int    V07 rat0          $100
                                                   /--*  t13    int
                [000051] ------------        t51 = *  PUTARG_REG int    REG r9
 N001 (  3, 10) [000052] ------------        t52 =    CNS_INT(h) long   0x7ffbdcbc07a0 ftn
 N001 (  3, 10) [000052] ------------        t52 =    CNS_INT(h) long   0x7ffbdcbb07a0 ftn
                                                   /--*  t48    ref    arg1 in rdx
                                                   +--*  t49    ref    this in rcx
                                                   +--*  t50    ref    arg2 in r8
                                                   +--*  t51    int    arg3 in r9
                                                   +--*  t52    long   control expr
 N037 ( 49, 33) [000015] --CXG-------              *  CALL      void   System.IO.StreamWriter..ctor $VN.Void

This leads to swapped order in some cases:

        mov      rdx, rax
-       xor      ecx, ecx
-       mov      r9d, ebx

 G_M5466_IG03:
+       mov      r9d, ebx
+       xor      ecx, ecx
        mov      dword ptr [rsp+60H], ecx
        mov      rcx, rdi
        mov      r8, rsi

 G_M5466_IG04:
        add      rsp, 32
        pop      rbx
        pop      rsi
        pop      rdi
        jmp      System.IO.StreamWriter:.ctor(ref,ref,int,bool):this

I will investigate the OSX failure next.

jakobbotsch · 2019-08-16T16:13:44Z

/azp run coreclr-ci

azure-pipelines · 2019-08-16T16:14:11Z

Azure Pipelines successfully started running 1 pipeline(s).

jakobbotsch · 2019-08-19T19:41:57Z

Subsumed by #26255 (I was not able to reopen this after force-pushing my branch)

Remove tailcall limitations on unix64 and arm64

Verified

This commit was signed with the committer’s verified signature.

oseoin

GPG key ID: E339AFC497D32569

Learn about vigilant mode

830679f

jakobbotsch force-pushed the tailcall-fix-limitations branch from 4622b4f to 830679f Compare July 30, 2019 04:40

jashook changed the title ~~Remove tailcall limitations on unix64 and arm64~~ [WIP] Remove tailcall limitations on unix64 and arm64 Jul 30, 2019

Optimize temp introduction logic

052b417

Set the stack offset during init of args so that we can use this info to determine whether it is necessary to introduce temps for fast tailcalls. Optimize the logic to use these to determine if PUTARG_STK nodes will override incoming args.

cartermp mentioned this pull request Jul 31, 2019

dotnet crash on macOS ionide/FsAutoComplete#356

Closed

jakobbotsch and others added 5 commits August 1, 2019 20:24

Fix LowerFastTailCall for Windows

5684995

On Windows we can use the arg positions to much more simply find the args being overwritten.

Add some logging

385fc95

Define FEATURE_PUT_STRUCT_ARG_STK on Arm64

2557b16

This gives us access to the number of slots in PUTARG_STK

Fix compiler errors in release

1ddea41

jashook mentioned this pull request Aug 14, 2019

Add ABI stress tests #26090

Merged

jakobbotsch closed this Aug 17, 2019

jakobbotsch reopened this Aug 17, 2019

jakobbotsch closed this Aug 19, 2019

jakobbotsch mentioned this pull request Aug 20, 2019

Remove tailcall limitations on unix64 and arm64 #26255

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Remove tailcall limitations on unix64 and arm64 #25932

[WIP] Remove tailcall limitations on unix64 and arm64 #25932

jakobbotsch commented Jul 30, 2019

jakobbotsch commented Jul 30, 2019

jakobbotsch commented Jul 30, 2019

azure-pipelines bot commented Jul 30, 2019

NinoFloris commented Jul 31, 2019

jakobbotsch commented Jul 31, 2019

cartermp commented Jul 31, 2019

adamsitnik commented Jul 31, 2019

jashook commented Aug 14, 2019

azure-pipelines bot commented Aug 14, 2019

jakobbotsch commented Aug 15, 2019 •

edited

Loading

jakobbotsch commented Aug 16, 2019

azure-pipelines bot commented Aug 16, 2019

jakobbotsch commented Aug 19, 2019

[WIP] Remove tailcall limitations on unix64 and arm64 #25932

[WIP] Remove tailcall limitations on unix64 and arm64 #25932

Conversation

jakobbotsch commented Jul 30, 2019

jakobbotsch commented Jul 30, 2019

jakobbotsch commented Jul 30, 2019

azure-pipelines bot commented Jul 30, 2019

NinoFloris commented Jul 31, 2019

jakobbotsch commented Jul 31, 2019

cartermp commented Jul 31, 2019

adamsitnik commented Jul 31, 2019

jashook commented Aug 14, 2019

azure-pipelines bot commented Aug 14, 2019

jakobbotsch commented Aug 15, 2019 • edited Loading

jakobbotsch commented Aug 16, 2019

azure-pipelines bot commented Aug 16, 2019

jakobbotsch commented Aug 19, 2019

jakobbotsch commented Aug 15, 2019 •

edited

Loading