Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Remove tailcall limitations on unix64 and arm64 #25932

Closed

Conversation

jakobbotsch
Copy link
Member

No description provided.

Verified

This commit was signed with the committer’s verified signature.
@jakobbotsch jakobbotsch force-pushed the tailcall-fix-limitations branch from 4622b4f to 830679f Compare July 30, 2019 04:40
@jashook jashook changed the title Remove tailcall limitations on unix64 and arm64 [WIP] Remove tailcall limitations on unix64 and arm64 Jul 30, 2019
@jakobbotsch
Copy link
Member Author

The basic problem is the following: PUTARG_STK will write values into the arg space area which, for fast tail calls, is the area of the incoming args. A PUTARG_STK may therefore end up overwrite (parts of) an argument that is used later in the call sequence. On Windows the ABI allows us to know that, if we are doing a PUTARG_STK for argument at index i, then this will overwrite the ith argument to the caller. Thus it is easy to introduce a temp for the ith arg if it is used later than the PUTARG_STK.

On Unix this assumption about the ABI does not hold. The argument position does not tell us anything about where the argument will go on the stack; for instance, void foo(S32 a, int b) and void foo(int b, S32 a) will both have a in the first 4 stack slots and b in a register.

This PR fixes this in a simple way: just introduce temps for any stack arg with uses before tailcalls. The harder way is to figure out if they are read after they would be overwritten by a PUTARG_STK and only introduce a temp in that situation. This is what the previous Windows implementation did based on the argument position.

Generally this might introduce copies in cases where there previously would have been none. However since we now allow tailcalls in more situations, this still turns out to be a net win space wise on unix64 (following is PMI over frameworks):

Found 38 files with textual diffs.

Summary:
(Lower is better)

Total bytes of diff: -296 (-0.001% of base)
    diff is an improvement.

Top file regressions by size (bytes):
        1547 : System.Linq.Parallel.dasm (0.089% of base)
         335 : System.Data.Common.dasm (0.022% of base)
         262 : Microsoft.CodeAnalysis.dasm (0.014% of base)
          44 : System.Numerics.Vectors.dasm (0.089% of base)
          42 : System.Threading.Tasks.Parallel.dasm (0.025% of base)

Top file improvements by size (bytes):
       -1002 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.017% of base)
        -598 : Microsoft.CodeAnalysis.CSharp.dasm (-0.013% of base)
        -516 : System.Private.CoreLib.dasm (-0.011% of base)
        -113 : System.Collections.Immutable.dasm (-0.010% of base)
         -49 : System.Reflection.Metadata.dasm (-0.011% of base)

35 total files with size differences (21 improved, 14 regressed), 94 unchanged.

Top method regressions by size (bytes):
         490 (3.163% of base) : System.Linq.Parallel.dasm - FirstQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         490 (3.163% of base) : System.Linq.Parallel.dasm - LastQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         490 (3.163% of base) : System.Linq.Parallel.dasm - TakeOrSkipQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         490 (3.163% of base) : System.Linq.Parallel.dasm - TakeOrSkipWhileQueryOperator`1:WrapPartitionedStream(ref,ref,bool,struct):this (49 methods)
         100 (0.345% of base) : System.Linq.Parallel.dasm - ConcatQueryOperator`1:WrapPartitionedStream(ref,ref,ref,bool,struct):this (49 methods)

Top method improvements by size (bytes):
        -442 (-1.724% of base) : System.Linq.Parallel.dasm - ILStubClass:IL_STUB_InstantiatingStub(ref,ref,bool,struct):this (221 methods)
        -205 (-50.000% of base) : System.Private.CoreLib.dasm - Marshal:AllocCoTaskMem(int):long (2 base, 1 diff methods)
        -144 (-50.000% of base) : System.Private.CoreLib.dasm - AnsiCharMarshaler:DoAnsiConversion(ref,bool,bool,byref):ref (2 base, 1 diff methods)
        -120 (-17.595% of base) : System.Collections.Immutable.dasm - ILStubClass:IL_STUB_InstantiatingStub(ref,struct):struct (6 methods)
        -106 (-50.000% of base) : System.Private.CoreLib.dasm - Buffer:Memcpy(long,int,ref,int,int) (2 base, 1 diff methods)

Top method regressions by size (percentage):
          55 (98.214% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PEParameterSymbol:Create(ref,ref,int,struct,byref):ref
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - VtblGap:Microsoft.Cci.ISignature.GetType(struct):ref:this
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - ArrayMethodParameterInfo:GetType(struct):ref:this
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - ArrayConstructor:GetType(struct):ref:this
          35 (36.082% of base) : Microsoft.CodeAnalysis.dasm - ArraySet:GetType(struct):ref:this

Top method improvements by size (percentage):
        -106 (-50.000% of base) : System.Private.CoreLib.dasm - Buffer:Memcpy(long,int,ref,int,int) (2 base, 1 diff methods)
        -144 (-50.000% of base) : System.Private.CoreLib.dasm - AnsiCharMarshaler:DoAnsiConversion(ref,bool,bool,byref):ref (2 base, 1 diff methods)
        -205 (-50.000% of base) : System.Private.CoreLib.dasm - Marshal:AllocCoTaskMem(int):long (2 base, 1 diff methods)
         -21 (-35.593% of base) : Microsoft.DotNet.Cli.Utils.dasm - ILStubClass:IL_STUB_InstantiatingStub(struct):struct:this
         -23 (-27.711% of base) : System.Threading.Tasks.Dataflow.dasm - WriteOnceBlock`1:CloneItem(struct):struct:this

917 total methods with size differences (543 improved, 374 regressed), 200968 unchanged.

How this affects perf is harder to measure, however. Instead I will continue to see if I can improve this to only introduce necessary temps by computing the correct stack offsets of arguments and then using this information in the same way that the code used the argument positions before.

cc @jashook @AndyAyersMS @RussKeldorph

Set the stack offset during init of args so that we can use this info
to determine whether it is necessary to introduce temps for fast
tailcalls. Optimize the logic to use these to determine if PUTARG_STK
nodes will override incoming args.
@jakobbotsch
Copy link
Member Author

/azp run coreclr-outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@NinoFloris
Copy link

@jakobbotsch could you easily add the F# compiler? It makes heavy use of explicit tail calls.

/cc: @cartermp

@jakobbotsch
Copy link
Member Author

@NinoFloris there is some data collected over F# tests in dotnet/jitutils#213 (comment). Once this PR is in a working state I will try to collect this data for F# tests again. What would really help would be perf tests making heavy use of tail calls – does F# have any of those?

@cartermp
Copy link

@jakobbotsch We'll work to get a good repro.

@adamsitnik
Copy link
Member

@cartermp @NinoFloris it would be great if you could add a new F# project with BenchmarkDotNet microbenchmarks to the performance repo. We are using it to continuously track the performance of the entire .NET Runtime for all hardware architectures and most commonly used OSes.

/cc @billwert

jakobbotsch and others added 5 commits August 1, 2019 20:24
On Unix64 it is possible for us to get into cases where we need to move
an argument but where we previously did not introduce a temp. This is a
problem because codegen cannot handle overlapping disjoint struct
copies. The case observed was (with all args on the stack):
caller(S32 erStack1) -> callee(S16 eeStack1, S32 eeStack2)
caller passing erStack1 as eeStack2, so it is needed to be moved 16
bytes ahead in the arg list. Additionally, the PUTARG_STK node for this
argument ends up as the _first_ arg. Since there are no additional uses
no temp was introduced before.

Fix this by detecting the disjoint overlapping case and looking for uses
of the arg from the current PUTARG_STK node's operand.
On Windows we can use the arg positions to much more simply find the
args being overwritten.
This gives us access to the number of slots in PUTARG_STK
@jashook jashook mentioned this pull request Aug 14, 2019
@jashook
Copy link

jashook commented Aug 14, 2019

/azp run coreclr-ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Aug 15, 2019

The only diffs I see on Windows x64/x86 are situations where we now insert the temps a bit earlier:

 N021 ( 22, 17) [000034] DA-XG-----L-              *  STORE_LCL_VAR ref    V06 tmp1         d:1                                                                    +               [000053] ------------                 START_NONGC void
+N001 (  3,  2) [000054] ------------        t54 =    LCL_VAR   int    V04 arg4
+                                                  /--*  t54    int
+N002 (  4,  3) [000056] DA----------              *  STORE_LCL_VAR int    V07 rat0
 N024 (  1,  1) [000014] ------------        t14 =    CNS_INT   int    0 $40
-N001 (  3,  2) [000053] ------------        t53 =    LCL_VAR   int    V04 arg4
-                                                  /--*  t53    int
-N002 (  4,  3) [000055] DA----------              *  STORE_LCL_VAR int    V07 rat0
-               [000056] ------------                 START_NONGC void
                                                   /--*  t14    int
                [000047] ------------              *  PUTARG_STK [+0x20] void
 N029 (  1,  1) [000035] ------------        t35 =    LCL_VAR   ref    V06 tmp1         u:1 (last use) $143
                                                   /--*  t35    ref
                [000048] ------------        t48 = *  PUTARG_REG ref    REG rdx
 N030 (  1,  1) [000000] ------------         t0 =    LCL_VAR   ref    V00 this         u:1 (last use) $80
                                                   /--*  t0     ref
                [000049] ------------        t49 = *  PUTARG_REG ref    REG rcx
 N031 (  1,  1) [000012] ------------        t12 =    LCL_VAR   ref    V03 arg3         u:1 (last use) $82
                                                   /--*  t12    ref
                [000050] ------------        t50 = *  PUTARG_REG ref    REG r8
 N032 (  3,  2) [000013] ------------        t13 =    LCL_VAR   int    V07 rat0          $100
                                                   /--*  t13    int
                [000051] ------------        t51 = *  PUTARG_REG int    REG r9
 N001 (  3, 10) [000052] ------------        t52 =    CNS_INT(h) long   0x7ffbdcbc07a0 ftn
 N001 (  3, 10) [000052] ------------        t52 =    CNS_INT(h) long   0x7ffbdcbb07a0 ftn
                                                   /--*  t48    ref    arg1 in rdx
                                                   +--*  t49    ref    this in rcx
                                                   +--*  t50    ref    arg2 in r8
                                                   +--*  t51    int    arg3 in r9
                                                   +--*  t52    long   control expr
 N037 ( 49, 33) [000015] --CXG-------              *  CALL      void   System.IO.StreamWriter..ctor $VN.Void

This leads to swapped order in some cases:

        mov      rdx, rax
-       xor      ecx, ecx
-       mov      r9d, ebx

 G_M5466_IG03:
+       mov      r9d, ebx
+       xor      ecx, ecx
        mov      dword ptr [rsp+60H], ecx
        mov      rcx, rdi
        mov      r8, rsi

 G_M5466_IG04:
        add      rsp, 32
        pop      rbx
        pop      rsi
        pop      rdi
        jmp      System.IO.StreamWriter:.ctor(ref,ref,int,bool):this

I will investigate the OSX failure next.

@jakobbotsch
Copy link
Member Author

/azp run coreclr-ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Copy link
Member Author

Subsumed by #26255 (I was not able to reopen this after force-pushing my branch)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants