[WIP] Remove tailcall limitations on unix64 and arm64 #25932
4622b4f
to
830679f
Compare
The basic problem is the following:
On Unix this assumption about the ABI does not hold. The argument position does not tell us anything about where the argument will go on the stack; for instance,
This PR fixes this in a simple way: just introduce temps for any stack arg with uses before tailcalls. The harder way is to figure out if they are read after they would be overwritten by a
Generally this might introduce copies in cases where there previously would have been none. However, since we now allow tailcalls in more situations, this still turns out to be a net win space-wise on unix64 (following is PMI over frameworks):
How this affects perf is harder to measure, however, so I will instead continue to see whether I can improve this to introduce only the necessary temps: compute the correct stack offsets of arguments, then use that information in the same way the code previously used the argument positions.
Set the stack offset during init of args so that we can use this info to determine whether it is necessary to introduce temps for fast tailcalls. Optimize the logic to use these offsets to determine whether PUTARG_STK nodes will overwrite incoming args.
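To illustrate the idea in this commit message, here is a minimal sketch of an offset-based overlap check. It is not the JIT's code: `StackArg`, `assign_stack_offsets`, and `incoming_args_overwritten` are hypothetical names, and the layout rule (each stack arg starts at the next 8-byte slot boundary, no register args, no extra alignment) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackArg:
    name: str
    offset: int  # byte offset from the start of the incoming stack arg area
    size: int    # size in bytes


def assign_stack_offsets(sizes, slot_size=8):
    """Assign offsets in declaration order (assumed layout: each arg is
    rounded up to a whole number of slots; real ABIs have more rules)."""
    args = []
    offset = 0
    for name, size in sizes:
        args.append(StackArg(name, offset, size))
        offset += -(-size // slot_size) * slot_size  # round up to slot size
    return args


def ranges_overlap(a_off, a_size, b_off, b_size):
    """True if the half-open byte ranges [off, off + size) intersect."""
    return a_off < b_off + b_size and b_off < a_off + a_size


def incoming_args_overwritten(store_offset, store_size, incoming):
    """Names of incoming stack args that a store of store_size bytes
    at store_offset would clobber."""
    return [a.name for a in incoming
            if ranges_overlap(store_offset, store_size, a.offset, a.size)]
```

With offsets known up front, a temp is only needed for an incoming arg that is both clobbered by some outgoing store and still read afterwards, rather than for every stack arg that has a use.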
/azp run coreclr-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
@jakobbotsch could you easily add the F# compiler? It makes heavy use of explicit tail calls. /cc: @cartermp
@NinoFloris there is some data collected over F# tests in dotnet/jitutils#213 (comment). Once this PR is in a working state I will try to collect this data for F# tests again. What would really help would be perf tests making heavy use of tail calls – does F# have any of those?
@jakobbotsch We'll work to get a good repro.
@cartermp @NinoFloris it would be great if you could add a new F# project with BenchmarkDotNet microbenchmarks to the performance repo. We are using it to continuously track the performance of the entire .NET Runtime for all hardware architectures and most commonly used OSes. /cc @billwert
On Unix64 it is possible for us to get into cases where we need to move an argument but where we previously did not introduce a temp. This is a problem because codegen cannot handle overlapping disjoint struct copies. The case observed was (with all args on the stack):
caller(S32 erStack1) -> callee(S16 eeStack1, S32 eeStack2)
The caller passes erStack1 as eeStack2, so it needs to be moved 16 bytes ahead in the arg list. Additionally, the PUTARG_STK node for this argument ends up as the _first_ arg. Since there are no additional uses, no temp was introduced before. Fix this by detecting the disjoint overlapping case and looking for uses of the arg from the current PUTARG_STK node's operand.
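The case in this commit message can be worked through with byte ranges. The sketch below assumes that "disjoint overlapping" means the source and destination ranges intersect without being identical; that reading, and the helper name `is_partial_self_overlap`, are assumptions rather than anything taken from the JIT sources.

```python
def is_partial_self_overlap(src_offset, dst_offset, size):
    """True when copying `size` bytes from src_offset to dst_offset touches
    overlapping but non-identical ranges, so the copy needs a temp."""
    return src_offset != dst_offset and abs(src_offset - dst_offset) < size


# caller(S32 erStack1): erStack1 occupies bytes [0, 32) of the incoming area.
# callee(S16 eeStack1, S32 eeStack2): eeStack2 occupies bytes [16, 48).
# Passing erStack1 as eeStack2 copies [0, 32) onto [16, 48); they share [16, 32).
needs_temp = is_partial_self_overlap(src_offset=0, dst_offset=16, size=32)
```

A copy onto the same offset (nothing moves) or onto a range at least `size` bytes away does not trigger the check, which matches the intent of only adding a temp where the struct copy would read bytes it has already overwritten.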
On Windows we can use the arg positions to much more simply find the args being overwritten.
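For contrast, here is a rough sketch of the position-based check this commit message refers to. On Windows x64 every argument takes exactly one 8-byte slot, so outgoing position N overwrites incoming position N. The event encoding and the name `args_needing_temps` are hypothetical and only illustrate the "read after overwritten" test.

```python
def args_needing_temps(events):
    """events is an ordered list of ("store", position) for an outgoing
    stack write or ("use", position) for a read of an incoming arg.
    Returns positions of incoming args read after being overwritten."""
    overwritten = set()
    need_temp = []
    for kind, position in events:
        if kind == "store":
            overwritten.add(position)
        elif kind == "use" and position in overwritten and position not in need_temp:
            need_temp.append(position)
    return need_temp


# Slot 4 is written first and read afterwards, so arg 4 needs a temp.
# Slot 5 is read before it is written, so arg 5 does not.
example = args_needing_temps([("store", 4), ("use", 4), ("use", 5), ("store", 5)])
```

This appears consistent with the IR dump in this thread, where arg4 is copied into the temp rat0 before the PUTARG_STK at [+0x20], which is slot 4 with 8-byte slots.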
This gives us access to the number of slots in PUTARG_STK
/azp run coreclr-ci
Azure Pipelines successfully started running 1 pipeline(s).
The only diffs I see on Windows x64/x86 are situations where we now insert the temps a bit earlier:
N021 ( 22, 17) [000034] DA-XG-----L- * STORE_LCL_VAR ref V06 tmp1 d:1
+ [000053] ------------ START_NONGC void
+N001 ( 3, 2) [000054] ------------ t54 = LCL_VAR int V04 arg4
+ /--* t54 int
+N002 ( 4, 3) [000056] DA---------- * STORE_LCL_VAR int V07 rat0
N024 ( 1, 1) [000014] ------------ t14 = CNS_INT int 0 $40
-N001 ( 3, 2) [000053] ------------ t53 = LCL_VAR int V04 arg4
- /--* t53 int
-N002 ( 4, 3) [000055] DA---------- * STORE_LCL_VAR int V07 rat0
- [000056] ------------ START_NONGC void
/--* t14 int
[000047] ------------ * PUTARG_STK [+0x20] void
N029 ( 1, 1) [000035] ------------ t35 = LCL_VAR ref V06 tmp1 u:1 (last use) $143
/--* t35 ref
[000048] ------------ t48 = * PUTARG_REG ref REG rdx
N030 ( 1, 1) [000000] ------------ t0 = LCL_VAR ref V00 this u:1 (last use) $80
/--* t0 ref
[000049] ------------ t49 = * PUTARG_REG ref REG rcx
N031 ( 1, 1) [000012] ------------ t12 = LCL_VAR ref V03 arg3 u:1 (last use) $82
/--* t12 ref
[000050] ------------ t50 = * PUTARG_REG ref REG r8
N032 ( 3, 2) [000013] ------------ t13 = LCL_VAR int V07 rat0 $100
/--* t13 int
[000051] ------------ t51 = * PUTARG_REG int REG r9
N001 ( 3, 10) [000052] ------------ t52 = CNS_INT(h) long 0x7ffbdcbc07a0 ftn
N001 ( 3, 10) [000052] ------------ t52 = CNS_INT(h) long 0x7ffbdcbb07a0 ftn
/--* t48 ref arg1 in rdx
+--* t49 ref this in rcx
+--* t50 ref arg2 in r8
+--* t51 int arg3 in r9
+--* t52 long control expr
N037 ( 49, 33) [000015] --CXG------- * CALL void System.IO.StreamWriter..ctor $VN.Void
This leads to swapped order in some cases:
mov rdx, rax
- xor ecx, ecx
- mov r9d, ebx
G_M5466_IG03:
+ mov r9d, ebx
+ xor ecx, ecx
mov dword ptr [rsp+60H], ecx
mov rcx, rdi
mov r8, rsi
G_M5466_IG04:
add rsp, 32
pop rbx
pop rsi
pop rdi
jmp System.IO.StreamWriter:.ctor(ref,ref,int,bool):this
I will investigate the OSX failure next.
/azp run coreclr-ci
Azure Pipelines successfully started running 1 pipeline(s).
Subsumed by #26255 (I was not able to reopen this after force-pushing my branch)