-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recognize more "local addr" trees as invariant #68484
Conversation
This will ensure we do not evaluate e.g. the retbuffer into a temp when unnecessary.
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsThis will ensure we do not evaluate e.g. the retbuffer into a temp when I first used
|
Hmm, this does not really make sense now that I think about it. Will need to look again. |
I switched to |
@@ -22865,7 +22865,7 @@ bool GenTreeLclFld::IsOffsetMisaligned() const | |||
|
|||
bool GenTree::IsInvariant() const | |||
{ | |||
return OperIsConst() || Compiler::impIsAddressInLocal(this); | |||
return OperIsConst() || IsLocalAddrExpr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's notable that combined with #68469, this will bind IsLocalAddrExpr
"the simple version" to LocalAddressVisitor
like DefinesLocalAddr
and IsLocalAddrExpr
"the not simple version" are bound right now (which may also not be a new thing, it may have been like this already due to struct promotion requirements).
BTW, this is also the reason we must recognize ADD
s in all these functions, they're produced for structs with fields at offsets larger than GenTreeLclFld
(and the emitter) can represent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think even without #68469 there are/were non-trivial invariants between the two. While before this PR we 'supported' call morphing introducing a local store for the address, there was already an implicit assumption that no extra uses of this temp were ever created. So in a way we are just trading one assumption for another and have a slightly more explicit assert about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unsatisfying, but I am hard pressed to think about how to improve the situation.
Probably it would be ideal if we could unify DefinesLocalAddr
, the two variants of IsLocalAddrExpr
and what LocalAddressVisitor
is doing somehow, but that also seems difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, this is what #11057 is about :).
The original "north star" design was that we'd only have STORE_LCL_VAR|FLD
for tracked locals, this return buffer optimization makes that "hard" unfortunately, but even with it (assuming I won't find a way to represent it as a store after all), we'd have:
DefinesLocalAddr
becomes equivalent toOperIsLocalAddr
.- All its variations (
IsLocalAddrExpr
"simple" and "not simple", as well asfgIsIndirOfAddrOfLocal
) are deleted. LocalAddressVisitor
transforms allASG
s intoSTORE_LCL_VAR|FLD
, and for the return buffer optimization,LCL_VAR|FLD_ADDR
.- All indirect uses become
LCL_VAR|FLD
(this is Remove theOBJ(ADDR)
wrapping created for calls by the importer #51569)
As a point-in-time improvement, the "not simple" IsLocalAddrExpr
will be deleted once/if physical VN changes are in.
cc @dotnet/jit-contrib |
The diffs are largely from better staging of arguments to calls? Can you share an example or two? |
Yes, previously we would end up creating temps for a lot of common address trees when a later argument required spilling previous arguments. In particular we did not recognize diff --git "a/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\base\\15134.dasm" "b/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\diff\\15134.dasm"
index fb40ace..a78f69d 100644
--- "a/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\base\\15134.dasm"
+++ "b/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\diff\\15134.dasm"
@@ -4,49 +4,46 @@
; rbp based frame
; partially interruptible
; No matching PGO data
; Final local variable assignments
;
; V00 OutArgs [V00 ] ( 1, 1 ) lclBlk (32) [rsp+00H] do-not-enreg[] "OutgoingArgSpace"
; V01 tmp1 [V01 ] ( 1, 1 ) struct (72) [rbp-58H] do-not-enreg[XS] must-init addr-exposed "NewObj constructor temp"
; V02 tmp2 [V02 ] ( 1, 1 ) struct (24) [rbp-70H] do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
; V03 tmp3 [V03 ] ( 1, 1 ) struct (24) [rbp-88H] do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
; V04 tmp4 [V04 ] ( 1, 1 ) struct (24) [rbp-A0H] do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
-; V05 tmp5 [V05 ] ( 1, 1 ) long -> [rbp-A8H] do-not-enreg[] "argument with side effect"
;
-; Lcl frame size = 192
+; Lcl frame size = 176
G_M9930_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rbp
push rdi
push rsi
- sub rsp, 192
+ sub rsp, 176
vzeroupper
- lea rbp, [rsp+D0H]
+ lea rbp, [rsp+C0H]
vxorps xmm4, xmm4
mov rax, -144
vmovdqa xmmword ptr [rbp+rax-10H], xmm4
vmovdqa xmmword ptr [rax+rbp], xmm4
vmovdqa xmmword ptr [rbp+rax+10H], xmm4
add rax, 48
jne SHORT -5 instr
;; size=62 bbWeight=1 PerfScore 12.58
G_M9930_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
- lea rcx, [rbp-58H]
- mov qword ptr [rbp-A8H], rcx
mov rcx, 0xD1FFAB1E
mov edx, 232
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
mov rcx, 0xD1FFAB1E
mov rcx, gword ptr [rcx]
; gcrRegs +[rcx]
- ;; size=44 bbWeight=1 PerfScore 5.25
+ ;; size=33 bbWeight=1 PerfScore 3.75
G_M9930_IG03: ; , nogc, extend
vmovdqu xmm0, xmmword ptr [rcx+8]
vmovdqu xmmword ptr [rbp-70H], xmm0
mov rdx, qword ptr [rcx+24]
mov qword ptr [rbp-60H], rdx
;; size=20 bbWeight=1 PerfScore 8.00
G_M9930_IG04: ; , extend
mov rcx, 0xD1FFAB1E
; gcrRegs -[rcx]
mov edx, 232
@@ -77,58 +74,58 @@ G_M9930_IG07: ; , nogc, extend
mov qword ptr [rbp-90H], r8
;; size=26 bbWeight=1 PerfScore 8.00
G_M9930_IG08: ; , extend
lea rdx, bword ptr [rbp-70H]
; gcrRegs -[rdx]
; byrRegs +[rdx]
lea r8, bword ptr [rbp-88H]
; byrRegs +[r8]
lea r9, bword ptr [rbp-A0H]
; byrRegs +[r9]
- mov rcx, qword ptr [rbp-A8H]
+ lea rcx, [rbp-58H]
call [hackishModuleName:hackishMethodName(Microsoft.CodeAnalysis.SymbolInfo,Microsoft.CodeAnalysis.SymbolInfo,Microsoft.CodeAnalysis.SymbolInfo):this]
; byrRegs -[rdx r8-r9]
mov rax, 0xD1FFAB1E
mov rdi, gword ptr [rax]
; gcrRegs +[rdi]
add rdi, 8
; gcrRegs -[rdi]
; byrRegs +[rdi]
lea rsi, bword ptr [rbp-58H]
; byrRegs +[rsi]
call CORINFO_HELP_ASSIGN_BYREF
movsq
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
call CORINFO_HELP_ASSIGN_BYREF
nop
- ;; size=89 bbWeight=1 PerfScore 17.75
+ ;; size=86 bbWeight=1 PerfScore 17.25
G_M9930_IG09: ; , epilog, nogc, extend
- add rsp, 192
+ add rsp, 176
pop rsi
pop rdi
pop rbp
ret
or e.g. vxorps xmm0, xmm0
vmovdqu xmmword ptr [rsp+100H], xmm0
mov dword ptr [rsp+110H], r8d
- lea r8, [rsp+100H]
- mov dword ptr [rsp+48H], ecx
- mov dword ptr [rsp+4CH], edx
+ mov dword ptr [rsp+48H], edx
+ mov dword ptr [rsp+4CH], ecx
mov qword ptr [rsp+50H], rax
- mov rcx, r8
lea rdx, bword ptr [rsp+48H]
; byrRegs +[rdx]
+ lea rcx, [rsp+100H]
call [hackishModuleName:hackishMethodName(System.Decimal):this] The regressions seem to be worse register allocation due to the movement of those trees: ; V16 tmp10 [V16 ] ( 4, 8 ) struct (16) [rsp+28H] do-not-enreg[XSF] must-init addr-exposed "by-value struct argument"
-;* V17 tmp11 [V17,T08] ( 0, 0 ) ref -> zero-ref single-def "argument with side effect"
-; V18 tmp12 [V18,T04] ( 2, 4 ) long -> rdx "argument with side effect"
+;* V17 tmp11 [V17,T07] ( 0, 0 ) ref -> zero-ref single-def "argument with side effect"
;
; Lcl frame size = 72
@@ -56,19 +55,19 @@ G_M42744_IG02:
;; size=23 bbWeight=1 PerfScore 5.25
G_M42744_IG03:
mov ecx, ebx
- mov edx, ebp
- add rcx, rdx
- mov edx, dword ptr [rsi+8]
- cmp rcx, rdx
+ mov r8d, ebp
+ add rcx, r8
+ mov r8d, dword ptr [rsi+8]
+ cmp rcx, r8
ja SHORT G_M42744_IG07
- ;; size=15 bbWeight=0.50 PerfScore 2.00
+ ;; size=17 bbWeight=0.50 PerfScore 2.00
G_M42744_IG04:
- lea rdx, [rsp+38H]
mov gword ptr [rsp+28H], rsi
mov dword ptr [rsp+30H], ebx
mov dword ptr [rsp+34H], ebp
mov rcx, rdi
lea r8, bword ptr [rsp+28H]
+ lea rdx, [rsp+38H]
mov r9, gword ptr [rsp+90H]
call [System.Security.Cryptography.CryptoStream:ReadAsyncInternal(System.Memory`1[Byte],System.Threading.CancellationToken):System.Threading.Tasks.ValueTask`1[Int32]:this]
lea rcx, [rsp+38H]
@@ -97,11 +96,11 @@ G_M42744_IG07:
int3
;; size=7 bbWeight=0 PerfScore 0.00
-; Total bytes of code 162, prolog size 30, PerfScore 38.90, instruction count 53, allocated bytes for code 164 (MethodHash=8dce5907) for method System.Security.Cryptography.CryptoStream:ReadAsync(System.Byte[],int,int,System.Threading.CancellationToken):System.Threading.Tasks.Task`1[Int32]:this
+; Total bytes of code 164, prolog size 30, PerfScore 39.10, instruction count 53, allocated bytes for code 166 (MethodHash=8dce5907) for method System.Security.Cryptography.CryptoStream:ReadAsync(System.Byte[],int,int,System.Threading.CancellationToken):System.Threading.Tasks.Task`1[Int32]:this In some of the bigger regressions we end up with new spills due to it. I think the aspnet improvements are so much bigger because we have tier-0 contexts there and avoiding the creation of these temps is much more impactful when the temps are not tracked. |
Here's another regression where we now need to zero a smaller area of stack and we end up with a bunch of extra instructions due to it: ; Assembly listing for method RuntimeCustomAttributeData:get_ConstructorArguments():IList`1:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00 ] ( 1, 1 ) ref -> [rbp+10H] do-not-enreg[] this class-hnd
; V01 loc0 [V01 ] ( 1, 1 ) ref -> [rbp-40H] do-not-enreg[] must-init class-hnd exact
; V02 loc1 [V02 ] ( 1, 1 ) int -> [rbp-44H] do-not-enreg[]
; V03 loc2 [V03 ] ( 1, 1 ) struct (48) [rbp-78H] do-not-enreg[XS] must-init addr-exposed
; V04 OutArgs [V04 ] ( 1, 1 ) lclBlk (32) [rsp+00H] do-not-enreg[] "OutgoingArgSpace"
; V05 tmp1 [V05 ] ( 1, 1 ) struct (16) [rbp-88H] do-not-enreg[XS] must-init addr-exposed "NewObj constructor temp"
; V06 tmp2 [V06 ] ( 1, 1 ) int -> [rbp-90H] do-not-enreg[X] addr-exposed "patchpoint counter"
; V07 tmp3 [V07 ] ( 1, 1 ) struct (48) [rbp-C0H] do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
-; V08 tmp4 [V08 ] ( 1, 1 ) long -> [rbp-C8H] do-not-enreg[] "argument with side effect"
-; V09 tmp5 [V09 ] ( 1, 1 ) ref -> [rbp-D0H] do-not-enreg[] must-init "argument with side effect"
+; V08 tmp4 [V08 ] ( 1, 1 ) ref -> [rbp-C8H] do-not-enreg[] must-init "argument with side effect"
;
; Lcl frame size = 224
G_M33983_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rbp
push rdi
push rsi
sub rsp, 224
vzeroupper
lea rbp, [rsp+F0H]
+ xor eax, eax
+ mov qword ptr [rbp-C8H], rax
vxorps xmm4, xmm4
- mov rax, -144
+ vmovdqa xmmword ptr [rbp-C0H], xmm4
+ vmovdqa xmmword ptr [rbp-B0H], xmm4
+ mov rax, -96
vmovdqa xmmword ptr [rbp+rax-40H], xmm4
vmovdqa xmmword ptr [rbp+rax-30H], xmm4
vmovdqa xmmword ptr [rbp+rax-20H], xmm4
add rax, 48
jne SHORT -5 instr
mov qword ptr [rbp-40H], rax
mov gword ptr [rbp+10H], rcx Might be worth it to zero more than strictly necessary to avoid handling alignment? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I'm not all that worried about the regressions, any time there different/fewer temps there are ripple effects.
This will ensure we do not evaluate e.g. the retbuffer into a temp when
unnecessary.
I first usedIsLocalAddrExpr
here but it did not bring any extra diffs and it was much more expensive.