-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate duplicate zero initializations more aggressively. #31960
Conversation
Framework diffs:
|
Sample diff from Microsoft.CodeAnalysis.dll: ; Assembly listing for method ReversedEnumeratorImpl:.ctor(byref):this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 3, 3 ) ref -> rdi this class-hnd
; V01 arg1 [V01,T01] ( 3, 3 ) byref -> rdx
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
-; V03 tmp1 [V03 ] ( 3, 6 ) struct (56) [rsp+0x20] do-not-enreg[XSB] must-init addr-exposed "NewObj constructor temp"
+; V03 tmp1 [V03 ] ( 2, 4 ) struct (56) [rsp+0x20] do-not-enreg[XSB] must-init addr-exposed "NewObj constructor temp"
;
; Lcl frame size = 88
G_M63972_IG01:
push rdi
push rsi
sub rsp, 88
- vzeroupper
mov rsi, rcx
lea rdi, [rsp+20H]
mov ecx, 14
xor rax, rax
rep stosd
mov rcx, rsi
mov rdi, rcx
- ;; bbWeight=1 PerfScore 30.00
+ ;; bbWeight=1 PerfScore 29.00
G_M63972_IG02:
- xor ecx, ecx
- vxorps xmm0, xmm0
- vmovdqu xmmword ptr [rsp+20H], xmm0
- vmovdqu xmmword ptr [rsp+30H], xmm0
- vmovdqu xmmword ptr [rsp+40H], xmm0
- mov qword ptr [rsp+50H], rcx
lea rcx, bword ptr [rsp+20H]
call Enumerator:.ctor(byref):this
add rdi, 8
lea rsi, bword ptr [rsp+20H]
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
movsq
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
- ;; bbWeight=1 PerfScore 13.83
+ ;; bbWeight=1 PerfScore 9.25
G_M63972_IG03:
add rsp, 88
pop rsi
pop rdi
ret
;; bbWeight=1 PerfScore 2.25
-; Total bytes of code 113, prolog size 29, PerfScore 57.78, (MethodHash=e622061c) for method ReversedEnumeratorImpl:.ctor(byref):this
+; Total bytes of code 81, prolog size 26, PerfScore 48.60, (MethodHash=e622061c) for method ReversedEnumeratorImpl:.ctor(byref):this
|
@dotnet/jit-contrib @AndyAyersMS PTAL |
Benchmark diffs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement!
|
||
newStmt = gtNewStmt(tree, callILOffset); | ||
fgInsertStmtAfter(block, afterStmt, newStmt); | ||
afterStmt = newStmt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add jitdump here for the case where the local is used by we decide we don't need to zero init?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
src/coreclr/src/jit/flowgraph.cpp
Outdated
@@ -23579,10 +23579,14 @@ Statement* Compiler::fgInlinePrependStatements(InlineInfo* inlineInfo) | |||
|
|||
CORINFO_METHOD_INFO* InlineeMethodInfo = InlineeCompiler->info.compMethodInfo; | |||
|
|||
unsigned lclCnt = InlineeMethodInfo->locals.numArgs; | |||
unsigned lclCnt = InlineeMethodInfo->locals.numArgs; | |||
bool bbInALoop = (block->bbFlags & BBF_BACKWARD_JUMP) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We set BBF_BACKWARD_JUMP
very early, just based on lexical IL. So it may encompass non-loop blocks (eg loop with a branch that leads to a return
).
We may also optimize away backwards branches (probably rare, but not impossible).
I wonder how often we end up with extra zeroing from BBF_BACKWARD_JUMP
being approximate? Any Idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't know. I guess I can try to allow this optimization for BBJ_RETURN
basic blocks in root methods since we can execute those blocks only once even in a loop. That won't account for all cases you are asking about but I don't see an easy way to capture those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To estimate how often this happens, you could perhaps set a new BBF in blocks where we add zero inits. Then later when we run some more robust form of loop detection, see how many non-loop blocks have this BBF.
0a80c31
to
b9a2f2b
Compare
src/coreclr/src/jit/flowgraph.cpp
Outdated
if (tmpNum != BAD_VAR_NUM) | ||
{ | ||
if (!fgVarNeedsExplicitZeroInit(&lvaTable[tmpNum], bbInALoop)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: please use lvaGetDesc(varNum)
instead of &lvaTable[tmpNum]
here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
We have 3 places in the code where we may need to insert explicit zero initialization. In some cases the initialization may become redundant with prolog initialization so we have logic to avoid explicit initialization in some cases. This change makes the logic less conservative. 1. If the variable is not being initialized in a loop, we can avoid explicit zero initialization if - the variable is a gc pointer, or - the variable is a struct with gc pointer fields and either all fields are gc pointer fields or the struct is big enough to guarantee block initialization, or - compInitMem is set and the variable has a long lifetime or has gc fields. Before this change we always inserted explicit zero initalization when compInitMem wasn't set and the variable was a gc pointer or a struct with gc fields. This relaxes that as described above. 2. When we were allocating structs via newobj, we were always inserting explicit zero initializations when importing an inlinee. This change applies the same optimization if the basic block can't be in a loop after inlining. 3. When we were initializing inlinee locals, we were only optimizing explicit zero initializations for structs; however, the same logic applies to non-structs. 4. If the initialization basic block is in a loop but is BBJ_RETURN, we may avoid zero initialization since the block will be executed at most once per method invocation.
b9a2f2b
to
66c810e
Compare
It turns out that the initial PR was too aggressive. My assumption that we always fully zero initialize structs with GC fields in the prolog was incorrect. The prolog zero initialization logic lives in runtime/src/coreclr/src/jit/codegencommon.cpp Lines 6403 to 6418 in 8079a27
So I modified my changes to skip zero initialization in structs with GC fields when While studying I also made the change to not block the optimization for This leaves us with about half of the original wins, which is still worthwhile. |
New framework diff results:
|
Benchmark diffs:
|
@AndyAyersMS @sandreenko PTAL |
@@ -4656,6 +4656,9 @@ void CodeGen::genCheckUseBlockInit() | |||
continue; | |||
} | |||
|
|||
// TODO-Review: The code below is currently unreachable. We are guaranteed to execute one of the | |||
// 'continue' statements above. | |||
#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to hold up getting this merged - looks great BTW - but why didn't you just delete this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to go back and see if any of this needs to be resurrected. It looks like part of the unreachable code was supposed to be responsible for correctness. I'll clean this up in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps open an issue for follow up on this, so we don't forget?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment to the issue that tracks improving the heuristic: #8890 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good.
Seems like we should probably block init more aggressively in the prolog (as well as init more efficiently).
Yes, we have #8890 open |
We have 3 places in the code where we may need to insert explicit zero
initialization. In some cases the initialization may become redundant with prolog
initialization so we have logic to avoid explicit initialization in some cases.
This change makes the logic less conservative.
If the variable is not being initialized in a loop, we can avoid explicit zero initialization if
Before this change we inserted explicit zero initalization in one more case: when compInitMem
wasn't set and the variable was a gc pointer or a struct with gc fields. That is not necessary,
so this change fixes that in fgVarNeedsExplicitZeroInit.
When we were allocating structs via newobj, we were always inserting explicit zero initializations
when importing an inlinee. This change applies the same optimization if the basic block can't be
in a loop after inlining.
When we were initializing inlinee locals, we were only optimizing explicit zero initializations
for structs; however, the same logic applies to non-structs.