-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hide 'align' instruction behind jmp #60787
Hide 'align' instruction behind jmp #60787
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsOverviewWith current loop alignment, Here is the sample diff where I have also added DesignA new datastructure
ImpactIdeally, this change should have been no code size diffs, however the reason behind diffs on x64 is that because of moving the As expected, there is no code size difference in arm64 because we just moved the
Detail diffs: https://gist.github.com/kunalspathak/9cc028b60a2e7aba82308fa1e94951ba Contributes to #43227
|
@dotnet/jit-contrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional questions/comments:
- Is there a measured perf difference? Or is this speculative, and we'll see what the lab shows?
- What if there are many "jmp" before the loop to align? Maybe the first one is a poor choice because it is in a hot loop. Or maybe the only one prior is in a hot loop. Maybe we shouldn't move it then, due to impacting the I-cache? Should we look at block weights to decide?
- It seems like there should be an "alignment planning" pass that decides where to put the alignment instructions, and annotates the BasicBlocks with those decisions. It could both set a flag that an alignment instruction is needed, and include a pointer to the BasicBlock for the loop we are aligning. Then, the codegen loop would just act on those decisions, and be very simple. Interleaving the planning and codegen loops seems complicated. It seems like this might remove the need for the
alignBBLists
list. - Is there an end-to-end design written down in comments in one place? It seems like there should be.
src/coreclr/jit/lclvars.cpp
Outdated
#if FEATURE_LOOP_ALIGN | ||
if (calculateAlign) | ||
{ | ||
// Track the blocks that needs alignment | ||
// except if it is first block, because currently adding padding | ||
// in prolog is not supported | ||
if (opts.compJitHideAlignBehindJmp && block->isLoopAlign() && (block != fgFirstBB)) | ||
{ | ||
alignBlocksList* curr = new (this, CMK_FlowList) alignBlocksList(block); | ||
|
||
if (alignBBLists == nullptr) | ||
{ | ||
alignBBLists = curr; | ||
} | ||
else | ||
{ | ||
alignBB->next = curr; | ||
} | ||
|
||
alignBB = curr; | ||
} | ||
} | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems very strange to put this code in this function, since it has nothing to do with ref counts. Why is it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't wanted to do another iteration for BasicBlocks
because for long methods, it can be costly. With that said, I have created a separate phase for align
placement where I now iterate over BasicBlock. However, there is still room for improvement there. In my latest code (which I will push shortly), bbPlaceLoopAlignInstructions()
, I want to skip the pass if there are no loops to align. It turns out that the only reliable way to check that after lowering is to check for BB->isLoopAlign()
. I would really like to piggy-back in this ref counting method because that is the last thing that is executed before code gen and the state of the BBF_LOOP_ALIGN
is accurate at that point and we would save iterating over the BasicBlocks again.
A simple logic in this method like below will help avoid iterating over the basic blocks list for scenarios where we add loop alignment initially but then removing it during flow graph analysis because of loops with calls, loop unrolling, compacting, etc. Currently, I have added a needsLoopAlignment
that gets set whenever we mark a loop as BBF_LOOP_ALIGN
but will never unset if we unmark any loop for alignment.
needsLoopAlignment |= block->isLoopAlign();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?
It's really unpleasant to have unrelated phases tied together in a somewhat implicit contract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?
Yes, that's exactly what I tried doing, but turns out that sometimes a block/loop is marked as not needing alignment multiple times specially in AddContainsCallAllContainingLoops()
that miscalculates the count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in AddContainsCallAllContainingLoops() that miscalculates the count.
Why would it do that? Once you've cleared the bit, you wouldn't decrement again. e.g.,
void Compiler::AddContainsCallAllContainingLoops(unsigned lnum)
{
#if FEATURE_LOOP_ALIGN
// If this is the inner most loop, reset the LOOP_ALIGN flag
// because a loop having call will not likely to benefit from
// alignment
if (optLoopTable[lnum].lpChild == BasicBlock::NOT_IN_LOOP)
{
BasicBlock* first = optLoopTable[lnum].lpFirst;
if (first->isLoopAlign())
{
assert(compAlignedLoopCount > 0);
--compAlignedLoopCount;
first->bbFlags &= ~BBF_LOOP_ALIGN;
JITDUMP("Removing LOOP_ALIGN flag for " FMT_LP " that starts at " FMT_BB " because loop has a call.\n", lnum,
first->bbNum);
}
}
#endif
...
src/coreclr/jit/emit.cpp
Outdated
#ifdef FEATURE_LOOP_ALIGN | ||
/* Save the prev IG */ | ||
|
||
emitPrevIG = emitCurIG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to implement this goal without creating a emitPrevIG
"global"?
src/coreclr/jit/emitxarch.cpp
Outdated
//if (emitComp->opts.disAsm) | ||
//{ | ||
// emitDispInsAddr(dstRW); | ||
|
||
// emitDispInsOffs(0, false); | ||
|
||
// printf(" %-9s ; stress-mode injected interrupt\n", "int3"); | ||
//} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comment? Or uncomment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally left it as commented so during debugging, we could uncomment and see the instruction. Let me know if you think otherwise.
75eb833
to
0aefa51
Compare
src/coreclr/jit/codegen.h
Outdated
@@ -217,6 +217,10 @@ class CodeGen final : public CodeGenInterface | |||
|
|||
void genInitializeRegisterState(); | |||
|
|||
#if FEATURE_LOOP_ALIGN | |||
void genMarkBlocksForLoopAlignment(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to delete this.
@BruceForstall - This should be ready for the review. Here are the changes:
|
CodeSize diffs: https://dev.azure.com/dnceng/public/_build/results?buildId=1464807&view=ms.vss-build-web.run-extensions-tab
I analyzed some of the PerfScore regressions on windows/arm64 for coreclr_tests, and they all are coming from moving the align instructions behind |
Should PerfScore not count align instructions in an IG following an unconditional branch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions
// | ||
// if (emitComp->opts.disAsm) | ||
//{ | ||
// emitDispInsAddr(dstRW); | ||
|
||
// emitDispInsOffs(0, false); | ||
|
||
// printf(" %-9s ; stress-mode injected interrupt\n", "int3"); | ||
//} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it commented intentionally so we can quickly uncomment and check it in disassembly. If you still prefer, I can delete it.
0aefa51
to
12c592d
Compare
Let me see if it can be done easily. |
Thanks for the suggestion. This gives the real data about PerfScore which is much nicer.
|
fix the alignBytesRemoved Some fixes and working model Some fixes and redesign Some more fixes more fixes fix Add the check for fgFirstBB misc changes code cleanup + JitHideAlignBehindJmp switch validatePadding only if align are before the loop IG More cleanup, remove commented code jit format
…st the targetIG to prevIG Add IGF_REMOVED_ALIGN flag for special scenarios
d965663
to
57759d0
Compare
@BruceForstall - Can you review it again? I think I have addressed all the feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One possible follow-up.
Co-authored-by: Bruce Forstall <[email protected]>
Overview
With current loop alignment,
align
instruction are placed just before the loop start. This can sometimes affect performance adversely if the block in whichalign
instruction is placed is hot or if it is part of nested loop, in which case processor would perform fetching and decoding ofnop
. This PR places thealign
instructions behind unconditionaljmp
instructions if they exist before the loop that it is trying to align. If no suchjmp
were present, thenalign
will be placed right before the loop start, the one it is done today.Here is the sample diff where
align
instruction was moved fromIG31
and placed intoIG29
after thejmp
instruction.I have also added
COMPlus_JitHideAlignBehindJmp
to turn off this feature for debugging purpose. In Release, it is always ON. I have also added a stress mode where 50% of time, we would emitINS_BREAKPOINT
instead ofalign
in situation where thealign
are placed behindjmp
to make sure that we never execute them.Design
A new datastructure
alignBlocksList
is created which is a linked list of allBasicBlock
that are the head of loop start needing alignment. This is done during final ref counting phase. During codegen, we pull the basic block from thealignBlocksList
and monitor if there is any unconditional jmp. If we find one, we emitalign
instruction and set the BB flagBBF_LOOP_ALIGN_ADDED
. This makes sure that if we see morejmp
before the actual loop start, we do not addalign
instructions again. When we reach a point in the flow where next block is the loop start, we would update thetargetIG
(see below) of thealignInstr
. At this time, we also determine that if we didn't see anyjmp
so far (usingBBF_LOOP_ALIGN_ADDED
), we would emit thealign
instruction. Finally, we move to the nextBasicBlock
ofalignBlocksList
.instrDescAlign
data structure has been updated.idaIG
field in it now points to the IG that contains thealign
instruction. This IG can be before the loop or can be some on the previous IG that ends withjmp
.idaTargetIG
is the IG that earlier used to beidaIG
. It points to the IG before the IG that has loop. This is used when we want to calculate theloopSize
. Some of the changes are around getting the right field wherever necessary.IGF_LOOP_ALIGN
flag, which previously used to be on an IG just before the loop IG has been replaced byIGF_HAS_ALIGN
and will be on IG that contains thealign
instruction. Again, this may or may not be the one just before the loop IG. Finally, to handle special scenarios where an IG that is part of loop might havealign
instruction for a different IG, flagIGF_REMOVED_ALIGN
is added that tells if thealign
instruction present in that IG are removed or not.Impact
Ideally, this change should have been no code size diffs, however the reason behind diffs on x64 is that because of moving the
align
instruction, we tend to shorten some jumps and that changes the heuristics calculation of alignment. However, as seen below, the impact is minimal and number of unchanged methods are far more than the impacting methods.As expected, there is no code size difference in arm64 because we just moved the
align
instruction around.Detail diffs: https://gist.github.com/kunalspathak/9cc028b60a2e7aba82308fa1e94951ba
Contributes to #43227