JIT: Add a pass of liveness for the new locals inside physical promotion #86043
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
Used when decomposing stores to skip the fields that are dying, and the remainder if it is dying. Also allows forward sub to kick in for the inserted replacement locals. Not sure yet whether this is how I want to go about the liveness here... we might instead combine this with the existing early liveness pass, or use the early liveness information to figure out where to look when calculating more precise liveness. But this at least gives us an implementation to experiment with.
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental
Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental
Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental
Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental
Azure Pipelines successfully started running 3 pipeline(s).
The jitstress failures are a couple of timeouts in event source tests in a scenario without physical promotion enabled, so they are almost certainly unrelated. The runtime-jit-experimental failures were recently fixed, I believe, so I will rerun that one, but will push a comment cleanup PR first. No failures in libraries-jitstress.
/azp run runtime-jit-experimental
Azure Pipelines successfully started running 1 pipeline(s).
The failure looks like #86233.
cc @dotnet/jit-contrib PTAL @AndyAyersMS
The throughput impact is very small, so I don't think there will be a reason to try to be much smarter here. Well, unless it turns out that's because of a bug (but I have looked at a few examples and things look reasonable). After this is merged, I think most of the basics are in place, to the point where I can get started on readjusting the heuristics.
// TODO: We need a scalability limit on these, we cannot always track
// the remainder and all fields.
Will address this in a separate follow-up.
This is similar to the "coarsening" discussion we had re alias analysis the other day: you should be able to have models where you group together fields, and/or fields + remainder, and get approximate liveness for them.
Would be interesting to know if you see cases where there are large structs with "hot fields", so that it is worth tracking a subset but coarsening everything else, and whether that diverges somehow from just promoting the hot fields and leaving the rest.
Would be interesting to know if you see cases where there are large structs with "hot fields" so that it is worth tracking a subset but coarsening everything else, and if that diverges somehow from just promoting the hot fields and leaving the rest.
That's a good point... the heuristics should likely include something to avoid creating locals that are just going to end up untracked.
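As a minimal illustration of that idea (invented types and threshold, not the JIT's promotion data structures), coarsening could amount to assigning liveness indices so that hot fields are tracked precisely while all cold fields and the remainder share a single approximate index:

#include <cstdint>
#include <vector>

// Hypothetical coarsening model: fields above a hotness threshold get their
// own liveness index; everything else (plus the remainder) shares one index,
// so the ordinary bit-vector dataflow yields cheap but approximate liveness
// for the cold parts.
struct FieldInfo
{
    unsigned offset;
    unsigned size;
    unsigned accessCount; // assumed to come from ref counting / profile data
};

static std::vector<unsigned> AssignLivenessIndices(const std::vector<FieldInfo>& fields,
                                                   unsigned hotThreshold,
                                                   unsigned* indexCount)
{
    std::vector<unsigned> indexOfField(fields.size());
    unsigned nextIndex = 0;
    unsigned coldIndex = UINT32_MAX; // lazily created shared index

    for (size_t i = 0; i < fields.size(); i++)
    {
        if (fields[i].accessCount >= hotThreshold)
        {
            indexOfField[i] = nextIndex++; // hot field: precise tracking
        }
        else
        {
            if (coldIndex == UINT32_MAX)
            {
                coldIndex = nextIndex++; // one shared index for cold fields (and the remainder)
            }
            indexOfField[i] = coldIndex;
        }
    }

    *indexCount = nextIndex;
    return indexOfField;
}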
#ifdef DEBUG
    unsigned char lvTrackedWithoutIndex : 1; // Tracked but has no lvVarIndex (i.e. only valid GTF_VAR_DEATH flags, used
                                             // by physical promotion)
#endif
This is just used to print "(last use)" in the dump for the new locals introduced by physical promotion.
bool AggregateInfo::OverlappingReplacements(unsigned offset,
                                            unsigned size,
                                            Replacement** firstReplacement,
                                            Replacement** endReplacement)
{
    size_t firstIndex = Promotion::BinarySearch<Replacement, &Replacement::Offset>(Replacements, offset);
    if ((ssize_t)firstIndex < 0)
    {
        firstIndex = ~firstIndex;
        if (firstIndex > 0)
        {
            Replacement& lastRepBefore = Replacements[firstIndex - 1];
            if ((lastRepBefore.Offset + genTypeSize(lastRepBefore.AccessType)) > offset)
            {
                // Overlap with last entry starting before offs.
                firstIndex--;
            }
            else if (firstIndex >= Replacements.size())
            {
                // Starts after last replacement ends.
                return false;
            }
        }

        const Replacement& first = Replacements[firstIndex];
        if (first.Offset >= (offset + size))
        {
            // First candidate starts after this ends.
            return false;
        }
    }

    assert((firstIndex < Replacements.size()) && Replacements[firstIndex].Overlaps(offset, size));
    *firstReplacement = &Replacements[firstIndex];

    if (endReplacement != nullptr)
    {
        size_t lastIndex = Promotion::BinarySearch<Replacement, &Replacement::Offset>(Replacements, offset + size);
        if ((ssize_t)lastIndex < 0)
        {
            lastIndex = ~lastIndex;
        }

        // Since we verified above that there is an overlapping replacement
        // we know that lastIndex exists and is the next one that does not
        // overlap.
        assert(lastIndex > 0);
        *endReplacement = Replacements.data() + lastIndex;
    }

    return true;
}
I introduced a new AggregateInfo and moved this method into it.
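For context on the lookup above, the sketch below illustrates the binary-search convention it relies on (a standalone stand-in, not the actual Promotion::BinarySearch): a hit returns the index, a miss returns the bitwise complement of the insertion point, which is negative when viewed as a signed value.

#include <cstddef>
#include <vector>

// Standalone illustration of the search convention: on a hit return the index;
// on a miss return ~insertionPoint, so the caller can recover the insertion
// point with another bitwise complement.
static std::ptrdiff_t BinarySearchByOffset(const std::vector<unsigned>& offsets, unsigned key)
{
    std::size_t lo = 0;
    std::size_t hi = offsets.size();
    while (lo < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        if (offsets[mid] == key)
        {
            return (std::ptrdiff_t)mid;
        }
        if (offsets[mid] < key)
        {
            lo = mid + 1;
        }
        else
        {
            hi = mid;
        }
    }

    // Not found: ~lo encodes "index of the first entry with a larger offset".
    return ~(std::ptrdiff_t)lo;
}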
bool StructSegments::Segment::IntersectsOrAdjacent(const Segment& other) const
{
    if (End < other.Start)
    {
        return false;
    }

    if (other.End < Start)
    {
        return false;
    }

    return true;
}
These StructSegments methods are now being used as part of replacement picking, so I've moved them into the main promotion.cpp.
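A small self-contained example of why the two early-outs suffice when segments are half-open [Start, End) (the Segment below is a stand-in mirroring the shape of the real one, not the JIT's definition): touching segments such as [0, 4) and [4, 8) are reported as adjacent and can therefore be merged.

#include <cassert>

// Half-open segments: End == other.Start means "touching", which the method
// deliberately treats the same as overlapping so the caller can merge them.
struct Segment
{
    unsigned Start;
    unsigned End;

    bool IntersectsOrAdjacent(const Segment& other) const
    {
        if (End < other.Start)
            return false; // entirely before 'other', with a gap in between
        if (other.End < Start)
            return false; // entirely after 'other', with a gap in between
        return true;
    }
};

int main()
{
    assert(Segment{0, 4}.IntersectsOrAdjacent(Segment{4, 8}));  // adjacent: can be merged
    assert(!Segment{0, 4}.IntersectsOrAdjacent(Segment{5, 8})); // one byte of padding between them
    assert(Segment{0, 8}.IntersectsOrAdjacent(Segment{2, 6}));  // proper overlap
    return 0;
}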
Ping @AndyAyersMS, but no rush if you're busy with PGO things.
Looks good. Left some comments but those can be handled in follow-ups.
// For LCL_ADDR this is a retbuf and we expect it to be a def. We
// don't know the exact size here so we cannot mark anything as
// being fully defined, thus we can just return.
assert(isDef);
I wonder about this part -- for managed callees it seems like we could access their return type and figure out the extent of the writes they do.
In principle the callee must either write that entire extent or write nothing (nothing only if it throws rather than returns; then if we're in a try the previous definition can reach handlers/filters). But maybe we skip writing to padding in some cases.
Yes, the problem is finding the call node that contains the retbuf, since the linked locals list does not contain it. In principle it shouldn't be hard to quickly find it utilizing GTF_CALL, so we could be a bit more precise about this.
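A purely illustrative sketch of that search (all node and flag names below are invented stand-ins, not the JIT's GenTree API): descend only into subtrees whose flags say they contain a call, and stop at the call whose retbuf operand matches.

#include <vector>

// Mock tree node; FLAG_CONTAINS_CALL plays the role of a "this subtree
// contains a call" bit so unrelated subtrees can be skipped cheaply.
enum NodeFlags : unsigned
{
    FLAG_CONTAINS_CALL = 1 << 0,
};

struct Node
{
    bool               isCall = false;
    Node*              retBuf = nullptr; // for calls: the retbuf address operand
    unsigned           flags  = 0;
    std::vector<Node*> operands;
};

// Returns the call node whose retbuf operand is 'retBufAddr', or nullptr.
static Node* FindCallForRetBuf(Node* tree, Node* retBufAddr)
{
    if (tree->isCall && (tree->retBuf == retBufAddr))
    {
        return tree;
    }

    for (Node* op : tree->operands)
    {
        // Skip subtrees that cannot contain a call at all.
        if ((op->flags & FLAG_CONTAINS_CALL) == 0)
        {
            continue;
        }

        if (Node* found = FindCallForRetBuf(op, retBufAddr))
        {
            return found;
        }
    }

    return nullptr;
}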
for (BasicBlock* block = m_compiler->fgLastBB; block != nullptr; block = block->bbPrev)
{
    m_hasPossibleBackEdge |= block->bbNext && (block->bbNext->bbNum <= block->bbNum);
We know for sure we have proper bbNum ordering?
In cases with loops this will converge faster if you walk the blocks in post-order here, to maximize the propagation per iteration.
Looks like the same goes for regular liveness so perhaps we can fix that too -- maybe do both as a follow-up?
We know for sure we have proper bbNum ordering?
Even if we don't, it will just result in m_hasPossibleBackEdge being set to true, which might result in one more iteration of the fixpoint computation than necessary.
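A compressed standalone sketch (simplified mock types, not the JIT's LiveVarAnalysis) of the backward fixpoint being discussed: live-out is the union of the successors' live-in, live-in = use | (live-out & ~def), and the backward walk repeats only while something changed and a back edge is possible, so a spuriously set back-edge flag costs at most extra iterations, never correctness.

#include <cstdint>
#include <vector>

// Simplified mock of a backward liveness fixpoint over <= 64 tracked locals.
struct Block
{
    uint64_t use     = 0; // locals read before any write in this block
    uint64_t def     = 0; // locals fully written in this block
    uint64_t liveIn  = 0;
    uint64_t liveOut = 0;
    std::vector<int> succs; // successor block indices
};

static void ComputeLiveness(std::vector<Block>& blocks, bool hasPossibleBackEdge)
{
    bool changed;
    do
    {
        changed = false;
        // Walk blocks from last to first so that, in the common forward-flowing
        // case, each block sees up-to-date successor info within one pass.
        for (size_t i = blocks.size(); i-- > 0;)
        {
            Block&   block   = blocks[i];
            uint64_t liveOut = 0;
            for (int succ : block.succs)
            {
                liveOut |= blocks[succ].liveIn;
            }

            uint64_t liveIn = block.use | (liveOut & ~block.def);
            if ((liveIn != block.liveIn) || (liveOut != block.liveOut))
            {
                block.liveIn  = liveIn;
                block.liveOut = liveOut;
                changed       = true;
            }
        }
        // Without a possible back edge, one backward pass is already a fixpoint.
    } while (changed && hasPossibleBackEdge);
}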
When I did early liveness I considered computing strongly connected components and doing the localized fixpoint computation within each SCC, which should be the optimal thing. However, when I measured the actual throughput cost of the fixpoint computation, I felt it wasn't worth the extra complexity; from #79346 (comment):
Base: 232865278205, Diff: 236405285056, +1.5202%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z : 616293246 : +30.93% : 11.44% : +0.2647%
?fgMarkUseDef@Compiler@@AEAAXPEAUGenTreeLclVarCommon@@@Z : 426414825 : +39.60% : 7.91% : +0.1831%
?fgInterBlockLocalVarLiveness@Compiler@@QEAAXXZ : 353468679 : +43.94% : 6.56% : +0.1518%
?fgPerBlockLocalVarLiveness@Compiler@@QEAAXXZ : 258048984 : +26.39% : 4.79% : +0.1108%
?lvaSortByRefCount@Compiler@@QEAAXXZ : 222829814 : +39.99% : 4.14% : +0.0957%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z : 142747231 : +11.06% : 2.65% : +0.0613%
?PostOrderVisit@ForwardSubVisitor@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z : 136636902 : NA : 2.54% : +0.0587%
?WalkTree@?$GenTreeVisitor@VLocalSequencer@@@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z : 134894223 : NA : 2.50% : +0.0579%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z : 129181805 : +13.59% : 2.40% : +0.0555%
?fgComputeLifeTrackedLocalUse@Compiler@@QEAAXAEAPEA_KAEAVLclVarDsc@@PEAUGenTreeLclVarCommon@@@Z : 127997140 : +37.14% : 2.38% : +0.0550%
?fgForwardSubStatement@Compiler@@AEAA_NPEAUStatement@@@Z : 119968922 : +70.17% : 2.23% : +0.0515%
?fgMarkAddressExposedLocals@Compiler@@AEAA?AW4PhaseStatus@@XZ : 119662088 : +72.72% : 2.22% : +0.0514%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : 116051428 : +13.69% : 2.15% : +0.0498%
?WalkTree@?$GenTreeVisitor@V?$GenericTreeWalker@$00$0A@$0A@$00@@@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z : 111845378 : +56.69% : 2.08% : +0.0480%
?fgComputeLifeTrackedLocalDef@Compiler@@QEAA_NAEAPEA_KAEBQEA_KAEAVLclVarDsc@@PEAUGenTreeLclVarCommon@@@Z : 108064056 : +34.46% : 2.01% : +0.0464%
memset : 104530112 : +5.76% : 1.94% : +0.0449%
PerBlockLiveness is the method called for each basic block in each iteration of the fixpoint computation. So around 0.26% in TP cost for the fixpoint computation from adding early liveness.
It would still be an interesting experiment, and RPO traversal is also simpler than computing full SCCs.
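A minimal hedged sketch of that alternative (standalone types, not the JIT's flowgraph): compute a post-order over the successor graph once with a DFS, then run the backward pass in that order so most blocks are visited after their successors.

#include <utility>
#include <vector>

// Standalone sketch: iterative DFS post-order over a successor graph. Visiting
// blocks in this order during the backward liveness pass means most successors
// are processed before their predecessors, which speeds up convergence.
static std::vector<int> ComputePostOrder(const std::vector<std::vector<int>>& succs, int entry)
{
    std::vector<int>  postOrder;
    std::vector<bool> visited(succs.size(), false);
    // Each frame is (block index, next successor index to visit).
    std::vector<std::pair<int, size_t>> stack;

    visited[entry] = true;
    stack.push_back({entry, 0});

    while (!stack.empty())
    {
        auto& [block, succIndex] = stack.back();
        if (succIndex < succs[block].size())
        {
            int succ = succs[block][succIndex++];
            if (!visited[succ])
            {
                visited[succ] = true;
                stack.push_back({succ, 0});
            }
        }
        else
        {
            postOrder.push_back(block);
            stack.pop_back();
        }
    }

    return postOrder; // iterate this order directly for a backward dataflow pass
}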
Used when decomposing stores to skip the fields that are dying, and the remainder if it is dying. Also allows forward sub to kick in for the inserted replacement locals.
Not sure yet whether this is how I want to go about the liveness here... we might instead combine this with the existing early liveness pass, or use the early liveness information to figure out where to look when calculating more precise liveness. But this at least gives us an implementation to experiment with. The throughput impact seems very small, so doing something smarter seems unnecessary.
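To make the first sentence concrete, here is an illustrative-only sketch of how such death information could be consumed while decomposing a struct store; the types and helper below are invented and do not match the JIT's actual Replacement or decomposition code.

#include <vector>

// Per-field replacement created by physical promotion, plus the liveness
// verdict for it at the point of this particular store.
struct FieldReplacement
{
    unsigned offset;
    unsigned size;
    unsigned lclNum;
    bool     dies; // liveness says this replacement is dead after the store
};

struct FieldStore
{
    unsigned dstLclNum;
    unsigned srcOffset;
    unsigned size;
};

// Emit one store per replacement field, skipping fields (and the remainder)
// that liveness proved dead at this point.
static std::vector<FieldStore> DecomposeStore(const std::vector<FieldReplacement>& fields,
                                              bool remainderDies,
                                              unsigned structSize,
                                              unsigned remainderLclNum)
{
    std::vector<FieldStore> stores;
    for (const FieldReplacement& field : fields)
    {
        if (field.dies)
        {
            continue; // dead field: no need to materialize the store
        }
        stores.push_back({field.lclNum, field.offset, field.size});
    }

    if (!remainderDies)
    {
        // Very coarse stand-in for the real remainder handling: copy the whole
        // struct into the remainder location when it is still live.
        stores.push_back({remainderLclNum, 0, structSize});
    }

    return stores;
}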