-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: jump threading #46257
JIT: jump threading #46257
Conversation
Optimize branches where a branch outcome is fully determined by a dominating branch, and both true and false values can reach the dominated branch.
cc @dotnet/jit-contrib Note we expect this pattern to come up more frequently once we enable guarded devirtualization via PGO. Jit diffs via PMI. Most regressions are extra CSEs. A few are places where LSRA does some odd spills.
|
Sample diff. public static int Y(object o)
{
int result = 0;
if (o.GetType() == typeof(string)) result++;
if (o.GetType() == typeof(string)) result++;
if (o.GetType() == typeof(string)) result++;
if (o.GetType() == typeof(string)) result++;
return result;
} ; BEFORE
;
; Assembly listing for method X:Y(Object):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T02] ( 3, 3 ) ref -> rcx class-hnd
; V01 loc0 [V01,T00] ( 9, 5.50) int -> rax
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
; V03 cse0 [V03,T01] ( 5, 5 ) long -> rdx "CSE - aggressive"
;
; Lcl frame size = 0
G_M10236_IG01:
;; bbWeight=1 PerfScore 0.00
G_M10236_IG02:
xor eax, eax
mov rdx, qword ptr [rcx]
mov rcx, 0xD1FFAB1E
cmp rdx, rcx
jne SHORT G_M10236_IG04
;; bbWeight=1 PerfScore 3.75
G_M10236_IG03:
mov eax, 1
;; bbWeight=0.50 PerfScore 0.12
G_M10236_IG04:
mov rcx, 0xD1FFAB1E
cmp rdx, rcx
jne SHORT G_M10236_IG06
;; bbWeight=1 PerfScore 1.50
G_M10236_IG05:
inc eax
;; bbWeight=0.50 PerfScore 0.12
G_M10236_IG06:
mov rcx, 0xD1FFAB1E
cmp rdx, rcx
jne SHORT G_M10236_IG08
;; bbWeight=1 PerfScore 1.50
G_M10236_IG07:
inc eax
;; bbWeight=0.50 PerfScore 0.12
G_M10236_IG08:
mov rcx, 0xD1FFAB1E
cmp rdx, rcx
jne SHORT G_M10236_IG10
;; bbWeight=1 PerfScore 1.50
G_M10236_IG09:
inc eax
;; bbWeight=0.50 PerfScore 0.12
G_M10236_IG10:
ret
; Total bytes of code 77, prolog size 0, PerfScore 17.45, instruction count 19 ; AFTER
;
; Assembly listing for method X:Y(Object):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 3, 3 ) ref -> rcx class-hnd
; V01 loc0 [V01,T00] ( 9, 9 ) int -> rax
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M10236_IG01:
;; bbWeight=1 PerfScore 0.00
G_M10236_IG02:
xor eax, eax
mov rdx, 0xD1FFAB1E
cmp qword ptr [rcx], rdx
jne SHORT G_M10236_IG03
mov eax, 1
inc eax
inc eax
inc eax
;; bbWeight=1 PerfScore 4.50
G_M10236_IG03:
ret
; Total bytes of code 29, prolog size 0, PerfScore 8.40, instruction count 9 Note the lack downstream opts; forward-sub or similar would help clean this up. But the control flow is simplified. |
Updated diffs now that we're correctly rejecting cases with both an ambiguous pred and a fall through.
|
Nice! Btw, do we have a tracking issue to fold this (from your sample): mov eax, 1
inc eax
inc eax
inc eax into mov eax, 4 |
And from my understanding it only works when conditions have the same VN? e.g. the following won't work? public static int Test(int o)
{
int result = 0;
if (o > 100)
{
if (o > 10) result++;
}
return result;
} or e.g.: public static int Test(object o)
{
int result = 0;
if (o.GetType() != typeof(string))
{
if (o.GetType() == typeof(string)) result++; // is never taken
}
return result;
} Not sure it will increase the jit-diff though. |
No, not specifically this. We have #6973 but that would be happening upstream. There is no general framework in the jit for cross-statement expression optimization. Ideally perhaps SSA should be leveraged to make look like we have built cross-tree edges and could operate on the factored-out expression opts in morph, but we're quite a ways from anything like that.
Correct. I have a more general version in the works that can handle some cases where the VN differs. |
Small revision to account for all preds, and only note the fall through pred when it's not an ambiguous pred. Only one new diff from the results above because EH preds are now properly counted as ambiguous preds. I suspect the EH flow checks are a bit too stringent, but will leave as is for now. |
@EgorBo preview of the more general version, that tries to handle cases where knowing the value of one relop implies you know the value of another: AndyAyersMS/runtime@AndyAyersMS:JumpThreading...AndyAyersMS:RelopImpliesRelop Not yet sure if the extra complexity is worth the trouble... also probably not yet entirely bug free. |
Nice! Will try to play with it locally 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions/comments.
// However, be conservative if block is in a try as we might not | ||
// have a full picture of EH flow. | ||
// | ||
if (!block->hasTryIndex()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: is not it enough to check that all blocks have the same try index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
I need to take a deeper look at how the jit models EH flow. It seems like some aspects are incomplete or approximate and so it might be dangerous to make assumptions when fgReachable
returns false. I think we are safeguarded here and in the previous work I did because dominance information in the flow graph is also approximate in a way. But I am not 100% sure.
For example, in a case like:
public static int Main(string[] args)
{
int x = args.Length;
int y = 0;
try
{
if (x == 0)
{
throw new Exception();
}
}
catch (Exception e)
{
}
if (x == 0)
{
y = 100;
}
return y;
}
We possibly could see that the second (x == 0)
is dominated by the first (x == 0)
, and if we did, fgReachable
might only see the false path between the two, because the true path involves a throw/catch. If this happened we might mistakenly believe the second compare is always false. However, when we build dominators the second compare is not dominated by the first, so we never consider jump threading at all.
After computing reachability sets:
------------------------------------------------
BBnum Reachable by
------------------------------------------------
BB01 : BB01
BB02 : BB01 BB02
BB03 : BB01 BB02 BB03
BB04 : BB01 BB02 BB04 BB07
BB05 : BB01 BB02 BB04 BB05 BB07
BB06 : BB01 BB02 BB04 BB05 BB06 BB07
BB07 : BB07
After computing reachability:
------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region]
------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..006)
BB02 [0001] 1 0 BB01 1 [006..009)-> BB04 ( cond ) T0 try {
BB03 [0002] 1 0 BB02 0 [009..00F) (throw ) T0 }
BB04 [0005] 2 BB02,BB07 1 [014..017)-> BB06 ( cond )
BB05 [0006] 1 BB04 1 [017..01A)
BB06 [0007] 2 BB04,BB05 1 [01A..01C) (return)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
BB07 [0004] 1 0 1 [011..014)-> BB04 ( cret ) H0 F catch { }
------------------------------------------------------------------------------------------------
*************** In fgComputeDoms
Dominator computation start blocks (those blocks with no incoming edges):
BB01 BB07
------------------------------------------------
BBnum Dominated by
------------------------------------------------
BB07: BB07
BB01: BB01
BB02: BB02 BB01
BB03: BB03 BB02 BB01
BB04: BB04
BB05: BB05 BB04
BB06: BB06 BB04
Inside fgBuildDomTree
After computing the Dominance Tree:
BB01 : BB02
BB02 : BB03
BB04 : BB06 BB05
Note in the above that BB04
is not dominated by BB02
, because its pred BB07
is an "eh entry". But what's odd is that neither BB04
or BB07
are dominated by BB01
but reachability says they are reachable from BB01
.
So mixing information from fgComputeDoms
and fgReachable
may be risky.
I can't come up with an example with the current opts where we'd make a bad assumption, but if/ when we extend things to handle "related" compares then perhaps we'll run into trouble.
What's also interesting is that SSA has its own dominance computation, and does it differently and seemingly a bit more accurately; here BB01
properly dominates BB04
and BB07
. Seems like we ought not to have two different notions of dominance in the jit.
*************** In SsaBuilder::Build()
[SsaBuilder] Max block count is 8.
------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region]
------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..006)
BB02 [0001] 1 0 BB01 1 [006..009)-> BB04 ( cond ) T0 try {
BB03 [0002] 1 0 BB02 0 [009..00F) (throw ) T0 }
BB04 [0005] 2 BB02,BB07 1 [014..017)-> BB06 ( cond )
BB05 [0006] 1 BB04 1 [017..01A)
BB06 [0007] 2 BB04,BB05 1 [01A..01C) (return)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
BB07 [0004] 1 0 0 [011..014)-> BB04 ( cret ) H0 F catch { }
------------------------------------------------------------------------------------------------
*************** Exception Handling table
index eTry, eHnd
0 :: - Try at BB02..BB03 [006..011), Handler at BB07..BB07 [011..014)
[SsaBuilder] Topologically sorted the graph.
[SsaBuilder::ComputeImmediateDom]
Inside fgBuildDomTree
After computing the Dominance Tree:
BB01 : BB07 BB04 BB02
BB02 : BB03
BB04 : BB06 BB05
I'm not going to have time to unravel this before the holidays, so will come back to it in the new year.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am going to add a bespoke reachability computation (optReachable
) for this.
BasicBlock* const predBlock = pred->flBlock; | ||
|
||
const bool isTruePred = | ||
((predBlock == domBlock) && (trueSuccessor == block)) || fgReachable(trueSuccessor, predBlock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not fgReachable
have linear complexity in the worst case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty cheap, basically a bit vector check.
// Since flow is going to bypass block, make sure there | ||
// is nothing in block that can cause a side effect. | ||
// | ||
// Note we neglect PHI assignments. This reflects a general lack of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: What does it cause? Is it a correctness issue or CQ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I believe it's just CQ, causing us to miss opportunities downstream (say range prop, which chases back through PHIs, will see more viable paths than actually exist). But hard to be 100% confident.
We really should be more explicit about the ways the optimizer relies on stale analysis data and potentially stale IR, and what (if any) update strategies we employ.
BasicBlock* const predBlock = pred->flBlock; | ||
numPreds++; | ||
|
||
// We don't do switch updates, yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why can't we leave the switches flowing into the original blocks as we do with other AmbiguousPreds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should just add handling for switches; it's not that hard, just a different kind of update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (after JITDUMP fixes).
we can run the algorithm forward instead of backward.
With the new reachability computation, similar diffs to what I posted above (note I've taken a few merges so input assemblies have changed)
|
Still want to make a few more changes (switches, and perhaps we can now assert that if x dominates y, at least one of x's successors must reach y.). |
Took a long look at asserting that there now must be a path from dominator to dominated block, but that doesn't hold up as we optimize, as we can create unreachable blocks and subgraphs. So just added some comments, and we'll just tolerate those rare cases where we can't find paths. |
Optimize branches where a branch outcome is fully determined by a dominating
branch, and both true and false values can reach the dominated branch.