-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PRE for multi-level field loads in a loop #5110
Conversation
079a1d6
to
2f462b3
Compare
} NEXT_PREDECESSOR_BLOCK; | ||
|
||
Assert(tail); | ||
return tail; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when there are multiple back-edges?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment at the call site of this method explains why is it ok to get any tail block - basically we'll be getting the value of a PRE candidate sym on the block returned by this method. Since, the sym is a PRE candidate sym, its value must be the same on all tail blocks. I'll update the method name to GetAnyTailBlock
lib/Backend/GlobOpt.cpp
Outdated
|
||
if (ldSrc->m_sym != propertySym) | ||
{ | ||
// It's possible that the propertySym but have equivalent objPtrs. Verify their values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that the propertySy [](start = 25, length = 19)
Nit: fix comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…p candidate is live at InlineeStart
Merge pull request #5110 from rajatd:oxy-no-ldslot This PR enhances the existing Partial Redundancy Elimination (PRE) infrastructure to catch multi-level field loads, i.e., field loads of the form o.x.y. **Challenges:** 1. Loading o.x.y translates to roughly the following IR: T1 = o.x T2 = T1.y In order to put a load of `T1.y` in the loop landing pad, we need `T1` to be live there. But, since it’s a temp its not live in the landing pad. 2. In the prepass, when we create initial values for `o.x` and `T1.y`, we give them new symStores (because we aren't sure what symStore will those syms end up with at the end of the loop). Symbol-to-value map entries look something like this: symbol -> value -> symstore o.x -> v1 -> T3 T1.y ->v2 -> T4 The PRE code that runs after the loop prepass, then uses these symStores to preload the PRE candidate property syms (`o.x` and `T1.y` in the above example). Assuming we could solve the previous problem by inserting `T1 = o.x` in the landing pad, landing pad would look something like this: T3 = o.x T1 = o.x T4 = T1.y But, now when we optimize the landing pad, `T1.y` gets object-pointer copy-propped to `T3.y`. So, now we have `T3.y` live coming in to the loop, but `T1.y` live on the back-edge. Since, these are two different property syms, they don’t produce a value for any `.y` during merging info on the loop header. This defeats the purpose of putting an initial value for `T1.y` in the landing pad. **Solutions:** 1. For making `T1` live in the landing pad, I rely on it being single-def and go with the premise that if its single def, I should be able to insert its defining instruction in the landing pad, as long as the rhs of the instruction is live in the landing pad; if its not, I try to make the rhs live by either: a. Processing it as a PRE candidate (if it was one), or b. recursing on the same logic as above for making it live. 2. Since, `T3` is the symstore for `o.x` on the back-edge, and we insert `T3 = o.x` in the landing pad, `T3` will eventually object-pointer copy-prop into `T2 = T1.y` (making it `T2 = T3.y`) during the main pass of the loop. To then field copy prop `T3.y`, we need `T3.y` to have a value on the loop header and to have that, we need `T3.y` live on the back edge. And that is exactly what I do - when preloading `T1.y` in the landing pad, I recognize that `T3` is going to be the copy-prop sym for `T1` and I make `T3.y` live in the landing pad and all the back edges. **Perf** ``` Kraken Left run time Right run time ? Run time ? Run time % Comment -------------------------------- ----------------- ----------------- ---------- ------------ --------------- Ai-astar 203.80 ms ±0.90% 205.83 ms ±0.92% 2.03 ms 1.00% Audio-beat-detection 105.00 ms ±0.95% 102.50 ms ±0.49% -2.50 ms -2.38% Audio-dft 141.89 ms ±0.99% 140.75 ms ±0.18% -1.14 ms -0.80% Audio-fft 74.50 ms ±0.70% 75.91 ms ±0.74% 1.41 ms 1.89% Audio-oscillator 73.63 ms ±0.43% 69.25 ms ±0.24% -4.38 ms -5.94% Improved Imaging-darkroom 177.83 ms ±0.92% 176.33 ms ±0.68% -1.50 ms -0.84% Imaging-desaturate 75.57 ms ±0.49% 75.43 ms ±0.76% -0.14 ms -0.19% Imaging-gaussian-blur 189.25 ms ±0.13% 165.60 ms ±0.24% -23.65 ms -12.50% Improved Json-parse-financial 58.50 ms ±0.85% 57.60 ms ±0.89% -0.90 ms -1.54% Json-stringify-tinderbox 32.00 ms ±0.81% 32.00 ms ±0.99% 0.00 ms 0.00% Stanford-crypto-aes 145.20 ms ±0.96% 144.80 ms ±0.94% -0.40 ms -0.28% Stanford-crypto-ccm 112.00 ms ±0.89% 111.50 ms ±0.45% -0.50 ms -0.45% Stanford-crypto-pbkdf2 246.33 ms ±0.27% 244.80 ms ±0.91% -1.53 ms -0.62% Stanford-crypto-sha256-iterative 63.75 ms ±0.39% 62.83 ms ±0.96% -0.92 ms -1.44% -------------------------------- ----------------- ----------------- ---------- ------------ --------------- Total 1699.25 ms ±0.67% 1665.14 ms ±0.66% -34.11 ms -2.01% Likely improved Octane Left score Right score ∆ Score ∆ Score % Comment ---------------- --------------- --------------- ------- --------- --------------- Box2d 23756.13 ±0.16% 24083.13 ±0.58% 327.00 1.38% Code-load 8355.50 ±0.25% 8435.50 ±0.15% 80.00 0.96% Improved Crypto 25798.93 ±0.24% 26194.50 ±0.18% 395.57 1.53% Improved Deltablue 19073.33 ±0.66% 20072.00 ±0.23% 998.67 5.24% Improved Earley-boyer 32181.25 ±0.36% 31788.40 ±0.49% -392.85 -1.22% Gbemu 36176.00 ±0.20% 36268.00 ±0.38% 92.00 0.25% Mandreel 20893.25 ±0.32% 20803.50 ±0.27% -89.75 -0.43% Mandreel latency 61067.78 ±0.71% 61170.50 ±0.23% 102.72 0.17% Navier-stokes 29619.50 ±0.30% 31657.71 ±0.30% 2038.21 6.88% Improved Pdfjs 13850.75 ±0.41% 13977.63 ±0.44% 126.88 0.92% Raytrace 31511.67 ±0.38% 31847.71 ±0.30% 336.05 1.07% Likely improved Regexp 3627.00 ±0.51% 3586.13 ±0.11% -40.88 -1.13% Richards 17123.00 ±0.49% 17286.50 ±0.26% 163.50 0.95% Splay 16760.20 ±0.49% 16873.75 ±0.32% 113.55 0.68% Splay latency 34004.50 ±0.58% 34297.50 ±0.44% 293.00 0.86% Typescript 28073.38 ±0.49% 28221.88 ±0.41% 148.50 0.53% Zlib 70900.88 ±0.17% 70701.75 ±0.25% -199.13 -0.28% ---------------- --------------- --------------- ------- --------- --------------- Total 22911.87 ±0.39% 23154.84 ±0.31% 242.96 1.06% Likely improved ``` Follow up work will target catching multi-level field loads that have LdEnv in the load sequence, and loads of and stores to the same field in the same loop
… missing values Merge pull request #5300 from rajatd:prepareSeg If the entire range of a memop operation fits in the head segment, properly check if there were any missing items in the head outside of that range, before setting HasNoMissingValues to true on the dst array. This is needed only after #5110.
This PR enhances the existing Partial Redundancy Elimination (PRE) infrastructure to catch multi-level field loads, i.e., field loads of the form o.x.y.
Challenges:
Loading o.x.y translates to roughly the following IR:
In order to put a load of
T1.y
in the loop landing pad, we needT1
to be live there. But, since it’s a temp its not live in the landing pad.In the prepass, when we create initial values for
o.x
andT1.y
, we give them new symStores (because we aren't sure what symStore will those syms end up with at the end of the loop). Symbol-to-value map entries look something like this:The PRE code that runs after the loop prepass, then uses these symStores to preload the PRE candidate property syms (
o.x
andT1.y
in the above example). Assuming we could solve the previous problem by insertingT1 = o.x
in the landing pad, landing pad would look something like this:But, now when we optimize the landing pad,
T1.y
gets object-pointer copy-propped toT3.y
. So, now we haveT3.y
live coming in to the loop, butT1.y
live on the back-edge. Since, these are two different property syms, they don’t produce a value for any.y
during merging info on the loop header. This defeats the purpose of putting an initial value forT1.y
in the landing pad.Solutions:
For making
T1
live in the landing pad, I rely on it being single-def and go with the premise that if its single def, I should be able to insert its defining instruction in the landing pad, as long as the rhs of the instruction is live in the landing pad; if its not, I try to make the rhs live by either:a. Processing it as a PRE candidate (if it was one), or
b. recursing on the same logic as above for making it live.
Since,
T3
is the symstore foro.x
on the back-edge, and we insertT3 = o.x
in the landing pad,T3
will eventually object-pointer copy-prop intoT2 = T1.y
(making itT2 = T3.y
) during the main pass of the loop. To then field copy propT3.y
, we needT3.y
to have a value on the loop header and to have that, we needT3.y
live on the back edge.And that is exactly what I do - when preloading
T1.y
in the landing pad, I recognize thatT3
is going to be the copy-prop sym forT1
and I makeT3.y
live in the landing pad and all the back edges.Perf
Follow up work will target catching multi-level field loads that have LdEnv in the load sequence, and loads of and stores to the same field in the same loop