Improve handling of reused constants in the register allocator #70182
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Today https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/lsra-heuristic-tuning.md documents that it does `FREE` and then `CONST_AVAILABLE`. This ends up not being able to reuse an already enregistered constant if it wasn't "last use".

A simple example is the following, where `C81` and `C82` both represent the `CNS_DBL` of `0.0`:

While these values aren't last use, the last use does happen before the constant is overwritten, and in some cases we can end up with 4-5 registers all being initialized (as is the case for `Matrix4x4.Decompose`) to the same constant value (often `zero`, so emitting `xorps` many times in a row).

It would be beneficial if the register allocator could improve this and keep the constant in one register for such scenarios. This in particular appears to impact `Zero` and `AllBitsSet`, since they are "cheap to compute" and so typically don't undergo CSE.

category:cq
theme:register-allocator
There is an assumption in LSRA that materializing constants is cheap, and hence we don't want to tie a constant to a register. Maybe if we had information about the occurrences of particular constants (zeros, for sure), we could have them assigned a register. I tried experimenting by switching the order of the `FREE` and `CONST_AVAILABLE` heuristics.

windows/x64 coreclr:
This will most likely not happen in .NET 7.
This seems fine to me. There are a few places where we still have what appear to be associated regressions; this sometimes impacts loop alignment, etc. So we'll probably want to take a closer look in 8. But overall, the changes related to this seem fine.
Just adding in a summary that I gave elsewhere...

There are basically two fighting forces at play. On one hand we have constant folding, containment, and other optimizations that can only kick in when we know we have a constant. On the other hand, if we don't know that one of these optimizations is going to kick in, then hoisting these values outside the loop is likely strictly better. However, given that we can't really know for sure what will be emitted until lowering, we're making vague heuristic guesses in HIR instead.

The premise of my proposal is that we should always CSE and always hoist constants into locals. We should then specially track that a given local represents a constant and prevent such locals from being rewritten (either via a new node kind `GT_LCL_CNS` or a flag on `GT_LCL_VAR`/`GT_LCL_FLD`). Constant folding and lowering would then need to be updated to consider both `GT_CNS_*` and this new type of "constant local". This allows constant folding to still occur, it allows lowering to still perform containment, etc. However, it also ensures that if a given architecture or code path won't specially handle the constant, it is still hoisted outside the loop and won't be a repeated per-iteration cost.

A downside is that it may have some minor impact on throughput. LSRA will also need to determine whether all users of a local mark it as contained, so that it can avoid initializing a value and burning a register unnecessarily. Much like with other scenarios, LSRA could understand and treat these constants specially, knowing they are cheaper to rematerialize, and therefore preference them for spilling if register contention is an issue (which should result in no-worse codegen than what we have today).
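To make the shape of that proposal a bit more concrete, here is a minimal, self-contained sketch. It is not the real JIT IR: `NodeKind::ConstantLocal` is an invented stand-in for the proposed `GT_LCL_CNS` (or flagged `GT_LCL_VAR`), and the folding helper only illustrates that folding can keep working when it looks through such a local.

```cpp
// Hypothetical sketch only: none of these types exist in the JIT. The point is
// that a local known to hold a constant ("constant local") still exposes its
// value to folding/containment, while codegen can fall back to the hoisted
// local when nothing consumes it as an immediate.
#include <cstdint>
#include <cstdio>
#include <optional>

enum class NodeKind { Constant, LocalVar, ConstantLocal, Add };

struct Node
{
    NodeKind kind;
    int64_t  cnsValue; // valid for Constant and ConstantLocal
    unsigned lclNum;   // valid for LocalVar and ConstantLocal
    Node*    op1;
    Node*    op2;
};

// A ConstantLocal reveals its value just like a plain constant would.
std::optional<int64_t> GetConstantValue(const Node* node)
{
    if ((node->kind == NodeKind::Constant) || (node->kind == NodeKind::ConstantLocal))
    {
        return node->cnsValue;
    }
    return std::nullopt;
}

// Folding does not care whether the operand was a raw constant or a hoisted
// "constant local"; either way the add can be folded.
std::optional<int64_t> TryFoldAdd(const Node* add)
{
    std::optional<int64_t> v1 = GetConstantValue(add->op1);
    std::optional<int64_t> v2 = GetConstantValue(add->op2);
    if (v1.has_value() && v2.has_value())
    {
        return *v1 + *v2;
    }
    return std::nullopt;
}

int main()
{
    Node cns    = {NodeKind::Constant, 40, 0, nullptr, nullptr};
    Node lclCns = {NodeKind::ConstantLocal, 2, /* lclNum */ 5, nullptr, nullptr};
    Node add    = {NodeKind::Add, 0, 0, &cns, &lclCns};

    if (std::optional<int64_t> folded = TryFoldAdd(&add))
    {
        std::printf("folded to %lld\n", (long long)*folded);
    }
    return 0;
}
```

The interesting part for LSRA would then be whether all uses of such a local ended up contained; if so, the local never needs a register at all.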
Related: #76067
I don't have time to look into this issue in .NET 8. We need to have a "User Story" in a future release that contains all these LSRA issues to get them prioritized.
While working on #87082, I've been debugging the codegen on osx/arm64 to see why we don't reuse the constant for such scenarios:

```asm
movz x0, #0x8C00
movk x0, #817 LSL #16
movk x0, #1 LSL #32
movz x1, #0x8C00
movk x1, #817 LSL #16
movk x1, #1 LSL #32
ldr  x1, [x1]
blr  x1
```

Ideal codegen:

```asm
movz x0, #0x8C00
movk x0, #817 LSL #16
movk x0, #1 LSL #32
ldr  x1, [x0]
blr  x1
```

There are multiple reasons:

1. We seem to track the registers containing constants, but as soon as we see a different interval, we reset the tracked register (runtime/src/coreclr/jit/lsra.cpp, lines 6112 to 6119 in 35a5afb).
2. We try to match the constant represented by an `IConNode` or `VecCons`, but not anything further like `FtnAddress` (which is the case in the above example).
3. If a register is holding a constant, we mark it as "busy". When we try to allocate for another position that contains the exact same constant, we run the `try_FREE()` heuristic first, which eliminates the register holding the constant from the list of possible candidates. By the time we do `try_CONST_AVAILABLE()`, we never see the register that is holding the same constant, and hence we end up assigning a new register (see the toy illustration after this list).

And of course, there are a bunch of asserts that kick in when trying to make this work.
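For illustration only (this is not the actual LSRA selection code, and the bitmask scheme and names are assumptions), here is a toy model of why applying the free-register filter before the constant-available check hides the register that already holds the constant:

```cpp
// Toy model: each "heuristic" narrows the candidate set if the narrowed set is
// non-empty. Because the register that holds our constant is busy, the FREE
// step removes it before the CONST_AVAILABLE step ever gets to prefer it.
#include <cstdint>
#include <cstdio>

using RegMask = uint32_t;

static RegMask lowestBit(RegMask mask)
{
    return (mask == 0) ? 0 : (mask & (0u - mask));
}

static RegMask pickRegister(RegMask candidates, RegMask freeRegs, RegMask constRegs)
{
    // Step 1: "FREE" - prefer registers not assigned to any interval.
    RegMask afterFree = candidates & freeRegs;
    if (afterFree != 0)
    {
        candidates = afterFree; // the busy register holding the constant is gone now
    }

    // Step 2: "CONST_AVAILABLE" - prefer a register that already holds the constant.
    RegMask afterConst = candidates & constRegs;
    if (afterConst != 0)
    {
        candidates = afterConst; // never taken in the scenario below
    }

    return lowestBit(candidates);
}

int main()
{
    const RegMask x0 = 1u << 0, x1 = 1u << 1, x2 = 1u << 2;

    RegMask candidates = x0 | x1 | x2;
    RegMask freeRegs   = x1 | x2; // x0 is busy (still live) ...
    RegMask constRegs  = x0;      // ... but x0 already holds the constant we need.

    RegMask chosen = pickRegister(candidates, freeRegs, constRegs);
    std::printf("chosen mask: 0x%x (reusing x0 would have avoided re-materializing)\n",
                (unsigned)chosen);
    return 0;
}
```

Consulting the constant-holding register before (or independently of) the free filter would let the second step fire in this situation.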
I am actually wondering: why can't we just have a lookup table of N entries, where N is the number of available registers? The key would be the constant and the value would be the register; we store the constant against the register that is holding it. A new heuristic would check whether the constant we are allocating for is present in the lookup and, if yes, just give it the register present in the lookup instead of assigning it a new register. Of course, at block boundaries we should flush this lookup table, so we don't hold entries that are true in one pred and not in another.
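A minimal sketch of that idea, with hypothetical names and no ties to the real allocator's data structures: one entry per register recording which constant (if any) it currently holds, consulted before burning a new register, and flushed at block boundaries.

```cpp
// Hypothetical sketch of a constant -> register cache with N entries
// (N = number of registers). It is consulted by a new heuristic before
// assigning a fresh register for a re-materialized constant, and flushed
// at block boundaries so stale cross-block entries are never trusted.
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>

constexpr int NumRegs = 32;

struct ConstantRegCache
{
    // entry[reg] = constant currently known to live in `reg`, if any
    std::array<std::optional<uint64_t>, NumRegs> entry{};

    void record(int reg, uint64_t cns) { entry[reg] = cns; }
    void invalidate(int reg)           { entry[reg].reset(); }       // reg was overwritten
    void flush()                       { entry.fill(std::nullopt); } // at block boundaries

    // The new heuristic: is some register already holding this constant?
    std::optional<int> lookup(uint64_t cns) const
    {
        for (int reg = 0; reg < NumRegs; reg++)
        {
            if (entry[reg].has_value() && (*entry[reg] == cns))
            {
                return reg;
            }
        }
        return std::nullopt;
    }
};

// Reuse the cached register when possible; otherwise materialize into the
// register the allocator picked and remember it.
static int allocateForConstant(ConstantRegCache& cache, uint64_t cns, int pickedReg)
{
    if (std::optional<int> reg = cache.lookup(cns))
    {
        return *reg;
    }
    cache.record(pickedReg, cns);
    return pickedReg;
}

int main()
{
    ConstantRegCache cache;
    int first  = allocateForConstant(cache, 0x1234, /* pickedReg */ 0); // materialized into reg 0
    int second = allocateForConstant(cache, 0x1234, /* pickedReg */ 1); // reused: reg 0 again
    std::printf("first=%d second=%d\n", first, second);
    return 0;
}
```

The point of doing the lookup up front is that the reuse decision doesn't depend on the constant-holding register surviving the later candidate filtering.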
Another example from @jakobbotsch - #76928 (comment)