-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834
Conversation
CC. @CarolEidt, @echesakovMSFT |
|
||
#ifdef FEATURE_HW_INTRINSICS | ||
case GT_HWINTRINSIC: | ||
return true; // allow Hardware Intrinsics to be CSE-ed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any HWIntrinsics we don't want CSE'd or any that need special handling (such as the non-temporal load/stores)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent thought - I would say this should include:
- any of the "special" loads (
LoadFence
, anyNonTemporal
- any store (i.e.
HW_Category_MemoryStore
andStoreFence
) - any prefetch
MemoryFence
Perhaps we should consider adding a flag for this, but a reasonable proxy in the meantime would be to exclude any HW_Category_MemoryStore
, HW_Category_MemoryLoad
and HW_Category_Special
:
- There would be few, if any, loads that the JIT could identify as CSE's (local stack loads would generally not show up as actual load intrinsics).
- The only
HW_Category_Special
that I found that this would unnecessarily exclude isCompareLessThan
One issue that I ran into while working on this feature was that the numArgs for the SIMD and HW_INTRINSICS is not always set to a useful value. Value numbering needs to know the exact number of arguments to expect and some instructions have a "-1" for numArgs or specify 2 when then can accept either 1 or 2. For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments. |
For true binary operations, it would be useful to have a "commutative" column that indicates that the operands can be safely swapped. |
You can use This is likewise hooked up through the standard Its also worth noting that there are some intrinsics, like FusedMultiplyAdd, which are commutative but have 3 operands and require special handling to account for this (the operation is a fused
This would make lookup during imporation quite a bit more expensive. We already have to check the names, I don't think we want to also have to start checking other parameters. There are existing helper methods such as |
The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc.
|
@CarolEidt, do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries, for example)? It would make the table larger but I think we could probably cut some of the current lookup cost and still make this manageable (like having a table per column as we've done elsewhere; or tracking the first/last entry for a given ISA). |
It would be unfortunate, but I don't think it would be too prohibitive to split it out by number of args, but I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull? |
Actually the type size isn't part of the the value number, so there are a couple of nodes where Value Numbering adds an additional arg that holds a constant represent the size of the operation. We do that for GT_CAST and I also had to add this for the SIMD Init operation., As one test was CSE-ing two different sized Vector Inits: 3-way and 4-way. So it awkward but it is something that we can already handle.
|
src/coreclr/src/jit/valuenum.cpp
Outdated
case SIMDIntrinsicWidenLo: | ||
{ | ||
// Also encodes the resulting type | ||
encodeResultType = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also encode the result type for all of the above SIMD operations.
- Zero out the bbAssertionIn values, as these can be referenced in RangeCheck::MergeAssertion and this is shared state with the CSE phase: bbCseIn
Mutate the gloabl heap when performing a HW_INTRINSIC memory store operation Printing of SIMD constants only support 0
…a SIMD LclVar in PerformCSE
Added bool OperIsMemoryLoad() to GenTreeSIMD, returns true for SIMDIntrinsicInitArray Added valuenumfuncs.h to src/coreclr/src/jit/CMakeLists.txt
…have a GT_IND node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks!
I would like to see new tests added here, especially given the rather complex IR surface. Can you also show the impact on FX codegen and/or tests? I am curious how much coverage we're likely to get organically. How robust will this be as people add new intrinsics? |
|
|
The regression in prolog
Loop:
|
src/coreclr/src/jit/optcse.cpp
Outdated
@@ -2461,6 +2461,28 @@ class CSE_Heuristic | |||
extra_yes_cost *= 2; // full cost if we are being Conservative | |||
} | |||
} | |||
|
|||
#ifdef FEATURE_SIMD | |||
// SIMD types will cause a SIMD register to be spilled/restored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "may cause".
I think you should have a single comment that talks about the heuristics for estimating register save/restore/spill costs. First, it is conservative to assume that each SIMD CSE will require an additional register to be saved/restored in prolog/epilog, as there may be disjoint CSE live ranges that can utilize the same register.
Second, this code presumes that the CSEs will be live arcoss a call, which may be true more often than not, but it also assumes that each use will require a restore, though it may be that multiple uses will not span a call.
While it may be fine to be conservative in these ways, I think we should accurately describe those conservative assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
src/coreclr/src/jit/optcse.cpp
Outdated
{ | ||
spillSimdRegInProlog++; // we likely need to spill an extra register to save the upper half | ||
|
||
// Increase the cse_use_cost since at use sites we have to generate code to spill/restore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, "we have to" should be "we may have to"
Here is an example where value number of SIMD types produced a very nice win: runtime\src\coreclr\tests\src\JIT\Regression\JitBlue\GitHub_11816\GitHub_11816.cs:
We create the following CSE's to hold source, u.A, u.B, u.C, u.D, U.E and u.F
Reducing the code from:
To:
PerfScore:
|
We perform a nice CSE for the src\coreclr\tests\src\JIT\SIMD>emacs BitwiseOperations.cs
|
@AndyAyersMS I believe that we have a reasonable amount of test coverage here. I will add a Config variable to allow us to disable this new functionality and revert to the old behavior. |
We also get this CSE now in:
|
Default 0, ValueNumbering of SIMD nodes and HW Intrinsic nodes enabled If 1, then disable ValueNumbering of SIMD nodes If 2, then disable ValueNumbering of HW Intrinsic nodes If 3, disable both SIMD and HW Intrinsic nodes
|
|
|
|
We detect and make a new CSE in: coreclr\tests\src\JIT\Performance\CodeQuality\BenchmarksGame\mandelbrot\mandelbrot-7.cs The expression:
We also make the CSE when we inline the GetByte method into the lamba function :
|
I verified that setting |
No description provided.