-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor how the embedded rounding is handled so the logic is more reusable #97569
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsPreviously the logic was done by forcing some bits to be tracked as flags on the node and then observing them later on in codegen. If the rounding mode was a constant, these flags were set in lowering and otherwise repeatedly as part of the switch case handler. This worked well, however it was taking up precious flag space and wasn't going to be extensible to other types of embedded operations we may also need to support, such as embedded masking. It was also then inconsistent with how embedded broadcast was handled. This updates the logic to instead pass along The new pattern will allow a similar mechanism to be done for embedded masking where codegen can check for
|
47171ba
to
fa808b7
Compare
fa808b7
to
b19a4c5
Compare
CC. @dotnet/jit-contrib for review. Simple cleanup that will significantly simplify adding the embedded masking support and will make that general support easier to add to Arm64 for SVE, since Arm64 already uses a similar pattern for passing around |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
JitStress failure is known, related to the new small type handling for interlocked APIs |
* Expose embedded rounding related scalar intrinsic APIs * Expose embedded rounding related arithmatic intrinsic APIs * Ensure the new APIs are properly lowered * Bug fixes * Expose embedded rounding casting APIs * Expose arithmetic embedded rounding unit tests * Add a test template for embedded rounding APIs, this will be enough to cover all the binary APIs including vector and scalar operations. * Add template for unary ops * Expose all the embedded rounding unit tests generated by the templates * Expose embedded rounding casting APIs unit tests * Expose handwritten unit tests for embedded rounding APIs with special input arg lists. * Bug fixes: 1. ConvertToVector256Int32/UInt32 use special code gen path, adding a fallback path when embedded rounding is activated and the control byte is not constant. * Bug fix: Fix wrong data type in the API definition. * formatting * Update API documents for embedded rounding APIs. * resolve conflicts with #97569 * formatting * bug fix and remove un-needed SAE related intrinsics * resolve comments: 1. update the arg lists for genHWIntrinsic_R_RM * resolve comments: Add jumptable fallback to non-table driven embedded rounding intrinsics. * resolve comments: 1. remove some redundent checks on embedded rounding intrinsics * Bug fix: pass the correct operand GenTree node, when emitting the fallback for embedded rounding intrinsics. * formatting * revert an unexpected change. * 1.Resolve comments: 2. Added FMA intrinsics with embedded rounding and unit tests. * Expose the rest of embedded rounding APIs * formatting * Ensure the control byte local is assigned to the correct register.
Previously the logic was done by forcing some bits to be tracked as flags on the node and then observing them later on in codegen. If the rounding mode was a constant, these flags were set in lowering and otherwise repeatedly as part of the switch case handler.
This worked well, however it was taking up precious flag space and wasn't going to be extensible to other types of embedded operations we may also need to support, such as embedded masking. It was also then inconsistent with how embedded broadcast was handled.
This updates the logic to instead pass along
insOpts
through the necessary codepaths which allows codegen to handle everything itself instead. It does this by doing an up front check for if the node is using embedded rounding and if so extracts the mode and merges the info with the trackedinsOpts
.The new pattern will allow a similar mechanism to be done for embedded masking where codegen can check for
BlendVariableMask
and if it is present, it can extract the contained operation (such asAdd
) and callgenHWIntrinsic
with the updatedinsOpts
, thus allowing us to track theEVEX.aaa
bits without needing to handle much larger nodes, without needing to specialize the handling in all the various code paths, etc.