-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose AVX512F embedded rounding intrinsics. #97415
Conversation
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsFollow-up of #94684, still WIP, no need to review at the moment.
|
dfee205
to
5c44b41
Compare
Diff results for #97415Throughput diffsThroughput diffs for linux/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Details here |
Diff results for #97415Throughput diffsThroughput diffs for linux/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Details here |
Failures look unrelated, PR is ready for review. |
cc54966
to
5f0c7eb
Compare
Diff results for #97415Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Throughput diffs for linux/x64 ran on windows/x64MinOpts (0.00% to +0.01%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64MinOpts (0.00% to +0.01%)
Details here |
Diff results for #97415Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Throughput diffs for linux/x64 ran on windows/x64MinOpts (0.00% to +0.01%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64MinOpts (0.00% to +0.01%)
Details here |
Diff results for #97415Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for linux/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Details here |
Applied the format patch. All failures are known. |
Diff results for #97415Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for linux/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Details here |
Updated the branch to pick up some CI fixes that have gone in. |
Looks like there are a few new updates, I will rebase my local branch and resolve the build error. I think I forgot to mention there were a few APIs that the underlying intrinisics only support For example, I provided one in the commit: /// __m128d _mm_cvt_roundss_sd (__m128d a, __m128 b, int sae)
/// VCVTSS2SD xmm1, xmm2, xmm3 {sae}
public static Vector128<double> ConvertScalarToVector128Double(Vector128<double> upper, Vector128<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertScalarToVector128Double(upper, value, mode); There are some more I haven't exposed, because the |
1. remove some redundent checks on embedded rounding intrinsics
2. Added FMA intrinsics with embedded rounding and unit tests.
fa83b08
to
fd5a2d5
Compare
Diff results for #97415Assembly diffsAssembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts). MISSED contexts: 1 (0.00%) Overall (+8 bytes)
FullOpts (+8 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,804,171 contexts (1,155,877 MinOpts, 1,648,294 FullOpts). MISSED contexts: 3,198 (0.11%) Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,599,926 contexts (1,005,474 MinOpts, 1,594,452 FullOpts). Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Throughput diffs for osx/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Details here |
Diff results for #97415Assembly diffsAssembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts). MISSED contexts: 1 (0.00%) Overall (+8 bytes)
FullOpts (+8 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,804,171 contexts (1,155,877 MinOpts, 1,648,294 FullOpts). MISSED contexts: 3,198 (0.11%) Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,599,926 contexts (1,005,474 MinOpts, 1,594,452 FullOpts). Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Throughput diffs for osx/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Details here |
Failures look unrelated, will apply the format patch, ready to continue the review. |
Diff results for #97415Assembly diffsAssembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts). MISSED contexts: 1 (0.00%) Overall (+8 bytes)
FullOpts (+8 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,821,026 contexts (1,163,479 MinOpts, 1,657,547 FullOpts). Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.01%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Details here |
Diff results for #97415Assembly diffsAssembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts). MISSED contexts: 1 (0.00%) Overall (+8 bytes)
FullOpts (+8 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,821,026 contexts (1,163,479 MinOpts, 1,657,547 FullOpts). Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,599,259 contexts (1,005,474 MinOpts, 1,593,785 FullOpts). MISSED contexts: 667 (0.03%) Overall (+8 bytes)
FullOpts (+8 bytes)
Details here |
// For FMA intrinsics, since it is not possible to get any contained operand in this case: embedded rounding | ||
// is limited in register-to-register form, and the control byte is dynamic, we don't need to do any swap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure I follow this.
I'd have expected we still want to do instruction selection based on the target register to avoid having to insert the additional movaps
. This ensures a last use parameter can still be choosen, regardless of whether its op1
, op2
, or op3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this being handled in a follow up PR, so we can get this merged.
But I do think we need to ensure its handled so we get optimal codegen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may actually be sufficient to just call genFMAIntrinsic
here instead of genHWIntrinsic_R_R_R_RM
(or rather to split out the core of it so we don't call genProduceReg
too many times).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might misunderstand the original emit path for FMA, let me check if this would be a quick fix, thanks for pointing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same general thing might apply to the other paths, where it'd be "more ideal" if we could call genAvxFamily
for things like NI_AVX512F_ConvertToVector256Int32
, to ensure we aren't missing handling.
The only reason we "can't" today is because they force the genProduceReg
call, but that could be extracted or conditioned in a way to make that safe (maybe just if (instOptions == NONE) { genProduceReg(..); }
or similar).
Still fine with it being handled in a separate PR, as what's here isn't incorrect, just potentially less optimal long term
src/coreclr/jit/lsraxarch.cpp
Outdated
@@ -2506,6 +2509,11 @@ int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree, int* pDstCou | |||
srcCount += BuildDelayFreeUses(emitOp2, emitOp1); | |||
srcCount += emitOp3->isContained() ? BuildOperandUses(emitOp3) : BuildDelayFreeUses(emitOp3, emitOp1); | |||
|
|||
if (intrinsicTree->OperIsEmbRoundingEnabled() && !intrinsicTree->Op(4)->IsCnsIntOrI()) | |||
{ | |||
srcCount += BuildDelayFreeUses(intrinsicTree->Op(4), emitOp1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need BuildDelayFreeUses
here. op(4)
is an integer and so will go in a GPR. op1
is a vector and so will go in a SIMD register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have limited understand within LSRA domain, it would be much appreciated if you could point me to the reasonable function to use....
Like BuildOperandUse
might be the correct one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case you should just need BuildOperandUses(intrinsicTree->Op(4))
The general premise is:
BuildOperandUses
-- safe to use when there aren't any restrictionsBuildDelayFreeUses
-- safe to use when you need to restrict the register set, typically used for RMW nodes
In this case, FMA
is an RMW
node. However, the register sets used by op4
(simd) and op1
(general-purpose) don't overlap, so we don't need to concern ourselves with op4
being delay free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs secondary sign-off from someone from @dotnet/jit-contrib
@Ruihan-Yin The top comment here says "Follow-up of #94684, still WIP, no need to review at the moment.". Can you please update this to describe the contents of the PR? |
Done, thanks for the notice. |
Diff results for #97415Assembly diffsAssembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,531,978 contexts (984,938 MinOpts, 1,547,040 FullOpts). MISSED contexts: 1 (0.00%) Overall (+8 bytes)
FullOpts (+8 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,821,026 contexts (1,163,479 MinOpts, 1,657,547 FullOpts). Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,599,259 contexts (1,005,474 MinOpts, 1,593,785 FullOpts). MISSED contexts: 667 (0.03%) Overall (+8 bytes)
FullOpts (+8 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (+0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.00% to +0.01%)
MinOpts (-0.00% to +0.01%)
FullOpts (+0.00% to +0.01%)
Details here |
Thanks for all the reviews and help! |
Follow-up of #94684.
Updates on 02/14/24:
This PR exposes embedded rounding enabled APIs in AVX512:
The followings are from #73604
The followings are suggested by the reviewer: from AVX512 surfaces tracked by other issues.