Fix JIT_LMul optimization #110467
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
src/coreclr/vm/jithelpers.cpp (Outdated)

```diff
@@ -120,7 +115,7 @@ HCIMPL2_VV(INT64, JIT_LMul, INT64 val1, INT64 val2)
     UINT32 val2High = Hi32Bits(val2);
 
     if ((val1High == 0) && (val2High == 0))
-        return Mul32x32To64(val1, val2);
+        return (UINT64)((UINT32)val1 * (UINT32)val2);
```
Does this have the same semantics? 32-bit multiplication will presumably truncate the upper bits of the result.
It's addressing the issue where Clang doesn't distinguish between the original and a regular multiplication on 32-bit platforms, because of a redundant cast (uint64 -> uint32 -> uint64 on each operand separately). Now each operand is cast to uint32, and the overall result is cast to uint64 (not strictly needed, just being explicit).
The code does not compute the same value as it did before. Casting a 32-bit result to 64 bits always leaves the upper 32 bits zero, but the result of a 32x32-bit multiplication can be >= 2^32.

From Godbolt I do not see why the new version would be more efficient than the old version. They both do umull + two "multiply-add" operations. If Clang did the branch it would be able to do just a single umull in the "upper bits zero" case, but it seems Clang has determined that the branchy version is not better than the version that does the superfluous multiply-add operations.

On x86 it's more important, as a "32x32 -> 64" multiplication can be done with a single instruction, while the "64x64 -> 64" version falls back to a helper call, at least for MSVC.
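To make the truncation point concrete, here is a small standalone sketch (not the runtime code): Mul32x32To64 is modeled as a 32x32 -> 64 widening multiply, which is how the fast path uses it, and the input values are chosen purely for illustration.

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Both inputs have their high 32 bits clear, so they would take the fast path.
    uint64_t val1 = 0x80000000u;
    uint64_t val2 = 2;

    // Old expression, modeled as a widening 32x32 -> 64 multiply.
    uint64_t widened = (uint64_t)(uint32_t)val1 * (uint64_t)(uint32_t)val2;

    // New expression: the multiply happens in 32 bits, then the (already
    // truncated) 32-bit product is zero-extended to 64 bits.
    uint64_t truncated = (uint64_t)((uint32_t)val1 * (uint32_t)val2);

    printf("widened   = 0x%llx\n", (unsigned long long)widened);   // 0x100000000
    printf("truncated = 0x%llx\n", (unsigned long long)truncated); // 0x0
    return 0;
}
```

Even though both inputs have their high 32 bits clear, the 32-bit product overflows, so the truncating form loses the carry into the upper half.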
This code only runs on Linux for arm and x86. Do you mean we should update this function to Regular_LMul from Godbolt, since there is no difference? This optimization is the same as the one done in the handwritten assembly for win-x86 (runtime/src/coreclr/vm/i386/jithelp.asm, line 591 at 8fca0a1, PUBLIC JIT_LMul).
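For readers without the Godbolt link open, here is a hedged sketch of the two shapes being compared. The name Regular_LMul comes from the discussion above; Branchy_LMul and the exact operand handling are illustrative, not the runtime's code.

```cpp
#include <cstdint>

// Plain 64x64 -> 64 multiply: what the discussion calls the "regular" version.
extern "C" int64_t Regular_LMul(int64_t val1, int64_t val2)
{
    return val1 * val2;
}

// Branchy fast path: when both high halves are zero, a single
// 32x32 -> 64 widening multiply is sufficient.
extern "C" int64_t Branchy_LMul(int64_t val1, int64_t val2)
{
    uint32_t val1High = (uint32_t)((uint64_t)val1 >> 32);
    uint32_t val2High = (uint32_t)((uint64_t)val2 >> 32);

    if ((val1High == 0) && (val2High == 0))
        return (int64_t)((uint64_t)(uint32_t)val1 * (uint64_t)(uint32_t)val2);

    return val1 * val2;
}
```

On arm32 the hope is that the branch lets Clang emit a single umull on the fast path; the observation in this thread is that Clang currently compiles both shapes to the same umull-plus-multiply-add sequence.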
I think removing the special case is ok given that it has no impact on Clang's codegen. Alternatively, we would want to figure out how to get Clang to emit the branchy version with only a single umull. I am fine with either.
Right now, Clang does not see any difference between JIT_LMul and regular multiplication on arm32 or x86: https://godbolt.org/z/vj96KfnY4. This patch fixes the intended optimization for the case where the high 32 bits are zero.