Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix JIT_LMul optimization #110467

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Fix JIT_LMul optimization #110467

wants to merge 2 commits into from

Conversation

am11
Copy link
Member

@am11 am11 commented Dec 6, 2024

Right now, clang does not see any difference between JIT_LMul and regular multiplication on arm32 or x86: https://godbolt.org/z/vj96KfnY4. This patch fixes the intended optimization when high 32 bits are zero.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 6, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Dec 6, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@@ -120,7 +115,7 @@ HCIMPL2_VV(INT64, JIT_LMul, INT64 val1, INT64 val2)
UINT32 val2High = Hi32Bits(val2);

if ((val1High == 0) && (val2High == 0))
return Mul32x32To64(val1, val2);
return (UINT64)((UINT32)val1 * (UINT32)val2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have the same semantics? 32 bit multiplication will presumably truncate the upper bits of the result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's addressing the issue where Clang doesn't distinguish between the original and regular multiplication on 32-bit platforms due to a redundant cast (uint64->uint32->uint64 on each op separately). Now it's uint32 on each op, and uint64 cast (which is not strictly needed, just being explicit) on the overall result.

Copy link
Member

@jakobbotsch jakobbotsch Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code does not compute the same value as it did before. Casting a 32-bit result to 64-bits will leave the upper 32 bits zero always, but the result of 32x32 bit multiplication can be >= 2^32.

From Godbolt I do not see why the new version would be more efficient than the old version. They both do umull + two "multiply-add" operations. If clang did the branch it would be able to do just a single umull in the "upper bits zero" case, but seems like Clang has determined that the branchy version is not better than the version that does the superfluous multiply-add operations.

On x86 it's more important as "32x32 -> 64" multiplication can be done with a single instruction, while the "64x64 -> 64" version falls back to a helper call, at least for MSVC.

Copy link
Member Author

@am11 am11 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code only runs on linux for arm and x86. Do you mean we should update this function to Regular_LMul from godbolt, since there is no difference? This optimization is same as done in handwritten assembly for win-x86

PUBLIC JIT_LMul

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing the special case is ok given that it has no impact on Clang's codegen. Alternatively, we would want to figure out how to get Clang to emit the branchy version with only a single umull. I am fine with either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants