Fix JIT_LMul optimization #110467
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
src/coreclr/vm/jithelpers.cpp (Outdated)

```diff
@@ -120,7 +115,7 @@ HCIMPL2_VV(INT64, JIT_LMul, INT64 val1, INT64 val2)
     UINT32 val2High = Hi32Bits(val2);
 
     if ((val1High == 0) && (val2High == 0))
-        return Mul32x32To64(val1, val2);
+        return (UINT64)((UINT32)val1 * (UINT32)val2);
```
Does this have the same semantics? 32-bit multiplication will presumably truncate the upper bits of the result.
It's addressing the issue where Clang doesn't distinguish between the original and a regular multiplication on 32-bit platforms, because of a redundant cast (uint64 -> uint32 -> uint64 on each operand separately). Now each operand is cast to uint32, and the overall result is cast to uint64 (not strictly needed, just being explicit).
The code does not compute the same value as it did before. Casting a 32-bit result to 64 bits always leaves the upper 32 bits zero, but the result of a 32x32-bit multiplication can be >= 2^32.

From Godbolt I do not see why the new version would be more efficient than the old version. They both do umull + two "multiply-add" operations. If Clang did the branch it would be able to do just a single umull in the "upper bits zero" case, but it seems Clang has determined that the branchy version is not better than the version that does the superfluous multiply-add operations.

On x86 it's more important, as a "32x32 -> 64" multiplication can be done with a single instruction, while the "64x64 -> 64" version falls back to a helper call, at least for MSVC.
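To make the truncation point concrete, here is a small standalone sketch (not the runtime code): Mul32x32To64 is modeled as a 32x32 -> 64 widening multiply, which is how the fast path uses it, and the input values are chosen purely for illustration.

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Both inputs have their high 32 bits clear, so they would take the fast path.
    uint64_t val1 = 0x80000000u;
    uint64_t val2 = 2;

    // Old expression, modeled as a widening 32x32 -> 64 multiply.
    uint64_t widened = (uint64_t)(uint32_t)val1 * (uint64_t)(uint32_t)val2;

    // New expression: the multiply happens in 32 bits, then the (already
    // truncated) 32-bit product is zero-extended to 64 bits.
    uint64_t truncated = (uint64_t)((uint32_t)val1 * (uint32_t)val2);

    printf("widened   = 0x%llx\n", (unsigned long long)widened);   // 0x100000000
    printf("truncated = 0x%llx\n", (unsigned long long)truncated); // 0x0
    return 0;
}
```

Even though both inputs have their high 32 bits clear, the 32-bit product overflows, so the truncating form loses the carry into the upper half.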
This code only runs on Linux for arm and x86. Do you mean we should update this function to Regular_LMul from Godbolt, since there is no difference? This optimization is the same as the one done in the handwritten assembly for win-x86 (runtime/src/coreclr/vm/i386/jithelp.asm, line 591 at 8fca0a1, PUBLIC JIT_LMul).
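For readers without the Godbolt link open, here is a hedged sketch of the two shapes being compared. The name Regular_LMul comes from the discussion above; Branchy_LMul and the exact operand handling are illustrative, not the runtime's code.

```cpp
#include <cstdint>

// Plain 64x64 -> 64 multiply: what the discussion calls the "regular" version.
extern "C" int64_t Regular_LMul(int64_t val1, int64_t val2)
{
    return val1 * val2;
}

// Branchy fast path: when both high halves are zero, a single
// 32x32 -> 64 widening multiply is sufficient.
extern "C" int64_t Branchy_LMul(int64_t val1, int64_t val2)
{
    uint32_t val1High = (uint32_t)((uint64_t)val1 >> 32);
    uint32_t val2High = (uint32_t)((uint64_t)val2 >> 32);

    if ((val1High == 0) && (val2High == 0))
        return (int64_t)((uint64_t)(uint32_t)val1 * (uint64_t)(uint32_t)val2);

    return val1 * val2;
}
```

On arm32 the hope is that the branch lets Clang emit a single umull on the fast path; the observation in this thread is that Clang currently compiles both shapes to the same umull-plus-multiply-add sequence.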
I think removing the special case is ok given that it has no impact on Clang's codegen. Alternatively, we would want to figure out how to get Clang to emit the branchy version with only a single umull. I am fine with either.
Right now, Clang does not see any difference between JIT_LMul and regular multiplication on arm32 or x86: https://godbolt.org/z/vj96KfnY4. This patch fixes the intended optimization for the case where the high 32 bits are zero.