Bmi2.MultiplyNoFlags issues #11782

pentp · 2019-01-08T22:36:49Z

I have a working version of decimal.DecCalc which uses MultiplyNoFlags from dotnet/coreclr#21480 in a branch, but I discovered two issues.

~~If MultiplyNoFlags is called without having its result used, then it's assumed to be a no-op, even if the low part is used. While such use would be sub-optimal, it should still be valid~~ (fixed in dotnet/coreclr#21928).

The second problem is that performance is increased only up to 3% for some methods, while others suffer a performance penalty up to 20%! This is primarily caused by forcing the low result to be written to memory and excessive temporary register use, compounded by forced zero-init of the locals (even with no .locals init) which affects all code paths of the function.

static unsafe ulong mulx(ulong a, ulong b)
{
	ulong r;
	return X86.Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}

; Assembly listing for method DecCalc:mulx(long,long):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->  rcx        
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  [rsp+0x18]  
;  V02 loc0         [V02    ] (  2,  2   )    long  ->  [rsp+0x00]   do-not-enreg[X] must-init addr-exposed ld-addr-op
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 8

G_M42317_IG01:
       push     rax
       xor      rax, rax
       mov      qword ptr [rsp], rax
       mov      qword ptr [rsp+18H], rdx

G_M42317_IG02:
       lea      rax, bword ptr [rsp]
       mov      rdx, rcx
       mulx     rdx, r8, qword ptr [rsp+18H]
       mov      qword ptr [rax], r8
       mov      rax, rdx
       add      rax, qword ptr [rsp]

G_M42317_IG03:
       add      rsp, 8
       ret      

; Total bytes of code 41, prolog size 7 for method DecCalc:mulx(long,long):long

While ideally this should be just:

       mulx     rax, rcx, rcx
       add      rax, rcx
       ret

category:cq
theme:vector-codegen
skill-level:expert
cost:medium
impact:small

The text was updated successfully, but these errors were encountered:

pentp · 2019-01-08T22:49:05Z

Note that temp register use depends on the order of parameters (with respect to the implied EDX/RDX parameter), so reversing a and b in the sample improves the codegen somewhat (this might be relatively easy to fix). Getting rid of the lea for locals could probably be done at emit time even. And must-init can be probably turned off also.

The hard part is getting MultiplyNoFlags working without forcing a memory store if a local variable is used for the low result.
@fiigii @mikedn, how hard would it be to improve MULX codegen with the constraint that the low part is stored to a local variable?

fiigii · 2019-01-09T10:11:42Z

If MultiplyNoFlags is called without having its result used, then it's assumed to be a no-op

I will fix the first problem tomorrow.

how hard would it be to improve MULX codegen with the constraint that the low part is stored to a local variable?

As I know, RyuJIT only builds SSA on normal local vars and does not have the general memory analysis mechanism (e.g., Memory SSA), so eliminating this kind of redundant memory store/loads looks a big job. But I may be wrong.
@CarolEidt @AndyAyersMS Do you have suggestions?

Originally, I suggested to expose this intrinsic as

ValueTuple<ulong, ulong> MultiplyNoFlags(ulong left, ulong right);

This API could be optimized via struct promotion that won't have memory access on most cases. However, I was noticed that ValueTuple is not good in public APIs...

mikedn · 2019-01-09T12:14:47Z

As I know, RyuJIT only builds SSA on normal local vars and does not have the general memory analysis mechanism (e.g., Memory SSA), so eliminating this kind of redundant memory store/loads looks a big job.

It has less to do with SSA and more to do with the ability to represent this in the IR.

This API could be optimized via struct promotion that won't have memory access on most cases.

Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including long return calls on 32 bit) also suffer from a similar problem.

For example:

       FF153C7B1305 call     [C:GetLong():long]
       8945F0       mov      dword ptr [ebp-10H], eax
       8955F4       mov      dword ptr [ebp-0CH], edx
       8B75F0       mov      esi, dword ptr [ebp-10H]
       8B7DF4       mov      edi, dword ptr [ebp-0CH]
       FF153C7B1305 call     [C:GetLong():long]
       8945E8       mov      dword ptr [ebp-18H], eax
       8955EC       mov      dword ptr [ebp-14H], edx
       8BC6         mov      eax, esi
       0345E8       add      eax, dword ptr [ebp-18H]
       8BD7         mov      edx, edi
       1355EC       adc      edx, dword ptr [ebp-14H]

which should be something like

       call     [C:GetLong():long]
       mov      esi, eax
       mov      edi, edx
       call     [C:GetLong():long]
       add      eax, esi
       adc      edx, edi

Not to say that this cannot be fixed. But it's not that simple.

And yes, returning a pair of values might make more sense than using an "out" parameter.

pentp · 2019-01-09T15:47:08Z

The public API is probably not holding us back - it could just forward to a private ValueTuple version if it helped or this transformation could be achieved in JIT import. Importation could ensure that the result is always stored to a (temp) local variable - this might simplify/improve the codegen somewhat. But fundamentally the problem is modeling a two-value IR node (applies to all three variants).

fiigii · 2019-01-09T20:03:45Z

Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including long return calls on 32 bit) also suffer from a similar problem.

@mikedn Ah, thanks for the info. Are there issues to track this problem?

pentp · 2019-01-09T21:41:14Z

What if the importer transformed local variable use cases from

ldloca/ldarga op3
conv.u
call MultiplyNoFlags(op1, op2, op3)

into

call MultiplyNoFlags(op1, op2)
ldmagic /*load low part*/
stloc/starg op3

and all other cases into

stloc addr_tmp
call MultiplyNoFlags(op1, op2)
ldloc addr_tmp
ldmagic /*load low part*/
stind

where ldmagic would be some new IR node type that is tied to the second return value of mulx? And it could be stitched together using a GT_COMMA maybe?

mikedn · 2019-01-09T23:20:26Z

Ah, thanks for the info. Are there issues to track this problem?

None that I know of.

where ldmagic would be some new IR node type that is tied to the second return value of mulx

There's no such thing as "tied to the second return value" of a node in IR. That would imply that a node has multiple uses (it cannot) and that a use can specify which value is used (it cannot).

Something like this might work in lowering, where the "magic" node would act like second return value. But by the time you reach lowering the damage is already done.

pentp · 2019-01-10T00:44:10Z

Flags from and are a kind of a second return value? Maybe something similar would work? Or some kind of an implied return value that appears out of thin air at first, but is later tied to the preceding mulx in LSRA/emit?
Most of the intermediate phases are not very useful on two-value MultiplyNoFlags anyway (value num, CSE, assertions), the primary requirement would be that this "magic" load is kept right after mulx throughout the process.

mikedn · 2019-01-10T06:09:09Z

Flags from and are a kind of a second return value? Maybe something similar would work?

Yes, that's what I was thinking of when I said that something like this might work in lowering. Though it's a bit different as LSRA does not model the flag register but of course, it does model integer registers. Besides, the flag stuff is a bit on the hackish side, it gets the job done but it's not quite the right thing to do.

Most of the intermediate phases are not very useful on two-value MultiplyNoFlags anyway (value num, CSE, assertions), the primary requirement would be that this "magic" load is kept right after mulx throughout the process.

Yes, but in the case of the current API it's kind of difficult to deal with the address taken variable.

A version that returns a struct, hmm, I think that's really the best option and the only one that doesn't look too hackish. But as I told @fiigii before, there's a problem with struct promotion. I should check up what's up wit that, perhaps we can make it so that certain cases of dependent struct promotion don't force the struct in memory. As I already mentioned, that would help other cases as well.

tannergooding · 2019-01-10T16:16:50Z

A version that returns a struct, hmm, I think that's really the best option and the only one that doesn't look too hackish

I think, if we did this, it would either need to be an internal method or would need to use a strongly named type (like Int128/UInt128). The guidance so far has been don't use ValueTuple in the public surface area (and the only time we tried, which was fairly recently, it ended up getting backed out). NOTE: Exposing a strongly named type (like Int128/UInt128) would probably not get reviewed/approved/implemented in time for the initial release (as there are a lot of other considerations here as well).

It might also be nice to still find a solution for the address taken scenario; since we have some existing methods (like Math.DivRem) which we can't change the public surface area for but which will require similar fixes; and since we would still want good codegen for the address taken scenario...

pentp · 2019-01-10T20:32:14Z

The address taken scenario can always be transformed into a stloc addr; call mulx; ldmagic; stloc tmp; ldloc addr; ldloc tmp; stind so that the problem is again reduced to this call mulx; ldmagic pair. But this variant is actually pretty optimal already (low address points to some non-local memory location).

If I'm not mistaken, then after struct promotion ValueTuple would end up with the same problem of two return values and two (promoted) variables.

The current API shape might actually be a blessing because it allows very easy detection of the case where the out parameter points to a local.

mikedn · 2019-01-10T21:34:40Z

The address taken scenario can always be transformed into a stloc addr; call mulx; ldmagic; stloc tmp; ldloc addr; ldloc tmp; stind so that the problem is again reduced to this call mulx; ldmagic pair

Yes, introducing a temp could help but there are certain problems that will have to be solved for this to actually work. The idea would be to introduce this temp during import such that

ASG high, MULX op1, op2, op3
; becomes
MULX op1, op2, ADDR temp
ASG IND(op3), temp
; if op3 represents the address of a lcl then this reduces to
ASG low, temp

This prevents the original local variable from become address taken. Instead, temp becomes address taken, a good thing because it should prevent the optimizer from doing anything that will then prevent lowering from recognizing this pattern. In particular, no other uses of temp should ever appear.

Then lowering could probably transform this into something like:

STORE_LCL_VAR high, MULX …
STORE_LCL_VAR low, MAGIC

So far so good. The problem is dealing with register allocation for such a thing. Couple of observations for now:

MULX will likely need to be a multi-reg node, one way or another you need to allocate 2 destination registers for it.
MAGIC might need to have MULX as "hidden" operand. "Hidden" in the sense that it has a pointer to the MULX node but it doesn't consider it an operand. That would violate the requirement that an IR node must have a single use. Question is if LSRA would tolerate that.
Care needs to be taken to avoid an extra reg-reg move - low and MULX's corresponding low destination register should be the same whenever possible.

So I don't know, it might be doable. Some additional investigation and perhaps experimentation will be needed.

On top of that, I'm not sure if the JIT team will be happy with that magic node.

mikedn · 2019-01-10T21:39:01Z

The current API shape might actually be a blessing because it allows very easy detection of the case where the out parameter points to a local.

But "easy detection of the case where the out parameter points to a local" isn't that much of a problem. There are already places where the JIT detects such cases and eliminates them, provided that it has something to replace the tree with. For example, IND(ADDR(LCL_VAR)) is usually transformed into LCL_VAR and the local variable is not considered address taken.

But in the case of MULX the problem is that you don't have something reasonable to transform to.

mikedn · 2019-01-10T21:47:48Z

If I'm not mistaken, then after struct promotion ValueTuple would end up with the same problem of two return values and two (promoted) variables.

Sort of. The problem with relying on struct promotion is that you have to do something like:

ASG LCL_VAR pair, MULX ; pair is a struct var having 2 int fields
; later you can use the individual fields
… LCL_VAR pair.low
… LCL_VAR pair.high

It's pretty natural. The problem with this is that if you promote then you shouldn't use the entire struct variable anymore, you should only use its individual fields. And if you do access the entire struct variable then it cannot be enregistered and it ends up in memory, what we were trying to avoid in the first place.

So the problem here is to investigate if it's possible to change this. This probably will end up in the register allocation territory as well. Registers must be allocated to the individual fields as usual but then we need to be able to store to both registers using the same STORE_LCL_VAR node.

Again, it might be doable. But it needs investigation.

mikedn · 2019-01-10T21:58:12Z

And I suspect that either way need MULX to be a multi-reg node. That means that we'll need to create a special oper for it and use GenTreeMultiRegOp instead of the current HW intrinsic oper and class.

CarolEidt · 2019-01-11T01:44:39Z

Unlikely, you'll end up with dependent struct promotion and thus in memory. Multireg return calls (including long return calls on 32 bit) also suffer from a similar problem.

Are there issues to track this problem?

There are actual multiple issues: #5024, #6316, #6318, #9839, #5112 and #8571 (the last two are probably the most relevant to this specific case). While I continue to make a little progress on the struct issues, it's a process of many tiny steps.

If we go the route of introducing a temp during import, I would think a better way to transform it would be, earlier in the JIT (morph? optimizer?) to transform it to an appropriate multi-reg op, as @mikedn hints at above. The "original" form would continue to be a memory store, but the alternate form would write two registers as an internal two-field struct. It might take some work to eliminate the copies that may be induced (especially the memory copies), but I think that's the only reasonable path to take.

On top of that, I'm not sure if the JIT team will be happy with that magic node.

I'd really like to avoid such a thing!

mikedn · 2019-01-11T07:01:25Z

I'd really like to avoid such a thing!

Thought so 😄. Even the CMP/JCC case is a bit ugly in this regard, but at least that one doesn't interfere with register allocation.

There are actual multiple issues: #5024, #6316, #6318, #9839, #5112 and #8571 (the last two are probably the most relevant to this specific case). While I continue to make a little progress on the struct issues, it's a process of many tiny steps.

Thanks for the list of issues, I'll take a look. Anyway I'm looking a struct issues these days due to my attempt at getting rid of GT_ASG.

I would think a better way to transform it would be, earlier in the JIT (morph? optimizer?)

Local address taken stuff, probably this should be done in LocalAddressVisitor. But this is a minor issue.

fiigii · 2019-01-15T19:38:08Z

@pentp The first problem has been solved by dotnet/coreclr#21928, could you please update this issue?

CarolEidt · 2020-06-03T14:52:43Z

#34822 added support for intrinsics that define two registers, and once #36862 is merged we'll be able to store that multi-reg result to an enregistered lclVar.

I have a WIP branch, https://github.com/CarolEidt/runtime/tree/Mulx2, that builds on #36862 and adds a mulx intrinsic that returns a ValueTuple, and can enregister both fields of that result.

pentp · 2020-06-07T22:37:39Z

Should I create an API proposal for the new ValueTuple returning MULX to expedite the process?
Maybe we could name the new method just Bmi2.Multiply to solve the naming conflict (instead of MultiplyNoFlags2)? There isn't any functional difference if flags are affected or not (they are inaccessible anyway).

CarolEidt · 2020-06-08T15:01:16Z

Should I create an API proposal for the new ValueTuple returning MULX to expedite the process?
Maybe we could name the new method just Bmi2.Multiply to solve the naming conflict (instead of MultiplyNoFlags2)? There isn't any functional difference if flags are affected or not (they are inaccessible anyway).

I was planning to leave that decision to the folks with API design expertise. The other alternative would be not to expose it, and leave it as an internal method that the other method calls. One could theoretically make the same kind of transformation in the JIT, but it doesn't currently have the capability to create something like a ValueTuple on its own.

@tannergooding - perhaps this would be a good general design discussion to have, as I believe there are other intrinsics that define multiple registers.

tannergooding · 2020-06-08T15:38:24Z

This was brought up minimally for an unrelated API last week and the general consensus was that, especially with the HWIntrinsics, returning value tuples would be fine.
API proposals would be needed and they would need to go through review still

If one was put up, it would be nice to cover the "core" APIs, that is not just Bmi2.MultiplyNoFlags, but also Math.DivRem and Math.BigMul for example.

The conflict given we have a variant of MultiplyNoFlags that just returns the high bits is a bit unfortunate and would likely require some more thought.
Math.BigMul is in a semi-related boat but doesn't quite have a conflict: it has long BigMul(int, int) and ulong BigMul(ulong, ulong, out ulong) and long BigMul(long, long, out long)

pentp · 2020-06-08T19:27:09Z

Yes, naming might best be left for API designers. I'm just thinking of options.
But there is definitely value in exposing the API because currently it forces users into unsafe code because of the ulong* parameter.

Fix dotnet#11782

benaadams · 2021-01-20T04:11:51Z

Has an api review been opened for the ValueTuple return for BigMul?

Or will this be resolved by MultiplyNoFlags returning a ValueTuple and the BigMul api can remain the same with the Jit resolving it at inline time since its address taken in C# rather than address taken for intrinsic?

/cc @lemire

benaadams · 2021-01-20T04:35:51Z

Ah looks like it will be resolved by #37928 and then changing the internal C# implementation without need for another api? #42156 (comment)

tevador · 2024-06-17T21:57:34Z

It's 2024 and dotnet 8.0 still suffers from this issue. MultiplyNoFlags has terrible codegen for the low half of the result. In fact, it's much faster to only use the high half returned by MultiplyNoFlags and recalculate the low half with a separate multiplication.

Benchmark sample:

public class MulxTests
{
    [Params(0x0123456789abcdef)]
    public ulong TestA { get; set; }

    [Params(0xdeadbeefdeadbeef)]
    public ulong TestB { get; set; }

    [Benchmark]
    public ulong BenchMulx1()
    {
        ulong accLo = TestA;
        ulong accHi = TestB;
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc1(accLo, accHi, ref accHi, ref accLo);
        return accLo + accHi;
    }

    [Benchmark]
    public ulong BenchMulx2()
    {
        ulong accLo = TestA;
        ulong accHi = TestB;
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        BigMulAcc2(accLo, accHi, ref accHi, ref accLo);
        return accLo + accHi;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void BigMulAcc1(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
    {
        ulong lo;
        ulong hi = Bmi2.X64.MultiplyNoFlags(a, b, &lo);
        accHi += hi;
        accLo += lo;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void BigMulAcc2(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
    {
        ulong hi = Bmi2.X64.MultiplyNoFlags(a, b);
        ulong lo = a * b;
        accHi += hi;
        accLo += lo;
    }
}

This benchmark looks artificial, but it's in fact fairly close to actual code used when implementing high-speed elliptic-curve based cryptography.

Results on my machine:

| Method     | TestA             | TestB                | Mean     | Error     | StdDev    | Code Size |
|----------- |------------------ |--------------------- |---------:|----------:|----------:|----------:|
| BenchMulx1 | 81985529216486895 | 16045690984833335023 | 2.521 ns | 0.0075 ns | 0.0070 ns |     129 B |
| BenchMulx2 | 81985529216486895 | 16045690984833335023 | 1.349 ns | 0.0026 ns | 0.0024 ns |      84 B |

So throwing away the low half of the result of mulx and recalculating it with a second multiplication is currently almost twice faster.

The reason why BenchMulx1 is much slower is that the compiler always forces the low half to be stored in memory. If there are multiple calls to MultiplyNoFlags within a method, each will use a different stack location to store its low result, which causes excessive stack usage. For example, with 25 inlined calls to MultiplyNoFlags, the compiler will allocate 200 bytes just for the temporaries!

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

pentp mentioned this issue Mar 13, 2020

Improve Bmi2.MultiplyNoFlags codegen #1224

Closed

CarolEidt mentioned this issue Mar 26, 2020

JIT: Support assignments of multi-reg values to enregistered promoted structs #34105

Closed

EgorBo mentioned this issue May 7, 2020

BigMul with 64-bit operands #35975

Merged

CarolEidt mentioned this issue Jun 15, 2020

Support mulx returning ValueTuple #37928

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

CarolEidt added a commit to CarolEidt/runtime that referenced this issue Nov 17, 2020

Support mulx returning ValueTuple

66d337d

Fix dotnet#11782

echesakov pushed a commit to echesakov/runtime that referenced this issue Nov 19, 2020

Support mulx returning ValueTuple

cbd01ca

Fix dotnet#11782

pentp mentioned this issue Nov 9, 2021

Support MULX returning ValueTuple #61336

Closed

ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 9, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 10, 2021

echesakov mentioned this issue Feb 14, 2022

Enable multi-register intrinsics support for Arm64 #64921

Closed

13 tasks

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022

pentp mentioned this issue Nov 6, 2023

Use BigMul for 32x32=64 in decimal #93345

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bmi2.MultiplyNoFlags issues #11782

Bmi2.MultiplyNoFlags issues #11782

pentp commented Jan 8, 2019 •

edited by BruceForstall

Loading

pentp commented Jan 8, 2019

fiigii commented Jan 9, 2019

mikedn commented Jan 9, 2019

pentp commented Jan 9, 2019 •

edited

Loading

fiigii commented Jan 9, 2019

pentp commented Jan 9, 2019 •

edited

Loading

mikedn commented Jan 9, 2019

pentp commented Jan 10, 2019 •

edited

Loading

mikedn commented Jan 10, 2019

tannergooding commented Jan 10, 2019

pentp commented Jan 10, 2019 •

edited

Loading

mikedn commented Jan 10, 2019

mikedn commented Jan 10, 2019

mikedn commented Jan 10, 2019

mikedn commented Jan 10, 2019

CarolEidt commented Jan 11, 2019

mikedn commented Jan 11, 2019

fiigii commented Jan 15, 2019

CarolEidt commented Jun 3, 2020

pentp commented Jun 7, 2020 •

edited

Loading

CarolEidt commented Jun 8, 2020

tannergooding commented Jun 8, 2020

pentp commented Jun 8, 2020

benaadams commented Jan 20, 2021

benaadams commented Jan 20, 2021

tevador commented Jun 17, 2024

Bmi2.MultiplyNoFlags issues #11782

Bmi2.MultiplyNoFlags issues #11782

Comments

pentp commented Jan 8, 2019 • edited by BruceForstall Loading

pentp commented Jan 8, 2019

fiigii commented Jan 9, 2019

mikedn commented Jan 9, 2019

pentp commented Jan 9, 2019 • edited Loading

fiigii commented Jan 9, 2019

pentp commented Jan 9, 2019 • edited Loading

mikedn commented Jan 9, 2019

pentp commented Jan 10, 2019 • edited Loading

mikedn commented Jan 10, 2019

tannergooding commented Jan 10, 2019

pentp commented Jan 10, 2019 • edited Loading

mikedn commented Jan 10, 2019

mikedn commented Jan 10, 2019

mikedn commented Jan 10, 2019

mikedn commented Jan 10, 2019

CarolEidt commented Jan 11, 2019

mikedn commented Jan 11, 2019

fiigii commented Jan 15, 2019

CarolEidt commented Jun 3, 2020

pentp commented Jun 7, 2020 • edited Loading

CarolEidt commented Jun 8, 2020

tannergooding commented Jun 8, 2020

pentp commented Jun 8, 2020

benaadams commented Jan 20, 2021

benaadams commented Jan 20, 2021

tevador commented Jun 17, 2024

pentp commented Jan 8, 2019 •

edited by BruceForstall

Loading

pentp commented Jan 9, 2019 •

edited

Loading

pentp commented Jan 9, 2019 •

edited

Loading

pentp commented Jan 10, 2019 •

edited

Loading

pentp commented Jan 10, 2019 •

edited

Loading

pentp commented Jun 7, 2020 •

edited

Loading