This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Optimize Math.Pow(x, c) in JIT #26552

Closed
wants to merge 10 commits into from

EgorBo
Member

@EgorBo EgorBo commented Sep 5, 2019

Fixes https://github.com/dotnet/coreclr/issues/26434

Converts:
Math.Pow(x, 2) to x*x
Math.Pow(x, 1) to x
Math.Pow(x, -1) to 1/x
(same for MathF and floats)

Can be added:
Math.Pow(c1, c2) to c3 (call PAL_pow?)
Math.Pow(1, x) to 1
Math.Pow(2, x) to exp2(x)
Math.Pow(x, 0) to 1
Math.Pow(x, -0) to 1
Math.Pow(x, 0.5) to sqrt(x)
Math.Pow(x, -2) to 1/(x*x) (probably is not safe)
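
The rewrites listed above can be sketched as plain functions to see where they are safe. This is a minimal C++ illustration, not JIT code: the helper names (`pow_2`, `pow_neg1`, `pow_half_via_sqrt`, `agrees_with_pow_half`) are made up for this example, and it assumes the C library follows IEEE 754 / C Annex F special-case rules for `pow` and `sqrt`.

```cpp
#include <cmath>
#include <limits>

// Strength-reduced forms corresponding to the rewrites in the list above.
inline double pow_2(double x)    { return x * x; }   // Math.Pow(x, 2)
inline double pow_1(double x)    { return x; }       // Math.Pow(x, 1)
inline double pow_neg1(double x) { return 1.0 / x; } // Math.Pow(x, -1)

// Candidate rewrite for Math.Pow(x, 0.5). It is NOT a drop-in replacement;
// agrees_with_pow_half below exposes the inputs where it differs.
inline double pow_half_via_sqrt(double x) { return std::sqrt(x); }

// Returns true when sqrt(x) and pow(x, 0.5) produce the same value,
// treating two NaNs as equal and distinguishing +0.0 from -0.0.
inline bool agrees_with_pow_half(double x)
{
    const double viaPow  = std::pow(x, 0.5);
    const double viaSqrt = pow_half_via_sqrt(x);
    if (std::isnan(viaPow) || std::isnan(viaSqrt))
        return std::isnan(viaPow) && std::isnan(viaSqrt);
    return viaPow == viaSqrt && std::signbit(viaPow) == std::signbit(viaSqrt);
}
```

Under those assumptions, `agrees_with_pow_half` returns false for `-0.0` (`pow` gives `+0.0`, `sqrt` gives `-0.0`) and for negative infinity (`pow` gives `+inf`, `sqrt` gives NaN), so a `sqrt` rewrite would need a guard or a relaxed floating-point mode.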

Test

static double Pow2(double x) => Math.Pow(x, 2);

static double Pow1(double x) => Math.Pow(x, 1);

static double PowN1(double x) => Math.Pow(x, -1);

Before (tier1):

; Pow2(double):double
G_M37400_IG01:
       sub      rsp, 40
       vzeroupper 
G_M37400_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       call     System.Math:Pow(double,double):double
       nop      
G_M37400_IG03:
       add      rsp, 40
       ret      
RWD00  dq	4000000000000000h
; Total bytes of code: 26


; Pow1(double):double
G_M37403_IG01:
       sub      rsp, 40
       vzeroupper 
G_M37403_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       call     System.Math:Pow(double,double):double
       nop      
G_M37403_IG03:
       add      rsp, 40
       ret      
RWD00  dq	3FF0000000000000h
; Total bytes of code: 26


; PowN1(double):double
G_M24053_IG01:
       sub      rsp, 40
       vzeroupper 
G_M24053_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       call     System.Math:Pow(double,double):double
       nop      
G_M24053_IG03:
       add      rsp, 40
       ret      
RWD00  dq	BFF0000000000000h
; Total bytes of code: 26

After (tier1):

; Pow2(double):double
G_M37400_IG01:
       vzeroupper 
G_M37400_IG02:
       vmulsd   xmm0, xmm0
G_M37400_IG03:
       ret      
; Total bytes of code: 8


; Pow1(double):double
G_M37403_IG01:
       vzeroupper 
G_M37403_IG02:
       ret      
; Total bytes of code: 4


; PowN1(double):double
G_M24053_IG01:
       vzeroupper 
G_M24053_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       vdivsd   xmm1, xmm0
       vmovaps  xmm0, xmm1
G_M24053_IG03:
       ret      
RWD00  dq	3FF0000000000000h
; Total bytes of code: 20

I will run the jit-diff tools, but I suspect they won't find anything in the BCL (update: there are actually two cases). However, I have seen such usages in other applications, e.g. Xenko (a game engine): https://github.com/xenko3d/xenko/search?q=Math.Pow&unscoped_q=Math.Pow

Also, once some sort of -ffast-math mode is implemented, we can unroll other constant exponents (gcc unrolls up to 100, clang - 32)

{
    // Math.Pow(x, 2) -> x*x
    newNode = gtNewOperNode(GT_MUL, powerCon->TypeGet(), arg0,
                            gtNewLclvNode(arg0->gtLclVar.gtLclNum, arg0->TypeGet()));

It seems that you are assuming that arg0 is GT_LCL_VAR. Why would that always be true?

Member Author

@mikedn So it can be GT_LCL_VAR, GT_CALL, GT_FIELD (+ static field access with initialization). I've temporarily limited it to LCL_VAR, I guess for other cases I need to introduce a tmp var?

So it can be GT_LCL_VAR, GT_CALL, GT_FIELD

I don't know why it would be limited to those opers, it can probably be pretty much anything, like a + b * (float)c.

Anyway, yes, if it's not GT_LCL_VAR (or GT_LCL_FLD perhaps) you need to introduce a temporary variable because the original expression now has multiple uses in x * x. Introducing a temp is a bit unfortunate because it breaks trees and that may cause other problems. In the past I toyed with the idea of adding a SQR oper to avoid multiple uses though I'm not sure if it's something worth doing.
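
The multiple-use problem described above can be shown outside the JIT. The following is a minimal sketch under assumed names (`Operand`, `square_naive`, `square_with_temp` are hypothetical and exist only for this example): when the operand is an arbitrary expression, copying it into both sides of `x * x` evaluates it twice, while spilling it to a temporary evaluates it once.

```cpp
#include <functional>

// Hypothetical stand-in for an arbitrary operand expression. Evaluating it
// may have side effects (for example, it could be a call).
struct Operand
{
    std::function<double()> eval;
};

// Naive rewrite of pow(e, 2) as e * e: the operand is evaluated twice.
inline double square_naive(const Operand& e)
{
    return e.eval() * e.eval();
}

// Rewrite with a temporary: the operand is evaluated once, and only the
// temporary (a plain local) is used twice.
inline double square_with_temp(const Operand& e)
{
    const double tmp = e.eval();
    return tmp * tmp;
}
```

If `eval` increments a counter, `square_naive` bumps it twice per call and `square_with_temp` bumps it once, which matches the behaviour of the original single-argument `Math.Pow` call.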

@am11
Member

am11 commented Sep 6, 2019

(if it doesn't already) could it produce three operands for AVX / AVX512F like gcc?

PowN1:

vmovsd   xmm1, qword ptr [reloc @RWD00]
- vdivsd   xmm1, xmm0
- vmovaps  xmm0, xmm1
+ vdivsd   xmm0, xmm1, xmm0

@EgorBo
Member Author

EgorBo commented Sep 6, 2019

@am11 I guess it's a task for the low level optimizations/register allocator

@tannergooding
Member

(if it doesn't already) could it produce three operands for AVX / AVX512F like gcc?

There is at least one bug here (and I think we have a tracking issue somewhere)...

The disassembly output is definitely wrong. If we are actually outputting the vex-encoding as indicated (vdivsd vs divsd); then we aren't printing all three parameters and that should be fixed. Otherwise, we aren't outputting the vex-encoding and we should print divsd for this case. -- If it's the latter, we have another bug where we should be outputting the vex-encoding here.

I believe we are just hitting the former (we are actually using the VEX encoding but not printing all three registers), as (iirc) the emitter has logic to always adjust things appropriately. However, this may mean there is another issue where we are going through a path in codegen that isn't VEX aware and therefore the register allocation done might not be ideal. @CarolEidt might have more context here.

@mikedn

mikedn commented Sep 6, 2019

I actually have an old branch that generates the 3 operand VEX form for floating point ops. The disassembler does display the 3 operand form when IF_RWR_RRD_RRD is used.

@tannergooding
Member

I logged a bug here: https://github.com/dotnet/coreclr/issues/26569, to track genCodeForBinary (and other functions) being non-VEX aware.

@KvanTTT

KvanTTT commented Sep 7, 2019

Math.Pow(c1, c2) to c3 (call PAL_pow?)

I think it should be added not only for the Pow function but for all pure functions from the Math class (Sin, Cos, ATan, Exp, Log, etc.). It's a pretty common case.

@am11
Member

am11 commented Sep 7, 2019

gcc unrolls up to 100, clang - 32

with C's double pow (double base, double exponent), gcc unrolls over exponent of 263 with fast-math: https://godbolt.org/z/9KWH83 (and clang's limit is merely 32). However, max exponent for .NET is 639315432260133184 (also in C) and min is -671156337577776575 (C has slightly higher min: -671156337577776448) for base ±1.000000000000001, and the rest (which probably can be intrinsified) is infinity or 0.0.
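
For context on what "unrolling" a constant integer exponent means, one common scheme is exponentiation by squaring, which needs O(log n) multiplications. The sketch below is only an illustration: whether gcc or clang use exactly this scheme is an assumption, and `pow_by_squaring` and `pow_int` are hypothetical names. The result can round differently from a library `pow`, which is one reason such expansion is usually tied to a fast-math mode.

```cpp
#include <cstdint>

// Computes x^n for a non-negative integer n by repeated squaring.
inline double pow_by_squaring(double x, std::uint64_t n)
{
    double result = 1.0;
    double base   = x;
    while (n != 0)
    {
        if (n & 1)
            result *= base; // this bit of the exponent contributes base
        n >>= 1;
        if (n != 0)
            base *= base;   // square only if more bits remain
    }
    return result;
}

// Signed wrapper: negative exponents become a reciprocal.
inline double pow_int(double x, std::int64_t n)
{
    if (n >= 0)
        return pow_by_squaring(x, static_cast<std::uint64_t>(n));
    // Negate in unsigned arithmetic so INT64_MIN does not overflow.
    return 1.0 / pow_by_squaring(x, 0 - static_cast<std::uint64_t>(n));
}
```

With a constant exponent, a compiler can fully unroll the loop into straight-line multiplications, which is the effect visible in the godbolt link above.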

@erozenfeld
Member

@dotnet/jit-contrib

@maryamariyan
Member

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create a patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

@BruceForstall BruceForstall added the post-consolidation PRs which will be hand ported to dotnet/runtime label Nov 7, 2019
@BruceForstall BruceForstall requested a review from a team November 7, 2019 21:17
}
else if (arg0->OperIs(GT_IND) && arg0->AsIndir()->Addr()->gtGetOp1()->OperIs(GT_LCL_VAR))
{
    // Math.Pow(x, 2) -> x*x where x is a field

The way the code is written the only field that it will recognize is the first field of a struct (the field at offset 0). I doubt that was the intention.

Member Author

@mikedn heh, definitely not the intention, I wish I actually could try the "introduce a tmp local" scenario instead.

@CarolEidt

this may mean there is another issue where we are going through a path in codegen that isn't VEX aware and therefore the register allocation done might not be ideal. @CarolEidt might have more context here

If that is an issue, then the places to track it down and fix it are in Lowering where the register requirements are defined, and then in either CodeGen or emitter depending on what path it's going down.

I'm wondering if it might be better to transform this in the importer, before the call has actually been created. At that time:

  • I believe it's easier to create a new temp when needed, e.g. for the x*x case,
  • It could be done in impMathIntrinsic() when we're already down the math intrinsic path, rather than adding a check during morph,
  • and we don't have to create the call and then delete it later
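
The importer-time approach in the list above can be sketched with a toy IR. Everything here is hypothetical and heavily simplified: `ToyImporter`, `importPow`, `Kind`, and `Node` are not the real JIT types (the real entry point mentioned above is impMathIntrinsic()). The sketch only illustrates the shape of the idea: decide on the rewrite before any call node exists, and spill a non-local operand to a fresh temp for the x*x case.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Toy IR for illustration only; these are NOT the real JIT data structures.
enum class Kind { Local, Const, Mul, Div, PowCall, Other };

struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node
{
    Kind    kind  = Kind::Other;
    int     local = -1;  // local number, valid for Kind::Local
    double  value = 0.0; // constant value, valid for Kind::Const
    NodePtr op1;
    NodePtr op2;
};

inline NodePtr makeNode(Kind kind, NodePtr op1 = nullptr, NodePtr op2 = nullptr)
{
    auto n  = std::make_shared<Node>();
    n->kind = kind;
    n->op1  = std::move(op1);
    n->op2  = std::move(op2);
    return n;
}

inline NodePtr makeLocal(int num)
{
    auto n   = makeNode(Kind::Local);
    n->local = num;
    return n;
}

inline NodePtr makeConst(double v)
{
    auto n   = makeNode(Kind::Const);
    n->value = v;
    return n;
}

struct ToyImporter
{
    int nextLocal; // first unused local number
    // "temp = expr" statements that must run before the returned tree.
    std::vector<std::pair<int, NodePtr>> tempStores;

    explicit ToyImporter(int firstFreeLocal) : nextLocal(firstFreeLocal) {}

    // Builds the tree for pow(x, y), picking a cheaper form when y is a
    // recognized constant. No call node is created for the rewritten cases.
    NodePtr importPow(NodePtr x, NodePtr y)
    {
        if (y->kind == Kind::Const)
        {
            if (y->value == 1.0)
                return x; // pow(x, 1) -> x
            if (y->value == -1.0)
                return makeNode(Kind::Div, makeConst(1.0), x); // pow(x, -1) -> 1/x
            if (y->value == 2.0)
            {
                if (x->kind != Kind::Local)
                {
                    // Spill the operand so it is evaluated exactly once.
                    const int tmp = nextLocal++;
                    tempStores.emplace_back(tmp, x);
                    x = makeLocal(tmp);
                }
                return makeNode(Kind::Mul, x, makeLocal(x->local)); // x * x
            }
        }
        return makeNode(Kind::PowCall, x, y); // fall back to the call
    }
};
```

In this sketch a local operand is simply referenced twice, while any other operand produces one entry in `tempStores` and a multiplication of two uses of the new temp.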

Other thoughts @dotnet/jit-contrib ?

@tannergooding
Member

tannergooding commented Nov 8, 2019

I had done some more investigation here and basically, outside of HWIntrinsics, most other code paths (including SIMD intrinsics) go through emitInsBinary or explicit calls to emitIns_* overloads that aren't VEX aware.

The emitter ends up fixing these later by checking IsDstDstSrcAVXInstruction and IsDstSrcSrcAVXInstruction.

I think, ideally, we would switch all the floating-point code over to be VEX aware (and therefore mark instructions as non-RMW when VEX is available). We would then assert in the emitter that we get dst, op1, and op2 separately when VEX is supported.

Edit: However, for HWIntrinsics, we always go through VEX aware code-paths and even VEX aware emitter calls (i.e. emitIns_SIMD_R_R_R). These VEX aware paths know how to handle dst and op1 differing and insert the appropriate mov instructions where required. In lowering, we handle the RMW vs non-RMW paths by tracking flags in the hwintrinsiclistxarch.h table.

@maryamariyan
Member

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

Labels
area-CodeGen post-consolidation PRs which will be hand ported to dotnet/runtime