Optimize Math.Pow(x, c) in JIT #26552

EgorBo · 2019-09-05T23:20:00Z

Fixes https://github.com/dotnet/coreclr/issues/26434

Converts:
Math.Pow(x, 2) to x*x
Math.Pow(x, 1) to x
Math.Pow(x, -1) to 1/x
(same for MathF and floats)
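A rough way to sanity-check the three rewrites outside the JIT is to compare them against the C library's pow. The sketch below is plain C++, not JIT code; the helper names (pow2_rewritten, pow1_rewritten, powN1_rewritten, same_result, rewrites_match_pow) are made up for illustration, and it assumes Math.Pow behaves like std::pow for these inputs.

```cpp
#include <cmath>

// Each helper mirrors one rewrite from the list above (hypothetical names).
inline double pow2_rewritten(double x)  { return x * x; }    // Math.Pow(x, 2)
inline double pow1_rewritten(double x)  { return x; }        // Math.Pow(x, 1)
inline double powN1_rewritten(double x) { return 1.0 / x; }  // Math.Pow(x, -1)

// Treats two NaNs as equal; otherwise uses ordinary floating-point equality.
inline bool same_result(double a, double b) {
    return (std::isnan(a) && std::isnan(b)) || a == b;
}

// True when all three rewrites give the same value as std::pow for this x.
inline bool rewrites_match_pow(double x) {
    return same_result(pow2_rewritten(x),  std::pow(x, 2.0)) &&
           same_result(pow1_rewritten(x),  std::pow(x, 1.0)) &&
           same_result(powN1_rewritten(x), std::pow(x, -1.0));
}
```

Calling rewrites_match_pow over a handful of inputs (including 0.0 and negative values) is only a spot check; it does not prove the rewrites are bit-identical for every double.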

Can be added:
Math.Pow(c1, c2) to c3 (call PAL_pow?)
Math.Pow(1, x) to 1
Math.Pow(2, x) to exp2(x)
Math.Pow(x, 0) to 1
Math.Pow(x, -0) to 1
Math.Pow(x, 0.5) to sqrt(x)
Math.Pow(x, -2) to 1/(x*x) (probably is not safe)
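Some of these candidates need care around special values. As a sketch of the kind of check involved, the C++ helper below (hypothetical name sqrt_rewrite_differs) reports whether replacing pow(x, 0.5) with sqrt(x) would be observable for a given x. It relies on the C library's pow/sqrt special-case rules (C99 Annex F) and assumes Math.Pow and Math.Sqrt follow the same rules.

```cpp
#include <cmath>

// True when pow(x, 0.5) and sqrt(x) disagree for this x,
// including disagreement on NaN-ness or on the sign of zero.
inline bool sqrt_rewrite_differs(double x) {
    const double viaPow  = std::pow(x, 0.5);
    const double viaSqrt = std::sqrt(x);
    if (std::isnan(viaPow) || std::isnan(viaSqrt)) {
        return std::isnan(viaPow) != std::isnan(viaSqrt);
    }
    return viaPow != viaSqrt || std::signbit(viaPow) != std::signbit(viaSqrt);
}
```

Under those rules the two disagree for x = -0.0 (pow gives +0, sqrt gives -0) and for x = -infinity (pow gives +infinity, sqrt gives NaN), so a sqrt rewrite would have to either exclude such inputs or be gated behind a relaxed floating-point mode.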

Test

static double Pow2(double x) => Math.Pow(x, 2);

static double Pow1(double x) => Math.Pow(x, 1);

static double PowN1(double x) => Math.Pow(x, -1);

Before (tier1):

; Pow2(double):double
G_M37400_IG01:
       sub      rsp, 40
       vzeroupper 
G_M37400_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       call     System.Math:Pow(double,double):double
       nop      
G_M37400_IG03:
       add      rsp, 40
       ret      
RWD00  dq	4000000000000000h
; Total bytes of code: 26


; Pow1(double):double
G_M37403_IG01:
       sub      rsp, 40
       vzeroupper 
G_M37403_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       call     System.Math:Pow(double,double):double
       nop      
G_M37403_IG03:
       add      rsp, 40
       ret      
RWD00  dq	3FF0000000000000h
; Total bytes of code: 26


; PowN1(double):double
G_M24053_IG01:
       sub      rsp, 40
       vzeroupper 
G_M24053_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       call     System.Math:Pow(double,double):double
       nop      
G_M24053_IG03:
       add      rsp, 40
       ret      
RWD00  dq	BFF0000000000000h
; Total bytes of code: 26

After (tier1):

; Pow2(double):double
G_M37400_IG01:
       vzeroupper 
G_M37400_IG02:
       vmulsd   xmm0, xmm0
G_M37400_IG03:
       ret      
; Total bytes of code: 8


; Pow1(double):double
G_M37403_IG01:
       vzeroupper 
G_M37403_IG02:
       ret      
; Total bytes of code: 4


; PowN1(double):double
G_M24053_IG01:
       vzeroupper 
G_M24053_IG02:
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       vdivsd   xmm1, xmm0
       vmovaps  xmm0, xmm1
G_M24053_IG03:
       ret      
RWD00  dq	3FF0000000000000h
; Total bytes of code: 20

I will run the jit-diff tools, but I suspect they won't find anything in the BCL (UPD: there are actually two cases). I did see such usages in other applications, though, e.g. Xenko (a game engine): https://github.com/xenko3d/xenko/search?q=Math.Pow&unscoped_q=Math.Pow

Also, once some sort of -ffast-math mode is implemented, we can unroll other constant exponents (gcc unrolls up to 100, clang up to 32).
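For reference, "unrolling" a constant integer exponent usually means replacing the call with a short chain of multiplications. The C++ sketch below uses square-and-multiply; pow_unrolled is a hypothetical name, and this is not what gcc or clang literally emit. Because every multiplication rounds, the result can differ from pow in the last bits for general inputs, which is why such a rewrite is normally tied to a fast-math style mode.

```cpp
// Computes x^n for a non-negative integer n using square-and-multiply.
// A compiler that knows n at compile time can flatten this loop into
// a fixed sequence of multiplications.
inline double pow_unrolled(double x, unsigned n) {
    double result = 1.0;
    double base = x;
    while (n != 0) {
        if (n & 1u) {
            result *= base;  // this bit of the exponent is set
        }
        base *= base;        // advance to the next power of two
        n >>= 1;
    }
    return result;
}
```

For example, pow_unrolled(x, 10) needs only a handful of multiplications instead of a library call.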

mikedn · 2019-09-06T05:02:06Z

src/jit/morph.cpp

+            {
+                // Math.Pow(x, 2) -> x*x
+                newNode = gtNewOperNode(GT_MUL, powerCon->TypeGet(), arg0,
+                                        gtNewLclvNode(arg0->gtLclVar.gtLclNum, arg0->TypeGet()));


It seems that you are assuming that arg0 is GT_LCL_VAR. Why would that always be true?

@mikedn So it can be GT_LCL_VAR, GT_CALL, GT_FIELD (+ static field access with initialization). I've temporarily limited it to LCL_VAR, I guess for other cases I need to introduce a tmp var?

> So it can be GT_LCL_VAR, GT_CALL, GT_FIELD

I don't know why it would be limited to those opers, it can probably be pretty much anything, like a + b * (float)c.

Anyway, yes, if it's not GT_LCL_VAR (or GT_LCL_FLD perhaps) you need to introduce a temporary variable because the original expression now has multiple uses in x * x. Introducing a temp is a bit unfortunate because it breaks trees and that may cause other problems. In the past I toyed with the idea of adding a SQR oper to avoid multiple uses though I'm not sure if it's something worth doing.
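To illustrate the multiple-use problem outside the JIT, here is a small C++ sketch. CountingSource, square_by_duplication and square_via_temp are hypothetical names; the struct stands in for any operand expression whose evaluation is observable (a call, a volatile read, and so on).

```cpp
// Hypothetical operand whose evaluation has an observable side effect.
struct CountingSource {
    int evaluations;
    double value;
    explicit CountingSource(double v) : evaluations(0), value(v) {}
    double next() { ++evaluations; return value; }
};

// Naive rewrite of pow(expr, 2): the operand expression is duplicated,
// so it is evaluated twice.
inline double square_by_duplication(CountingSource& src) {
    return src.next() * src.next();
}

// Rewrite with a temporary: the operand is evaluated once and reused.
inline double square_via_temp(CountingSource& src) {
    const double tmp = src.next();
    return tmp * tmp;
}
```

Both functions return the same product, but the first one bumps the evaluation counter twice, which is the behavior change a temp avoids.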

am11 · 2019-09-06T08:39:06Z

(if it doesn't already) could it produce three operands for AVX / AVX512F like gcc?

PowN1:

vmovsd   xmm1, qword ptr [reloc @RWD00]
- vdivsd   xmm1, xmm0
- vmovaps  xmm0, xmm1
+ vdivsd   xmm0, xmm1, xmm0

EgorBo · 2019-09-06T11:35:12Z

@am11 I guess it's a task for the low level optimizations/register allocator

tannergooding · 2019-09-06T14:59:38Z

> (if it doesn't already) could it produce three operands for AVX / AVX512F like gcc?

There is at least one bug here (and I think we have a tracking issue somewhere)...

The disassembly output is definitely wrong. If we are actually emitting the VEX encoding as indicated (vdivsd vs divsd), then we aren't printing all three operands and that should be fixed. Otherwise, we aren't emitting the VEX encoding and we should print divsd for this case. If it's the latter, we have another bug, because we should be emitting the VEX encoding here.

I believe we are just hitting the former (we are actually using the VEX encoding but not printing all three registers), as (iirc) the emitter has logic to always adjust things appropriately. However, this may mean there is another issue where we are going through a path in codegen that isn't VEX aware and therefore the register allocation done might not be ideal. @CarolEidt might have more context here.

mikedn · 2019-09-06T16:18:10Z

I actually have an old branch that generates the 3 operand VEX form for floating point ops. The disassembler does display the 3 operand form when IF_RWR_RRD_RRD is used.

tannergooding · 2019-09-06T17:16:16Z

I logged a bug here: https://github.com/dotnet/coreclr/issues/26569, to track genCodeForBinary (and other functions) being non-VEX aware.

KvanTTT · 2019-09-07T00:40:24Z

> Math.Pow(c1, c2) to c3 (call PAL_pow?)

I think this should be added not only for Pow but for all pure functions of the Math class (Sin, Cos, Atan, Exp, Log, etc.). It's a pretty common case.
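A minimal sketch of what folding pure math calls with constant operands could look like, in C++ rather than JIT IR. Operand, PureIntrinsic and try_fold are hypothetical names and not part of the JIT. Note that this folds with the host's math library; a real implementation would presumably need to use the same implementation the runtime calls (hence the PAL_pow question above), so that folded and non-folded results agree.

```cpp
#include <cmath>

// Hypothetical operand model: either a known constant or unknown at compile time.
struct Operand {
    bool is_constant;
    double value;  // meaningful only when is_constant is true
};

enum class PureIntrinsic { Sin, Cos, Exp, Log, Pow };

// Folds fn(a) or fn(a, b) when every operand it uses is constant.
// Returns true and writes the folded value to result on success.
inline bool try_fold(PureIntrinsic fn, const Operand& a, const Operand& b, double& result) {
    const bool is_binary = (fn == PureIntrinsic::Pow);
    if (!a.is_constant || (is_binary && !b.is_constant)) {
        return false;  // at least one required operand is not a constant
    }
    switch (fn) {
        case PureIntrinsic::Sin: result = std::sin(a.value); return true;
        case PureIntrinsic::Cos: result = std::cos(a.value); return true;
        case PureIntrinsic::Exp: result = std::exp(a.value); return true;
        case PureIntrinsic::Log: result = std::log(a.value); return true;
        case PureIntrinsic::Pow: result = std::pow(a.value, b.value); return true;
    }
    return false;
}
```

For unary intrinsics the second operand is ignored, so any placeholder can be passed for it.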

am11 · 2019-09-07T14:53:32Z

> gcc unrolls up to 100, clang - 32

With C's double pow(double base, double exponent), gcc with fast-math unrolls exponents beyond 2⁶³: https://godbolt.org/z/9KWH83 (and clang's limit is merely 32). However, for base ±1.000000000000001 the max exponent for .NET is 639315432260133184 (same in C) and the min is -671156337577776575 (C has a slightly higher min: -671156337577776448); everything beyond that (which can probably be intrinsified) evaluates to infinity or 0.0.

erozenfeld · 2019-09-25T23:45:26Z

@dotnet/jit-contrib

maryamariyan · 2019-11-06T21:04:10Z

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

1. In your coreclr repository clone, create patch by running git format-patch origin
2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

mikedn · 2019-11-07T21:31:05Z

src/jit/morph.cpp

+                }
+                else if (arg0->OperIs(GT_IND) && arg0->AsIndir()->Addr()->gtGetOp1()->OperIs(GT_LCL_VAR))
+                {
+                    // Math.Pow(x, 2) -> x*x where x is a field


The way the code is written, the only field it will recognize is the first field of a struct (the field at offset 0). I doubt that was the intention.

@mikedn heh, definitely not the intention, I wish I actually could try the "introduce a tmp local" scenario instead.

CarolEidt · 2019-11-08T22:54:04Z

> this may mean there is another issue where we are going through a path in codegen that isn't VEX aware and therefore the register allocation done might not be ideal. @CarolEidt might have more context here

If that is an issue, then the places to track it down and fix it are in Lowering where the register requirements are defined, and then in either CodeGen or emitter depending on what path it's going down.

I'm wondering if it might be better to transform this in the importer, before the call has actually been created. At that time:

I believe it's easier to create a new temp when needed, e.g. for the x*x case,
It could be done in impMathIntrinsic() when we're already down the math intrinsic path, rather than adding a check during morph,
and we don't have to create the call and then delete it later

Other thoughts @dotnet/jit-contrib ?

tannergooding · 2019-11-08T23:00:54Z

I had done some more investigation here and basically, outside of HWIntrinsics, most other code paths (including SIMD intrinsics) go through emitInsBinary or explicit calls to emitIns_* overloads that aren't VEX aware.

The emitter ends up fixing these later by checking IsDstDstSrcAVXInstruction and IsDstSrcSrcAVXInstruction.

I think, ideally, we would switch all the floating-point code over to be VEX aware (and therefore mark instructions as non-RMW when VEX is available). We would then assert in the emitter that we get dst, op1, and op2 separately when VEX is supported.

Edit: However, for HWIntrinsics, we always go through VEX-aware code paths and even VEX-aware emitter calls (i.e. emitIns_SIMD_R_R_R). These VEX-aware paths know how to handle dst and op1 differing and insert the appropriate mov instructions where required. In lowering, we handle the RMW vs non-RMW paths by tracking flags in the hwintrinsiclistxarch.h table.

maryamariyan · 2019-12-02T19:37:37Z

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

EgorBo added 6 commits September 5, 2019 20:04

Morph CALL Math.Pow(x, w) to X * X

d6eb0a3

detect Math.Pow via CORINFO_INTRINSIC_Pow

4055a1f

Oops, the second argument is not always a DblCon

f7bd9b4

Replace gtArgEntryByArgNum(call, 1)->node with just arg1

293550d

Fix case when both arguments are constants

6f59c34

fix formatting issue

d43393b

mikedn reviewed Sep 6, 2019


Limit usage to LCL_VAR

c99e6de

EgorBo added 3 commits September 6, 2019 21:09

Handle fields as arg0

c447729

make sure GT_IND contains LCL_VAR

b2c2598

remove extraneous parentheses

832dda9

sandreenko added the area-CodeGen label Sep 10, 2019

BruceForstall added the post-consolidation PRs which will be hand ported to dotnet/runtime label Nov 7, 2019

BruceForstall requested a review from a team November 7, 2019 21:17

mikedn reviewed Nov 7, 2019


maryamariyan closed this Dec 2, 2019

StendProg mentioned this pull request Jan 8, 2020

Adds Support for Linear Regression ScottPlot/ScottPlot#198

Merged

EgorBo mentioned this pull request Feb 8, 2020

Optimize Math.Pow(x, c) where c is 2, 1, -1 or 0 dotnet/runtime#31978

Closed
