Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

Closed
DrewScoggins opened this issue Jun 12, 2020 · 6 comments · Fixed by #38083
Closed

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

DrewScoggins opened this issue Jun 12, 2020 · 6 comments · Fixed by #38083
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance-benchmarks Issue from performance benchmark
Milestone

Comments

@DrewScoggins
Copy link
Member

Run Information

Architecture x64
OS Windows 10.0.18362
Changes diff

Regressions in Benchstone.BenchF.Simpsn

Benchmark Baseline Test Test/Base Modality Baseline Outlier
Test 164.06 ms 230.26 ms 1.40 False

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Benchstone.BenchF.Simpsn*';

Histogram

Benchstone.BenchF.Simpsn.Test

[162358394.287 ; 177057262.379) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[177057262.379 ; 185020170.713) | @
[185020170.713 ; 199719038.804) | 
[199719038.804 ; 214417906.896) | 
[214417906.896 ; 223408815.954) | 
[223408815.954 ; 239552934.046) | @@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added the tenet-performance-benchmarks Issue from performance benchmark label Jun 12, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jun 12, 2020
@DrewScoggins
Copy link
Member Author

from @AndyAyersMS

Focusing just on the jump in runtime commits it looks like we lost the ability to hoist calls to Exp. This test makes 4 calls to Exp in its inner loop, but two are loop invariant.

Inner loop at first commit of above:

G_M28296_IG04:
       C5F857C0             vxorps   xmm0, xmm0
       C5FB2AC7             vcvtsi2sd  xmm0, edi
       C5FB5905F0000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD24]
       C5F057C9             vxorps   xmm1, xmm1
       C57B58D1             vaddsd   xmm10, xmm0, xmm1
       C5FB1005E8000000     vmovsd   xmm0, qword ptr [reloc @RWD32]
       C4C17857C2           vxorps   xmm0, xmm10
       C5FB5905E3000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD40]
       E8B6D4B45F           call     System.Math:Exp(double):double
       C5FB58F6             vaddsd   xmm6, xmm0, xmm6
       C5AB5805DA000000     vaddsd   xmm0, xmm10, qword ptr [reloc @RWD48]
       C5FB100DC2000000     vmovsd   xmm1, qword ptr [reloc @RWD32]
       C5F857C1             vxorps   xmm0, xmm1
       C5FB5905CE000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD56]
       E891D4B45F           call     System.Math:Exp(double):double
       C5FB58FF             vaddsd   xmm7, xmm0, xmm7
       C5C35905C5000000     vmulsd   xmm0, xmm7, qword ptr [reloc @RWD64]
       C5BB58C0             vaddsd   xmm0, xmm8, xmm0
       C5CB590DC1000000     vmulsd   xmm1, xmm6, qword ptr [reloc @RWD72]
       C5FB58C1             vaddsd   xmm0, xmm0, xmm1
       C5B358C0             vaddsd   xmm0, xmm9, xmm0
       C5FB5935B9000000     vmulsd   xmm6, xmm0, qword ptr [reloc @RWD80]
						;; bbWeight=16    PerfScore 1045.33
G_M28296_IG05:
       FFC7                 inc      edi
       83FF63               cmp      edi, 99
       0F8E76FFFFFF         jle      G_M28296_IG04
						;; bbWeight=16    PerfScore 24.00

inner loop with current main

G_M28296_IG04:
       C5F857C0             vxorps   xmm0, xmm0
       C5FB2AC7             vcvtsi2sd  xmm0, edi
       C5FB59050D010000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD40]
       C5F057C9             vxorps   xmm1, xmm1
       C57B58D1             vaddsd   xmm10, xmm0, xmm1
       C5FB10050D010000     vmovsd   xmm0, qword ptr [reloc @RWD56]
       C4C17857C2           vxorps   xmm0, xmm10
       C5FB58C0             vaddsd   xmm0, xmm0, xmm0
       E83F87B45F           call     System.Math:Exp(double):double
       C4417B58C0           vaddsd   xmm8, xmm0, xmm8
       C5AB580502010000     vaddsd   xmm0, xmm10, qword ptr [reloc @RWD72]
       C5FB100DEA000000     vmovsd   xmm1, qword ptr [reloc @RWD56]
       C5F857C1             vxorps   xmm0, xmm1
       C5FB58C0             vaddsd   xmm0, xmm0, xmm0
       E81D87B45F           call     System.Math:Exp(double):double
       C4417B58C9           vaddsd   xmm9, xmm0, xmm9
       C5FB1005F0000000     vmovsd   xmm0, qword ptr [reloc @RWD88]
       E80B87B45F           call     System.Math:Exp(double):double
       C5B3590DF3000000     vmulsd   xmm1, xmm9, qword ptr [reloc @RWD104]
       C5FB58C1             vaddsd   xmm0, xmm0, xmm1
       C4C13B58C8           vaddsd   xmm1, xmm8, xmm8
       C57B58C1             vaddsd   xmm8, xmm0, xmm1
       C5FB1005EE000000     vmovsd   xmm0, qword ptr [reloc @RWD120]
       E8E986B45F           call     System.Math:Exp(double):double
       C4C17B58C0           vaddsd   xmm0, xmm0, xmm8
       C57B5905EC000000     vmulsd   xmm8, xmm0, qword ptr [reloc @RWD136]
						;; bbWeight=16    PerfScore 1045.33
G_M28296_IG05:
       FFC7                 inc      edi
       83FF63               cmp      edi, 99
       0F8E64FFFFFF         jle      G_M28296_IG04
						;; bbWeight=16    PerfScore 24.00

Nothing jumps out from the commit range as a likely suspect, so will start bisecting.

@AndyAyersMS
Copy link
Member

This regressed because of #33024.

cc @EgorBo @dotnet/jit-contrib

That change can introduce assignment trees into what were previously side-effect free computations, inhibiting the computations from being hoisted out of loops.

Someday we'll fix the loop hoisting code to hoist assignments, but not anytime soon (see eg #35735). Would suggest for now we forgo this optimization if the parent tree is side effect free and the duplication requires introducing a comma.

(Aside: we might want to review more cases where morph may introduce assignment commas to make sure we're not unduly inhibiting hoisting).

Old IR during optHoistLoopCode:

N005 ( 46, 15) [000227] --C---------                 |  +--*  INTRINSIC double exp $285
N004 ( 10, 11) [000226] ------------                 |  |  \--*  MUL       double $49
N002 (  2,  3) [000224] ------------                 |  |     +--*  NEG       double $49
N001 (  1,  2) [000098] ------------                 |  |     |  \--*  LCL_VAR   double V00 loc0         u:2 $40
N003 (  3,  4) [000225] ------------                 |  |     \--*  CNS_DBL   double 2.0000000000000000 $45

new intrinsic IR:

N009 ( 45, 15) [000227] -AC---------                 |  +--*  INTRINSIC double exp $285
N008 (  9, 11) [000260] -A----------                 |  |  \--*  ADD       double $49
N006 (  3,  5) [000258] -A----------                 |  |     +--*  COMMA     double $49
N004 (  2,  3) [000256] -A------R---                 |  |     |  +--*  ASG       double $49
N003 (  1,  2) [000255] D------N----                 |  |     |  |  +--*  LCL_VAR   double V24 tmp10        d:2 $49
N002 (  2,  3) [000224] ------------                 |  |     |  |  \--*  NEG       double $49
N001 (  1,  2) [000098] ------------                 |  |     |  |     \--*  LCL_VAR   double V00 loc0         u:2 $40
N005 (  1,  2) [000257] ------------                 |  |     |  \--*  LCL_VAR   double V24 tmp10        u:2 $49
N007 (  1,  2) [000259] ------------                 |  |     \--*  LCL_VAR   double V24 tmp10        u:2 (last use) $49

@EgorBo
Copy link
Member

EgorBo commented Jun 12, 2020

@AndyAyersMS ouch, in my defense I initially implemented that optimization in Lower so it used to happen after the loop hoisting 🙂 Maybe it makes sense to revert to Lower?

@AndyAyersMS
Copy link
Member

Next time you'll think twice before listening to my suggestions... though in my defense, I also wrote

We don't have a good way of determining if there are dependencies between optimizations, or if one optimization may enable another. ... Best I can suggest for now is to make sure you run diffs widely, eg over the entire set of Pri1 tests, and look carefully at the diffs

I still think early makes sense, we just need to be careful not to upset our fragile optimization phases. We tend to introduce comma forms in morph without really thinking what impact they might have.

@EgorBo
Copy link
Member

EgorBo commented Jun 12, 2020

@AndyAyersMS I suspect the following code (which is not related to my optimization) also fails to hoist code from loop:

static void Foo(int[] array, int x)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = (x * x) % 10; // replace with "/ 10" and it will work
    }
}

I guess it's the same problem - % 10 was decomposed in morph with a temp.

@AndyAyersMS AndyAyersMS added this to the 5.0 milestone Jun 12, 2020
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 12, 2020
@AndyAyersMS
Copy link
Member

We should fix this for 5.0.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance-benchmarks Issue from performance benchmark
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants