[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

DrewScoggins · 2020-06-12T17:30:21Z

Run Information

Architecture	x64
OS	Windows 10.0.18362
Changes	diff

Regressions in Benchstone.BenchF.Simpsn

Benchmark	Baseline	Test	Test/Base	Modality	Baseline Outlier
Test	164.06 ms	230.26 ms	1.40		False

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Benchstone.BenchF.Simpsn*';

Histogram

Benchstone.BenchF.Simpsn.Test

[162358394.287 ; 177057262.379) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[177057262.379 ; 185020170.713) | @
[185020170.713 ; 199719038.804) | 
[199719038.804 ; 214417906.896) | 
[214417906.896 ; 223408815.954) | 
[223408815.954 ; 239552934.046) | @@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

The text was updated successfully, but these errors were encountered:

DrewScoggins · 2020-06-12T17:31:18Z

from @AndyAyersMS

Focusing just on the jump in runtime commits it looks like we lost the ability to hoist calls to Exp. This test makes 4 calls to Exp in its inner loop, but two are loop invariant.

Inner loop at first commit of above:

G_M28296_IG04:
       C5F857C0             vxorps   xmm0, xmm0
       C5FB2AC7             vcvtsi2sd  xmm0, edi
       C5FB5905F0000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD24]
       C5F057C9             vxorps   xmm1, xmm1
       C57B58D1             vaddsd   xmm10, xmm0, xmm1
       C5FB1005E8000000     vmovsd   xmm0, qword ptr [reloc @RWD32]
       C4C17857C2           vxorps   xmm0, xmm10
       C5FB5905E3000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD40]
       E8B6D4B45F           call     System.Math:Exp(double):double
       C5FB58F6             vaddsd   xmm6, xmm0, xmm6
       C5AB5805DA000000     vaddsd   xmm0, xmm10, qword ptr [reloc @RWD48]
       C5FB100DC2000000     vmovsd   xmm1, qword ptr [reloc @RWD32]
       C5F857C1             vxorps   xmm0, xmm1
       C5FB5905CE000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD56]
       E891D4B45F           call     System.Math:Exp(double):double
       C5FB58FF             vaddsd   xmm7, xmm0, xmm7
       C5C35905C5000000     vmulsd   xmm0, xmm7, qword ptr [reloc @RWD64]
       C5BB58C0             vaddsd   xmm0, xmm8, xmm0
       C5CB590DC1000000     vmulsd   xmm1, xmm6, qword ptr [reloc @RWD72]
       C5FB58C1             vaddsd   xmm0, xmm0, xmm1
       C5B358C0             vaddsd   xmm0, xmm9, xmm0
       C5FB5935B9000000     vmulsd   xmm6, xmm0, qword ptr [reloc @RWD80]
						;; bbWeight=16    PerfScore 1045.33
G_M28296_IG05:
       FFC7                 inc      edi
       83FF63               cmp      edi, 99
       0F8E76FFFFFF         jle      G_M28296_IG04
						;; bbWeight=16    PerfScore 24.00

inner loop with current main

G_M28296_IG04:
       C5F857C0             vxorps   xmm0, xmm0
       C5FB2AC7             vcvtsi2sd  xmm0, edi
       C5FB59050D010000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD40]
       C5F057C9             vxorps   xmm1, xmm1
       C57B58D1             vaddsd   xmm10, xmm0, xmm1
       C5FB10050D010000     vmovsd   xmm0, qword ptr [reloc @RWD56]
       C4C17857C2           vxorps   xmm0, xmm10
       C5FB58C0             vaddsd   xmm0, xmm0, xmm0
       E83F87B45F           call     System.Math:Exp(double):double
       C4417B58C0           vaddsd   xmm8, xmm0, xmm8
       C5AB580502010000     vaddsd   xmm0, xmm10, qword ptr [reloc @RWD72]
       C5FB100DEA000000     vmovsd   xmm1, qword ptr [reloc @RWD56]
       C5F857C1             vxorps   xmm0, xmm1
       C5FB58C0             vaddsd   xmm0, xmm0, xmm0
       E81D87B45F           call     System.Math:Exp(double):double
       C4417B58C9           vaddsd   xmm9, xmm0, xmm9
       C5FB1005F0000000     vmovsd   xmm0, qword ptr [reloc @RWD88]
       E80B87B45F           call     System.Math:Exp(double):double
       C5B3590DF3000000     vmulsd   xmm1, xmm9, qword ptr [reloc @RWD104]
       C5FB58C1             vaddsd   xmm0, xmm0, xmm1
       C4C13B58C8           vaddsd   xmm1, xmm8, xmm8
       C57B58C1             vaddsd   xmm8, xmm0, xmm1
       C5FB1005EE000000     vmovsd   xmm0, qword ptr [reloc @RWD120]
       E8E986B45F           call     System.Math:Exp(double):double
       C4C17B58C0           vaddsd   xmm0, xmm0, xmm8
       C57B5905EC000000     vmulsd   xmm8, xmm0, qword ptr [reloc @RWD136]
						;; bbWeight=16    PerfScore 1045.33
G_M28296_IG05:
       FFC7                 inc      edi
       83FF63               cmp      edi, 99
       0F8E64FFFFFF         jle      G_M28296_IG04
						;; bbWeight=16    PerfScore 24.00

Nothing jumps out from the commit range as a likely suspect, so will start bisecting.

AndyAyersMS · 2020-06-12T19:37:56Z

This regressed because of #33024.

cc @EgorBo @dotnet/jit-contrib

That change can introduce assignment trees into what were previously side-effect free computations, inhibiting the computations from being hoisted out of loops.

Someday we'll fix the loop hoisting code to hoist assignments, but not anytime soon (see eg #35735). Would suggest for now we forgo this optimization if the parent tree is side effect free and the duplication requires introducing a comma.

(Aside: we might want to review more cases where morph may introduce assignment commas to make sure we're not unduly inhibiting hoisting).

Old IR during optHoistLoopCode:

N005 ( 46, 15) [000227] --C---------                 |  +--*  INTRINSIC double exp $285
N004 ( 10, 11) [000226] ------------                 |  |  \--*  MUL       double $49
N002 (  2,  3) [000224] ------------                 |  |     +--*  NEG       double $49
N001 (  1,  2) [000098] ------------                 |  |     |  \--*  LCL_VAR   double V00 loc0         u:2 $40
N003 (  3,  4) [000225] ------------                 |  |     \--*  CNS_DBL   double 2.0000000000000000 $45

new intrinsic IR:

N009 ( 45, 15) [000227] -AC---------                 |  +--*  INTRINSIC double exp $285
N008 (  9, 11) [000260] -A----------                 |  |  \--*  ADD       double $49
N006 (  3,  5) [000258] -A----------                 |  |     +--*  COMMA     double $49
N004 (  2,  3) [000256] -A------R---                 |  |     |  +--*  ASG       double $49
N003 (  1,  2) [000255] D------N----                 |  |     |  |  +--*  LCL_VAR   double V24 tmp10        d:2 $49
N002 (  2,  3) [000224] ------------                 |  |     |  |  \--*  NEG       double $49
N001 (  1,  2) [000098] ------------                 |  |     |  |     \--*  LCL_VAR   double V00 loc0         u:2 $40
N005 (  1,  2) [000257] ------------                 |  |     |  \--*  LCL_VAR   double V24 tmp10        u:2 $49
N007 (  1,  2) [000259] ------------                 |  |     \--*  LCL_VAR   double V24 tmp10        u:2 (last use) $49

EgorBo · 2020-06-12T19:53:36Z

@AndyAyersMS ouch, in my defense I initially implemented that optimization in Lower so it used to happen after the loop hoisting 🙂 Maybe it makes sense to revert to Lower?

AndyAyersMS · 2020-06-12T20:08:04Z

Next time you'll think twice before listening to my suggestions... though in my defense, I also wrote

We don't have a good way of determining if there are dependencies between optimizations, or if one optimization may enable another. ... Best I can suggest for now is to make sure you run diffs widely, eg over the entire set of Pri1 tests, and look carefully at the diffs

I still think early makes sense, we just need to be careful not to upset our fragile optimization phases. We tend to introduce comma forms in morph without really thinking what impact they might have.

EgorBo · 2020-06-12T20:30:06Z

@AndyAyersMS I suspect the following code (which is not related to my optimization) also fails to hoist code from loop:

static void Foo(int[] array, int x)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = (x * x) % 10; // replace with "/ 10" and it will work
    }
}

I guess it's the same problem - % 10 was decomposed in morph with a temp.

AndyAyersMS · 2020-06-12T22:53:46Z

We should fix this for 5.0.

DrewScoggins added the tenet-performance-benchmarks Issue from performance benchmark label Jun 12, 2020

Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jun 12, 2020

AndyAyersMS added this to the 5.0 milestone Jun 12, 2020

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 12, 2020

EgorBo mentioned this issue Jun 18, 2020

RyuJIT: Don't insert GT_COMMA for x*2->x+x opt before optHoistLoopCode phase #38083

Merged

BruceForstall assigned EgorBo Jun 22, 2020

AndyAyersMS closed this as completed in #38083 Jun 25, 2020

ghost locked as resolved and limited conversation to collaborators Dec 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

DrewScoggins commented Jun 12, 2020

Histogram

Benchstone.BenchF.Simpsn.Test

Docs

DrewScoggins commented Jun 12, 2020

AndyAyersMS commented Jun 12, 2020

EgorBo commented Jun 12, 2020

AndyAyersMS commented Jun 12, 2020

EgorBo commented Jun 12, 2020 •

edited

Loading

AndyAyersMS commented Jun 12, 2020

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

Comments

DrewScoggins commented Jun 12, 2020

Run Information

Regressions in Benchstone.BenchF.Simpsn

Repro

Histogram

Benchstone.BenchF.Simpsn.Test

Docs

DrewScoggins commented Jun 12, 2020

AndyAyersMS commented Jun 12, 2020

EgorBo commented Jun 12, 2020

AndyAyersMS commented Jun 12, 2020

EgorBo commented Jun 12, 2020 • edited Loading

AndyAyersMS commented Jun 12, 2020

EgorBo commented Jun 12, 2020 •

edited

Loading