Hoisting the invariant out of multi-level nested loops #61420

kunalspathak · 2021-11-10T15:27:24Z

If there are multi-level nested loops and there is an invariant calculation inside the inner most loop, we only hoist it to one level outer loop instead of hoisting it to multiple level outer loop (as appropriate). Below is an example of real world code:

https://github.com/SixLabors/ImageSharp/blob/255226b03c8350f88e641bdb58c9450b68729ef7/src/ImageSharp/Processing/Processors/Quantization/WuQuantizer%7BTPixel%7D.cs#L418-L444

Here is a simple repro where all the calculation of ind1 is done inside the a-loop instead of spreading it to b-loop, g-loop and r-loop.

I tried to hand-code this in SixLabors/ImageSharp#1818 and see some good improvements. Ideally this should be done by JIT.

The text was updated successfully, but these errors were encountered:

ghost · 2021-11-10T15:27:31Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

If there are multi-level nested loops and there is an invariant calculation inside the inner most loop, we only hoist it to one level outer loop instead of hoisting it to multiple level outer loop (as appropriate). Below is an example of real world code:

https://github.com/SixLabors/ImageSharp/blob/255226b03c8350f88e641bdb58c9450b68729ef7/src/ImageSharp/Processing/Processors/Quantization/WuQuantizer%7BTPixel%7D.cs#L418-L444

Here is a simple repro where all the calculation of ind1 is done inside the a-loop.

I tried to hand-code this in SixLabors/ImageSharp#1818 and see some good improvements. Ideally this should be done by JIT.

Author:	kunalspathak
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

kunalspathak · 2021-11-10T15:27:33Z

@dotnet/jit-contrib , @BruceForstall

EgorBo · 2021-11-10T15:50:32Z

This is exactly why I see a small regression in #61408, e.g.:

void foo(int i)
{
    for (int j = 0; j < 100; j++)
    {
        for (int k = 0; k < 100; k++)
        {
            Console.WriteLine(Vector128.Create(42, 42));
        }
    }
}

Current codegen in Main:

; Method Program:foo(int):this
G_M10044_IG01:
       push     rdi
       push     rsi
       sub      rsp, 56
       vzeroupper 
       vmovaps  qword ptr [rsp+20H], xmm6
						;; bbWeight=1    PerfScore 5.25

G_M10044_IG02:
       xor      esi, esi
						;; bbWeight=1    PerfScore 0.25

G_M10044_IG03:
       xor      edi, edi
       vmovupd  xmm6, xmmword ptr [reloc @RWD00]
						;; bbWeight=4    PerfScore 13.00

G_M10044_IG04:
       mov      rcx, 0xD1FFAB1E      ; System.Runtime.Intrinsics.Vector128`1[Int64]
       call     CORINFO_HELP_NEWSFAST
       vmovupd  xmmword ptr [rax+8], xmm6
       mov      rcx, rax
       call     System.Console:WriteLine(System.Object)
       inc      edi
       cmp      edi, 100
       jl       SHORT G_M10044_IG04
						;; bbWeight=16    PerfScore 96.00

G_M10044_IG05:
       inc      esi
       cmp      esi, 100
       jl       SHORT G_M10044_IG03
						;; bbWeight=4    PerfScore 6.00

G_M10044_IG06:
       vmovaps  xmm6, qword ptr [rsp+20H]
       add      rsp, 56
       pop      rsi
       pop      rdi
       ret      
						;; bbWeight=1    PerfScore 6.25
RWD00  	dq	000000000000002Ah, 000000000000002Ah
; Total bytes of code: 82

As you can see, Vector128.Create(42, 42) was only hoisted from the inner loop.

AndyAyersMS · 2021-11-10T16:14:44Z

We've got multiple issues open on this already -- linking them up for context

BruceForstall · 2021-11-10T18:21:04Z

@kunalspathak Wrote a simplified example from the imagesharp case:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public class C {
    
    private const int IndexBits = 5;
    private const int IndexAlphaBits = 5;
    private const int IndexCount = (1 << IndexBits) + 1;
    private const int IndexAlphaCount = (1 << IndexAlphaBits) + 1;

    
    public int M() {
        int sum = 0;
        for (int r = 0; r < IndexCount; r++)
        {
            for (int g = 0; g < IndexCount; g++)
            {
                for (int b = 0; b < IndexCount; b++)
                {
                    for (int a = 0; a < IndexAlphaCount; a++)
                    {
                        int ind1 = (r << ((IndexBits * 2) + IndexAlphaBits))
                                    + (r << (IndexBits + IndexAlphaBits + 1))
                                    + (g << (IndexBits + IndexAlphaBits))
                                    + (r << (IndexBits * 2))
                                    + (r << (IndexBits + 1))
                                    + (g << IndexBits)
                                    + ((r + g + b) << IndexAlphaBits)
                                    + r + g + b + a;
                        sum += ind1;
                    }
                }
            }
        }
        return sum;
    }
}

kunalspathak · 2022-02-04T06:27:47Z

Another impact of this is when we refer a static variable inside a loop. This is bad in for Arm64 because we need 3 instructions to construct the address of static variable.

private static int COUNT = 5;

[MethodImpl(MethodImplOptions.NoInlining)]
public static double issue3(int x, int y, int z)
{
    double result = 0;
    for (int i = 0; i < x; i++)
    {
        for (int j = 0; j < y; i++)
        {
            result += COUNT;
        }
    }
    return result;
}

G_M22420_IG03:              ;; offset=0020H
        7100003F          cmp     w1, #0
        5400018D          ble     G_M22420_IG05

 ; Below construction of 0x7ffb568dda90 can be hoisted outside the outer most loop IG03.
        D29B5214          movz    x20, #0xda90
        F2AAD1B4          movk    x20, #0x568d LSL #16
        F2CFFF74          movk    x20, #0x7ffb LSL #32 
        AA1403E0          mov     x0, x20
        528001A1          mov     w1, #13
        94000000          bl      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
        B9405E80          ldr     w0, [x20,#92]
        1E620010          scvtf   d16, w0
                                                ;; bbWeight=4    PerfScore 48.00
G_M22420_IG04:              ;; offset=0048H
        1E702908          fadd    d8, d8, d16
        11000673          add     w19, w19, #1
        17FFFFFE          b       G_M22420_IG04
                                                ;; bbWeight=16    PerfScore 72.00
G_M22420_IG05:              ;; offset=0054H
        11000673          add     w19, w19, #1
        6B00027F          cmp     w19, w0
        54FFFE2B          blt     G_M22420_IG03

kunalspathak · 2022-02-07T19:58:35Z

@BruceForstall - do you think you will have time to look into this in .NET 7?

BruceForstall · 2022-02-08T01:34:43Z

@BruceForstall - do you think you will have time to look into this in .NET 7?

It feels unlikely.

kunalspathak · 2022-04-14T07:49:39Z

I was trying to understand why hoisting of static variable address doesn't get hoisted at 2-levels with the following example:

private static int SOME_TEMP= 4;

int Test() {
  int result = 0;
  for (int i = 0; i < n; i++)  {
      for (int j = 0; j < m; j++)  {
          result += SOME_TEMP;
      }
  }
}

The problem is because we hoist the code from outer to inner loop and not vice-versa as pointed out here.

runtime/src/coreclr/jit/optimizer.cpp

Lines 6202 to 6208 in 1a296c0

    
           // We really should consider hoisting from conditionally executed blocks, if they are frequently executed 
        
           // and it is safe to evaluate the tree early. 
        
           // 
        
           // In particular if we have a loop nest, when scanning the outer loop we should consider hoisting from blocks 
        
           // in enclosed loops. However, this is likely to scale poorly, and we really should instead start 
        
           // hoisting inner to outer. 
        
           //

I tried switching the ordering with the following change:

@@ -6049,7 +6063,6 @@ void Compiler::optHoistLoopNest(unsigned lnum, LoopHoistContext* hoistCtxt)
     m_loopsConsidered++;
 #endif // LOOP_HOIST_STATS

-    optHoistThisLoop(lnum, hoistCtxt);

     VNSet* hoistedInCurLoop = hoistCtxt->ExtractHoistedInCurLoop();

@@ -6086,6 +6099,8 @@ void Compiler::optHoistLoopNest(unsigned lnum, LoopHoistContext* hoistCtxt)
             }
         }
     }
+
+     optHoistThisLoop(lnum, hoistCtxt);

However, when we visit the blocks to hoist, we only visit blocks that dominate exit blocks of the loop. Here is the original block ordering:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..006)-> BB05 ( cond )                     i 
BB02 [0001]  2       BB01,BB04             4     0 [006..00A)-> BB04 ( cond )                     i Loop Loop0 bwd bwd-target 
BB03 [0002]  2       BB02,BB03            16     1 [00A..01A)-> BB03 ( cond )                     i Loop Loop0 hascall bwd bwd-target 
BB04 [0004]  2       BB02,BB03             4     0 [01A..022)-> BB02 ( cond )                     i bwd 
BB05 [0006]  2       BB01,BB04             1       [022..024)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

There are two loops - L00, the outer loop and L01, the inner loop. With my change, we hoist the block BB03 and add a preheader block for the hoisted code. The new blocks look like this:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..006)-> BB05 ( cond )                     i 
BB02 [0001]  2       BB01,BB04             4     0 [006..00A)-> BB04 ( cond )                     i Loop Loop0 bwd bwd-target 
BB06 [0009]  1       BB02                  4     0 [00A..???)                                     internal LoopPH 
BB03 [0002]  2       BB03,BB06            16     1 [00A..01A)-> BB03 ( cond )                     i Loop Loop0 hascall bwd bwd-target 
BB04 [0004]  2       BB02,BB03             4     0 [01A..022)-> BB02 ( cond )                     i bwd 
BB05 [0006]  2       BB01,BB04             1       [022..024)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

But when we try to hoist code for loop L00, we just hoist it for BB02 and BB04 because those are the ones that dominate the exit block BB04. Hence, we discard the block BB06 which has code that can be hoisted out - because we don't consider conditionally executed blocks today.

kunalspathak · 2022-04-14T15:58:49Z

Hence, we discard the block BB06 which has code that can be hoisted out - because we don't consider conditionally executed blocks today.

I believe we should have a way to look at the edge weights and consider all the hot blocks that participate in the loop for hoisting. @AndyAyersMS ?

BruceForstall · 2022-04-26T01:34:31Z

@kunalspathak Since you're working on this, I'm assigning it to you.

kunalspathak · 2022-05-15T04:00:26Z

@kunalspathak Wrote a simplified example from the imagesharp case:

Here is the diff from #68061 . Full diffs: https://www.diffchecker.com/DKGAZKq0

kunalspathak · 2022-06-14T14:09:58Z

The example mentioned in #61420 (comment) is addressed by #68061

dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Nov 10, 2021

EgorBo mentioned this issue Nov 10, 2021

Fix perf issues discovered in "For software performance, can you always trust inlining" blog #61408

Merged

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Nov 11, 2021

JulieLeeMSFT added this to the 7.0.0 milestone Nov 11, 2021

BruceForstall mentioned this issue Nov 15, 2021

Upper bound effect on unsafe indexing #61498

Closed

kunalspathak mentioned this issue Feb 4, 2022

Improving ARM64 Performance in .NET 7.0 #64820

Closed

32 tasks

BruceForstall self-assigned this Feb 8, 2022

kunalspathak mentioned this issue Apr 14, 2022

Hoist the invariants out of multi-level nested loops #68061

Merged

4 tasks

BruceForstall assigned kunalspathak and unassigned BruceForstall Apr 26, 2022

kunalspathak mentioned this issue May 11, 2022

RyuJIT: missed opportunity for LICM #6666

Open

kunalspathak closed this as completed Jun 14, 2022

ghost locked as resolved and limited conversation to collaborators Jul 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hoisting the invariant out of multi-level nested loops #61420

Hoisting the invariant out of multi-level nested loops #61420

kunalspathak commented Nov 10, 2021 •

edited

Loading

ghost commented Nov 10, 2021

kunalspathak commented Nov 10, 2021

EgorBo commented Nov 10, 2021 •

edited

Loading

AndyAyersMS commented Nov 10, 2021

BruceForstall commented Nov 10, 2021 •

edited by kunalspathak

Loading

kunalspathak commented Feb 4, 2022 •

edited

Loading

kunalspathak commented Feb 7, 2022 •

edited

Loading

BruceForstall commented Feb 8, 2022

kunalspathak commented Apr 14, 2022

kunalspathak commented Apr 14, 2022

BruceForstall commented Apr 26, 2022

kunalspathak commented May 15, 2022

kunalspathak commented Jun 14, 2022

Hoisting the invariant out of multi-level nested loops #61420

Hoisting the invariant out of multi-level nested loops #61420

Comments

kunalspathak commented Nov 10, 2021 • edited Loading

ghost commented Nov 10, 2021

kunalspathak commented Nov 10, 2021

EgorBo commented Nov 10, 2021 • edited Loading

AndyAyersMS commented Nov 10, 2021

BruceForstall commented Nov 10, 2021 • edited by kunalspathak Loading

kunalspathak commented Feb 4, 2022 • edited Loading

kunalspathak commented Feb 7, 2022 • edited Loading

BruceForstall commented Feb 8, 2022

kunalspathak commented Apr 14, 2022

kunalspathak commented Apr 14, 2022

BruceForstall commented Apr 26, 2022

kunalspathak commented May 15, 2022

kunalspathak commented Jun 14, 2022

kunalspathak commented Nov 10, 2021 •

edited

Loading

EgorBo commented Nov 10, 2021 •

edited

Loading

BruceForstall commented Nov 10, 2021 •

edited by kunalspathak

Loading

kunalspathak commented Feb 4, 2022 •

edited

Loading

kunalspathak commented Feb 7, 2022 •

edited

Loading