
[mono] Improve inliner, re-use RyuJIT heuristics #34286

Open
EgorBo opened this issue Mar 30, 2020 · 5 comments

@EgorBo
Member

EgorBo commented Mar 30, 2020

RyuJIT's inlining heuristics (at least some of them) can be re-used in Mono, since our inliner is quite conservative and only takes IL size into account (20 bytes for the JIT, 30 bytes for the LLVM backend).
RyuJIT also estimates native code size and is able to inline methods above the IL-size threshold. It uses so-called observations and increases/decreases a benefit multiplier based on them (along with the estimated performance and size impact), e.g.:

  • Inline candidate has SIMD type args, locals or return value. Multiplier increased to %g
  • Inline candidate looks like a wrapper method. Multiplier increased to %g
  • Inline candidate has an arg that feeds a constant test. Multiplier increased to %g
  • Inline candidate is mostly loads and stores. Multiplier increased to %g
  • Inline candidate has arg that feeds range check. Multiplier increased to %g
  • Inline candidate has const arg that feeds a conditional. Multiplier increased to %g
  • Prejit root candidate has arg that feeds a conditional. Multiplier increased to %g
  • Inline candidate callsite is rare. Multiplier limited to %g
  • Inline candidate callsite is boring. Multiplier increased to %g
  • Inline candidate callsite is warm. Multiplier increased to %g
  • Inline candidate callsite is in a loop. Multiplier increased to %g
  • Inline candidate callsite is hot. Multiplier increased to %g
  • multiplier in instance constructors increased to %g
  • multiplier in methods of promotable struct increased to %g
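To make the observation-driven model concrete, here is a minimal sketch of how observations could feed a benefit multiplier that scales the inlining threshold. All type and member names are invented for illustration, and the weights are made up; this is not RyuJIT's or Mono's actual code.

```csharp
// Hypothetical sketch of an observation-driven inliner in the spirit
// of RyuJIT's model: observations bump a benefit multiplier, and the
// callee's estimated native size is compared against a scaled threshold.
class InlineCandidate
{
    public bool LooksLikeWrapper;
    public bool ArgFeedsConstantTest;
    public bool MostlyLoadsAndStores;
    public bool ConstArgFeedsConditional;
    public int CalleeNativeSizeEstimate;
}

static class InlineHeuristics
{
    const int BaseThreshold = 115; // assumed callsite size estimate

    public static double ComputeMultiplier(InlineCandidate c)
    {
        double multiplier = 1.0; // start neutral, grow per observation
        if (c.LooksLikeWrapper)         multiplier += 1.0;
        if (c.ArgFeedsConstantTest)     multiplier += 1.0;
        if (c.MostlyLoadsAndStores)     multiplier += 3.0;
        if (c.ConstArgFeedsConditional) multiplier += 3.0;
        return multiplier;
    }

    public static bool ShouldInline(InlineCandidate c)
    {
        // Inline if the estimated native size fits within the
        // multiplier-scaled threshold (cf. the RyuJIT log below).
        return c.CalleeNativeSizeEstimate <= BaseThreshold * ComputeMultiplier(c);
    }
}
```

The point is that a cheap callee with favorable observations (e.g. a const arg feeding a conditional) can clear a threshold that a raw IL-size cutoff would reject.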

We can start with some simple heuristics, e.g. Inline candidate has an arg that feeds a constant test, to inline methods above our IL size limit. For example:

public bool IsValid(IntPtr handle)
{
    return handle != IntPtr.Zero; // will IntPtr.op_Inequality be inlined here?
}

IL for System.IntPtr:op_Inequality(long,long):bool :

IL_0000  ldarga.s     0x0
IL_0002  ldfld        0x4000402
IL_0007  ldarga.s     0x1
IL_0009  ldfld        0x4000402
IL_000e  ceq
IL_0010  ldc.i4.0
IL_0011  ceq
IL_0013  ret

RyuJIT's log:

multiplier in methods of promotable struct increased to 3.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 4.
Inline candidate is mostly loads and stores.  Multiplier increased to 7.
Inline candidate has const arg that feeds a conditional.  Multiplier increased to 10. <-- !!!
Inline candidate callsite is boring.  Multiplier increased to 11.3.
calleeNativeSizeEstimate=112
callsiteNativeSizeEstimate=115
benefit multiplier=11.3
threshold=1299
Native estimate for function size is within threshold for inlining 11.2 <= 129.9 (multiplier = 11.3)
(success!)

Mono's JIT (without LLVM), on the other hand, simply refuses to inline it based on IntPtr.op_Inequality's IL size (here). So even a simple "candidate has a const arg that feeds a conditional" multiplier could help jump over the limit and produce faster and smaller codegen for this case.

/cc @BrzVlad @vargaz @lewurm @marek-safar

I recently tried to implement my own RyuJIT heuristic just to learn how it works: #33338 (comment)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 30, 2020
@EgorBo EgorBo added area-Codegen-JIT-mono and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 30, 2020
@BrzVlad
Member

BrzVlad commented Mar 30, 2020

I'm not sure how long this would take to do, but as a quick workaround we could also mark methods of the IntPtr class with [AggressiveInlining].
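The suggested workaround is just the standard attribute pattern. Below is a sketch on an invented stand-in type (the real change would go into System.IntPtr itself):

```csharp
using System.Runtime.CompilerServices;

// Sketch of the proposed workaround: decorating small operators with
// [AggressiveInlining] asks the JIT to inline them regardless of its
// usual IL-size limit. NativeHandle is a hypothetical illustration.
public readonly struct NativeHandle
{
    private readonly long _value;

    public NativeHandle(long value) => _value = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator !=(NativeHandle a, NativeHandle b)
        => a._value != b._value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator ==(NativeHandle a, NativeHandle b)
        => a._value == b._value;

    public override bool Equals(object obj)
        => obj is NativeHandle h && h._value == _value;

    public override int GetHashCode() => _value.GetHashCode();
}
```

The attribute is a hint, not a guarantee, but both RyuJIT and Mono honor it for small methods like these.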

@AndyAyersMS
Member

FWIW, I have ambitions of rewriting all the RyuJit inlining heuristics at some point. Some of what's there is worth keeping, but there are lots of areas that need improvement.

Pros:

  • observations -> heuristics -> decision is the right overall approach.

Cons:

  • RyuJit's code size model is based on an outdated, x86-based "state machine" and is not very accurate. It should perhaps take machine details (ABI, etc.) into account, but it might be better for inlining decisions not to depend (or to depend only weakly) on the target machine/ABI.
  • inline observations are driven by an IL walk with a very crude stack model, and so are not very accurate
  • heuristic weights are totally ad hoc; I'd hoped to derive these via machine learning, but ran into various complications (see notes)
    • the good news was that (for RyuJit) at least the code size estimate seemed like something that could be modeled fairly well
    • performance impact was not well modeled

Challenges:

  • Ultimately profitability needs to incorporate profile feedback in some form, whether it be online, offline, or synthetic
  • User code bases now expect certain behavior from inlining; any change in this area that is not strictly "inline more" will result in regressions, and such a change is hard to land if regression tolerance is low.
  • Inlining more will also result in regressions, though perhaps more subtle and harder to spot; and almost certainly the jit will be slower.

@GSPP

GSPP commented Apr 5, 2020

It seems to be a recurring theme that some optimizations are missed because the necessary analysis is either not available yet or too expensive altogether. For example, the design document says:

Currently the jit walks its IR in linear fashion deciding whether to inline each time it sees a candidate. If the decision is yes then the inlined code is spliced in place of the call and (because of the order of the walk) immediately scanned for inlining candidates. Thus the inlining is performed "depth first" and is done without much knowledge of the number or location of other candidates in the code stream. Inlining is done very early on before any significant analysis has been done to the IR -- there is a flow graph but no loop nesting, dataflow, or profile estimates are generally available.

Could precomputation help with this? There could be a build step that analyzes a set of assemblies and writes a summary useful for optimizations to disk. The summary could contain a call graph, for example. It could also contain profile data.

The analysis tool could also concretely evaluate every possible inlining decision by performing the inlining and then simplifying. This would expose inlining opportunities that result in code size reduction but that are not obvious to any simple heuristic.

Data useful for optimizations other than inlining could be included as well. A data-flow summary comes to mind (e.g. "this variable can never be null; this other variable is always greater than zero").

This technique would not amount to AOT compilation: the data would be neither machine-specific nor versioning-brittle.
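A minimal sketch of what such a per-assembly summary could contain, assuming a hypothetical format (every type and member name below is invented for illustration; no such format exists today):

```csharp
using System.Collections.Generic;

// Hypothetical on-disk summary that a build-time analysis step could
// emit for the JIT to consult when making inlining decisions.
public sealed class MethodSummary
{
    public string MethodToken;        // identifies the method
    public int ILSize;                // raw IL byte count
    public int EstimatedNativeSize;   // from the offline analyzer
    public List<string> Callees;      // edges of the call graph
    public double CallSiteHotness;    // from profile data, if any
    public List<string> DataflowFacts; // e.g. "arg0 != null"
}

public sealed class AssemblySummary
{
    public string AssemblyName;
    public Dictionary<string, MethodSummary> Methods = new();
}
```

The JIT could treat entries as "hints" (safe to ignore when stale) or, with stronger versioning guarantees, as "facts".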

@AndyAyersMS
Member

Could precomputation help with this? Data useful for optimizations other than inlining could be included as well...

Yes, it is one of the things we plan to look into -- using precompilation analysis to leave information for jitting, either as "facts" or "hints".

The analysis tool could also concretely evaluate every possible inlining decision

Combinatorics makes this infeasible: if there are N top-level call sites, there are 2^N possible ways to inline them, all of which may produce different codegen.
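The growth is easy to illustrate with a back-of-the-envelope count (hypothetical helper, not a real JIT API):

```csharp
// With N independent top-level call sites, each either inlined or
// not, there are 2^N configurations to evaluate exhaustively; nested
// candidates exposed by inlining make the real space even larger.
static class InlineSearchSpace
{
    public static long Combinations(int callSites) => 1L << callSites;
}
// InlineSearchSpace.Combinations(20) == 1_048_576 distinct configurations
```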

monojenkins pushed a commit to monojenkins/mono that referenced this issue Jun 6, 2020
Mono's inliner is quite conservative (see dotnet/runtime#34286), so we have to work around some limitations, e.g.
```csharp
static ReadOnlySpan<byte> Arr => new byte[] {1, 2, 3, 4};

// access example:
byte GetByte(int i) => Arr[i];
```
#### Current codegen for `GetByte` (with dotnet/runtime#37458 included)
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 83 ec 18             sub    $0x18,%rsp
   4:   c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
   8:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
   d:   48 b8 50 22 73 d2 6e    movabs $0x556ed2732250,%rax
  14:   55 00 00
  17:   48 89 e7                mov    %rsp,%rdi
  1a:   ff 10                   callq  *(%rax)   ;;;; get_Arr() is not inlined!
  1c:   83 7c 24 08 00          cmpl   $0x0,0x8(%rsp)
  21:   74 0b                   je     2e <gram_GetByte__int_+0x2e>
  23:   48 8b 04 24             mov    (%rsp),%rax
  27:   8a 00                   mov    (%rax),%al
  29:   48 83 c4 18             add    $0x18,%rsp
  2d:   c3                      retq
  2e:   48 b8 68 22 73 d2 6e    movabs $0x556ed2732268,%rax
  35:   55 00 00
  38:   bf 02 01 00 00          mov    $0x102,%edi
  3d:   ff 10                   callq  *(%rax)
```

As you can see, `get_Arr()` is not inlined. This happened because its calculated size is 60, and the major contributor is `ReadOnlySpan<>.ctor`, which is marked with [AggressiveInlining]: Mono inlined that `.ctor` into `get_Arr`, so the total cost of `get_Arr` rose to 60.

#### Codegen with this PR:
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 b8 00 6e 1f 91 11    movabs $0x5611911f6e00,%rax
   7:   56 00 00
   a:   8a 00                   mov    (%rax),%al
   c:   c3                      retq
```
Thus, the fix helped us inline more and even reduced the function size.

There are other places where we append method call costs to `inline_costs`, but I decided to start with this one (constructors only, and only for LLVM) to minimize possible regressions.

Possible regressions match current CoreCLR behavior, bad case:  [sharplab.io](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYm5JE1IMhNAN4MLNCxYDaAWRgYAFhAAmASXGSAFA+dvPMUkAeTE+CC5cFgBBAHNY2FxcXgUYdy5JXi4s2IBKAF0rax09A2IUBgAhXli/F1dvXKKLU2brC2YATm8AIgAJHtyNanb2rt6YQeHR4vJunskptrG53sWh5dn5iCWRmb1ujb2Z8Z6AdV39g96do6vTqEv90/Xp59We1yeTj4BCb4YAF9NMcmLpmGUKtVao56mcoNgxGIZI02q1Qe0APSYqo1OpuRoMXi4IkZLIwVwMPAMGAIZFgDAUhhwBgQADWmws0PxDTu1mB1DaJQhTAqABVVBhUaD0VdsQx3FT8AwuBAMDS6TAGTlcTD/JSMBAGMAYKTMlwmVlDQwJbgpU0MdZ5cAOOreG6SSzVer2ZzdTz4YjkVBGm8LALAUA===)
EgorBo added a commit to mono/mono that referenced this issue Jun 18, 2020
(#19930)

Co-authored-by: EgorBo <[email protected]>
@lambdageek lambdageek added this to the 6.0.0 milestone Jun 22, 2020
@SamMonoRT SamMonoRT modified the milestones: 6.0.0, 7.0.0 Jun 21, 2021
@SamMonoRT SamMonoRT modified the milestones: 7.0.0, 8.0.0 Aug 4, 2022
@SamMonoRT
Member

@jandupej - something for 9.0

@SamMonoRT SamMonoRT modified the milestones: 8.0.0, 9.0.0 Jul 25, 2023
@SamMonoRT SamMonoRT assigned matouskozak and unassigned jandupej Oct 25, 2023
@matouskozak matouskozak modified the milestones: 9.0.0, Future Feb 16, 2024

9 participants