
[mono] Improve inliner, re-use RyuJIT heuristics #34286

Open
EgorBo opened this issue Mar 30, 2020 · 5 comments

@EgorBo
Member

EgorBo commented Mar 30, 2020

RyuJIT's inlining heuristics (at least some of them) can be re-used in Mono, since our inliner is quite conservative and only takes IL size into account (20 bytes for the JIT, 30 bytes for the LLVM backend).
RyuJIT also estimates native code size and is able to inline methods above the IL-size threshold. It uses so-called observations and increases/decreases a benefit multiplier based on them (along with the estimated performance and size impact), e.g.:

  • Inline candidate has SIMD type args, locals or return value. Multiplier increased to %g
  • Inline candidate looks like a wrapper method. Multiplier increased to %g
  • Inline candidate has an arg that feeds a constant test. Multiplier increased to %g
  • Inline candidate is mostly loads and stores. Multiplier increased to %g
  • Inline candidate has arg that feeds range check. Multiplier increased to %g
  • Inline candidate has const arg that feeds a conditional. Multiplier increased to %g
  • Prejit root candidate has arg that feeds a conditional. Multiplier increased to %g
  • Inline candidate callsite is rare. Multiplier limited to %g
  • Inline candidate callsite is boring. Multiplier increased to %g
  • Inline candidate callsite is warm. Multiplier increased to %g
  • Inline candidate callsite is in a loop. Multiplier increased to %g
  • Inline candidate callsite is hot. Multiplier increased to %g
  • multiplier in instance constructors increased to %g
  • multiplier in methods of promotable struct increased to %g
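To make the observation-driven model concrete, here is a minimal sketch of how observations could feed a benefit multiplier that scales the inlining threshold. All type and member names are invented for illustration, and the weights are made up; this is not RyuJIT's or Mono's actual code.

```csharp
// Hypothetical sketch of an observation-driven inliner in the spirit
// of RyuJIT's model: observations bump a benefit multiplier, and the
// callee's estimated native size is compared against a scaled threshold.
class InlineCandidate
{
    public bool LooksLikeWrapper;
    public bool ArgFeedsConstantTest;
    public bool MostlyLoadsAndStores;
    public bool ConstArgFeedsConditional;
    public int CalleeNativeSizeEstimate;
}

static class InlineHeuristics
{
    const int BaseThreshold = 115; // assumed callsite size estimate

    public static double ComputeMultiplier(InlineCandidate c)
    {
        double multiplier = 1.0; // start neutral, grow per observation
        if (c.LooksLikeWrapper)         multiplier += 1.0;
        if (c.ArgFeedsConstantTest)     multiplier += 1.0;
        if (c.MostlyLoadsAndStores)     multiplier += 3.0;
        if (c.ConstArgFeedsConditional) multiplier += 3.0;
        return multiplier;
    }

    public static bool ShouldInline(InlineCandidate c)
    {
        // Inline if the estimated native size fits within the
        // multiplier-scaled threshold (cf. the RyuJIT log below).
        return c.CalleeNativeSizeEstimate <= BaseThreshold * ComputeMultiplier(c);
    }
}
```

The point is that a cheap callee with favorable observations (e.g. a const arg feeding a conditional) can clear a threshold that a raw IL-size cutoff would reject.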

We can start with some simple heuristics, e.g. Inline candidate has an arg that feeds a constant test, to inline methods above our IL size limit. For example:

public bool IsValid(IntPtr handle)
{
    return handle != IntPtr.Zero; // will IntPtr.op_Inequality be inlined here?
}

IL for System.IntPtr:op_Inequality(long,long):bool :

IL_0000  ldarga.s     0x0
IL_0002  ldfld        0x4000402
IL_0007  ldarga.s     0x1
IL_0009  ldfld        0x4000402
IL_000e  ceq
IL_0010  ldc.i4.0
IL_0011  ceq
IL_0013  ret

RyuJIT's log:

multiplier in methods of promotable struct increased to 3.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 4.
Inline candidate is mostly loads and stores.  Multiplier increased to 7.
Inline candidate has const arg that feeds a conditional.  Multiplier increased to 10. <-- !!!
Inline candidate callsite is boring.  Multiplier increased to 11.3.
calleeNativeSizeEstimate=112
callsiteNativeSizeEstimate=115
benefit multiplier=11.3
threshold=1299
Native estimate for function size is within threshold for inlining 11.2 <= 129.9 (multiplier = 11.3)
(success!)

Mono's JIT (without LLVM), on the other hand, simply refuses to inline it based on IntPtr.op_Inequality's IL size (here). So even a simple "candidate has a const arg that feeds a conditional" multiplier could help jump over the limit and produce faster and smaller codegen for this case.

/cc @BrzVlad @vargaz @lewurm @marek-safar

I recently tried to implement my own RyuJIT heuristic just to learn how it works: #33338 (comment)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 30, 2020
@EgorBo EgorBo added area-Codegen-JIT-mono and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 30, 2020
@BrzVlad
Member

BrzVlad commented Mar 30, 2020

I'm not sure how long this would take to do, but as a quick workaround we could also mark methods of the IntPtr class with [AggressiveInlining].
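The suggested workaround is just the standard attribute pattern. Below is a sketch on an invented stand-in type (the real change would go into System.IntPtr itself):

```csharp
using System.Runtime.CompilerServices;

// Sketch of the proposed workaround: decorating small operators with
// [AggressiveInlining] asks the JIT to inline them regardless of its
// usual IL-size limit. NativeHandle is a hypothetical illustration.
public readonly struct NativeHandle
{
    private readonly long _value;

    public NativeHandle(long value) => _value = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator !=(NativeHandle a, NativeHandle b)
        => a._value != b._value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator ==(NativeHandle a, NativeHandle b)
        => a._value == b._value;

    public override bool Equals(object obj)
        => obj is NativeHandle h && h._value == _value;

    public override int GetHashCode() => _value.GetHashCode();
}
```

The attribute is a hint, not a guarantee, but both RyuJIT and Mono honor it for small methods like these.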

@AndyAyersMS
Member

FWIW, I have ambitions of rewriting all the RyuJit inlining heuristics at some point. Some of what's there is worth keeping, but there are lots of areas that need improvement.

Pros:

  • observations -> heuristics -> decision is the right overall approach.

Cons:

  • RyuJit's code size model is based on an outdated, x86-based "state machine" and is not very accurate. It should perhaps take machine details (ABI, etc.) into account, but it might be better for inlining decisions not to depend (or to depend only weakly) on the target machine/ABI.
  • inline observations are driven by an IL walk with a very crude stack model, and so are not very accurate
  • heuristic weights are totally ad hoc; I'd hoped to derive these via machine learning, but ran into various complications (see notes)
    • the good news was that (for RyuJit) at least the code size estimate seemed like something that could be modeled fairly well
    • performance impact was not well modeled

Challenges:

  • Ultimately profitability needs to incorporate profile feedback in some form, whether it be online, offline, or synthetic
  • User code bases now expect certain behavior from inlining; any change in this area that is not strictly "inline more" will result in regressions, and such a change is hard to land if regression tolerance is low.
  • Inlining more will also result in regressions, though perhaps more subtle and harder to spot; and almost certainly the jit will be slower.

@GSPP

GSPP commented Apr 5, 2020

It seems to be a recurring theme that some optimizations are missed because the necessary analysis is either not available yet or too expensive altogether. For example, the design document says:

Currently the jit walks its IR in linear fashion deciding whether to inline each time it sees a candidate. If the decision is yes then the inlined code is spliced in place of the call and (because of the order of the walk) immediately scanned for inlining candidates. Thus the inlining is performed "depth first" and is done without much knowledge of the number or location of other candidates in the code stream. Inlining is done very early on before any significant analysis has been done to the IR -- there is a flow graph but no loop nesting, dataflow, or profile estimates are generally available.

Could precomputation help with this? There could be a build step that analyzes a set of assemblies and writes a summary useful for optimizations to disk. The summary could contain a call graph, for example. It could also contain profile data.

The analysis tool could also concretely evaluate every possible inlining decision by performing the inlining and then simplifying. This would expose inlining opportunities that result in code size reduction but that are not obvious to any simple heuristic.

Data useful for optimizations other than inlining could be included as well. A data-flow summary comes to mind (e.g. "this variable can never be null; this other variable is always greater than zero").

This technique would not amount to AOT compilation: the data would be neither machine-specific nor versioning-brittle.
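A minimal sketch of what such a per-assembly summary could contain, assuming a hypothetical format (every type and member name below is invented for illustration; no such format exists today):

```csharp
using System.Collections.Generic;

// Hypothetical on-disk summary that a build-time analysis step could
// emit for the JIT to consult when making inlining decisions.
public sealed class MethodSummary
{
    public string MethodToken;        // identifies the method
    public int ILSize;                // raw IL byte count
    public int EstimatedNativeSize;   // from the offline analyzer
    public List<string> Callees;      // edges of the call graph
    public double CallSiteHotness;    // from profile data, if any
    public List<string> DataflowFacts; // e.g. "arg0 != null"
}

public sealed class AssemblySummary
{
    public string AssemblyName;
    public Dictionary<string, MethodSummary> Methods = new();
}
```

The JIT could treat entries as "hints" (safe to ignore when stale) or, with stronger versioning guarantees, as "facts".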

@AndyAyersMS
Member

Could precomputation help with this? Data useful for optimizations other than inlining could be included as well...

Yes, it is one of the things we plan to look into -- using precompilation analysis to leave information for jitting, either as "facts" or "hints".

The analysis tool could also concretely evaluate every possible inlining decision

Combinatorics makes this infeasible: if there are N top-level call sites, there are 2^N possible ways to inline them, all of which may produce different codegen.
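The growth is easy to illustrate with a back-of-the-envelope count (hypothetical helper, not a real JIT API):

```csharp
// With N independent top-level call sites, each either inlined or
// not, there are 2^N configurations to evaluate exhaustively; nested
// candidates exposed by inlining make the real space even larger.
static class InlineSearchSpace
{
    public static long Combinations(int callSites) => 1L << callSites;
}
// InlineSearchSpace.Combinations(20) == 1_048_576 distinct configurations
```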

monojenkins pushed a commit to monojenkins/mono that referenced this issue Jun 6, 2020
Mono's inliner is quite conservative (see dotnet/runtime#34286), so we have to work around some limitations, e.g.
```csharp
static ReadOnlySpan<byte> Arr => new byte[] {1, 2, 3, 4};

// access example:
byte GetByte(int i) => Arr[i];
```
#### Current codegen for `GetByte` (with dotnet/runtime#37458 included)
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 83 ec 18             sub    $0x18,%rsp
   4:   c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
   8:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
   d:   48 b8 50 22 73 d2 6e    movabs $0x556ed2732250,%rax
  14:   55 00 00
  17:   48 89 e7                mov    %rsp,%rdi
  1a:   ff 10                   callq  *(%rax)   ;;;; get_Arr() is not inlined!
  1c:   83 7c 24 08 00          cmpl   $0x0,0x8(%rsp)
  21:   74 0b                   je     2e <gram_GetByte__int_+0x2e>
  23:   48 8b 04 24             mov    (%rsp),%rax
  27:   8a 00                   mov    (%rax),%al
  29:   48 83 c4 18             add    $0x18,%rsp
  2d:   c3                      retq
  2e:   48 b8 68 22 73 d2 6e    movabs $0x556ed2732268,%rax
  35:   55 00 00
  38:   bf 02 01 00 00          mov    $0x102,%edi
  3d:   ff 10                   callq  *(%rax)
```

As you can see, `get_Arr()` is not inlined. This happened because its calculated size is 60, and the major contributor is `ReadOnlySpan<>.ctor`, which is marked with [AggressiveInlining]: Mono inlined that `.ctor` into `get_Arr`, so the total cost of `get_Arr` rose to 60.

#### Codegen with this PR:
```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0:   48 b8 00 6e 1f 91 11    movabs $0x5611911f6e00,%rax
   7:   56 00 00
   a:   8a 00                   mov    (%rax),%al
   c:   c3                      retq
```
Thus, the fix helped us inline more and even reduced the function size.

There are other places where we append method call costs to `inline_costs`, but I decided to start with this one (constructors only, and only for LLVM) to minimize possible regressions.

Possible regressions match current CoreCLR behavior, bad case:  [sharplab.io](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYm5JE1IMhNAN4MLNCxYDaAWRgYAFhAAmASXGSAFA+dvPMUkAeTE+CC5cFgBBAHNY2FxcXgUYdy5JXi4s2IBKAF0rax09A2IUBgAhXli/F1dvXKKLU2brC2YATm8AIgAJHtyNanb2rt6YQeHR4vJunskptrG53sWh5dn5iCWRmb1ujb2Z8Z6AdV39g96do6vTqEv90/Xp59We1yeTj4BCb4YAF9NMcmLpmGUKtVao56mcoNgxGIZI02q1Qe0APSYqo1OpuRoMXi4IkZLIwVwMPAMGAIZFgDAUhhwBgQADWmws0PxDTu1mB1DaJQhTAqABVVBhUaD0VdsQx3FT8AwuBAMDS6TAGTlcTD/JSMBAGMAYKTMlwmVlDQwJbgpU0MdZ5cAOOreG6SSzVer2ZzdTz4YjkVBGm8LALAUA===)
EgorBo added a commit to mono/mono that referenced this issue Jun 18, 2020
(#19930)

Co-authored-by: EgorBo <[email protected]>
@lambdageek lambdageek added this to the 6.0.0 milestone Jun 22, 2020
@SamMonoRT SamMonoRT modified the milestones: 6.0.0, 7.0.0 Jun 21, 2021
@SamMonoRT SamMonoRT modified the milestones: 7.0.0, 8.0.0 Aug 4, 2022
@SamMonoRT
Member

@jandupej - something for 9.0

@SamMonoRT SamMonoRT modified the milestones: 8.0.0, 9.0.0 Jul 25, 2023
@SamMonoRT SamMonoRT assigned matouskozak and unassigned jandupej Oct 25, 2023
@matouskozak matouskozak modified the milestones: 9.0.0, Future Feb 16, 2024

9 participants