[mono] Improve inliner, re-use RyuJIT heuristics #34286
I'm not sure how long this would take to do, but as a quick workaround we could also mark methods from the IntPtr class as [AggressiveInlining].
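For illustration only, here is a minimal sketch of what marking such a method with [AggressiveInlining] could look like; the attribute and `MethodImplOptions` are real, but the `NativeInt` type below is a made-up stand-in rather than the actual `IntPtr` source:

```csharp
using System.Runtime.CompilerServices;

// Hypothetical stand-in type; the real workaround would annotate IntPtr's operators.
public readonly struct NativeInt
{
    private readonly long _value;
    public NativeInt(long value) => _value = value;

    // AggressiveInlining asks the JIT to inline even when the IL size
    // exceeds Mono's usual 20/30-byte thresholds.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator ==(NativeInt a, NativeInt b) => a._value == b._value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool operator !=(NativeInt a, NativeInt b) => a._value != b._value;

    public override bool Equals(object obj) => obj is NativeInt other && other._value == _value;
    public override int GetHashCode() => _value.GetHashCode();
}
```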
FWIW I have ambitions of rewriting all the RyuJIT inlining heuristics at some point. Some of what's there is worth keeping, but there are lots of areas that need improving. Pros:
Cons:
Challenges:
It seems to be a recurring theme that some optimizations are too expensive because the necessary analysis is not available yet or is too expensive altogether. For example, the design document says:
Could precomputation help with this? There could be a build step that analyzes a set of assemblies and writes a summary useful for optimizations to disk. The summary could contain a call graph, for example. It could also contain profile data. The analysis tool could also concretely evaluate every possible inlining decision by performing the inlining and then simplifying. This would expose inlining opportunities that result in code size reduction but that are not obvious to any simple heuristic. Data useful for optimizations other than inlining could be included as well. A data flow summary comes to mind (e.g. "this variable can never be null; this other variable always is greater than zero"). This technique would not be AOT. The data would not be machine-specific or versioning-brittle.
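As a purely hypothetical illustration (none of these types exist in the runtime; the names are invented), such a per-method summary might be shaped roughly like this:

```csharp
using System.Collections.Generic;

// Hypothetical sketch only: the rough shape of a precomputed, on-disk summary
// for one method, carrying the kind of "facts" and "hints" discussed above.
public sealed record MethodSummary(
    string MethodName,                  // e.g. "System.IntPtr::op_Inequality"
    int IlSize,                         // raw IL size, as the inliner sees it today
    int EstimatedSizeIfInlined,         // from an offline "inline, then simplify" pass
    IReadOnlyList<string> Callees,      // call-graph edges out of this method
    IReadOnlyList<long> ProfileCounts,  // optional profile data per call site
    IReadOnlyList<string> Facts);       // e.g. "arg0 is never null", "result > 0"
```

A build step would serialize a table of these per assembly, and the JIT would consult it instead of re-deriving the same information while compiling.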
Yes, it is one of the things we plan to look into -- using precompilation analysis to leave information for jitting, either as "facts" or "hints".
Combinatorics makes this infeasible. If there are N top level call sites, there are 2^N possible ways to inline, all of which may produce different codegen.
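To put rough numbers on that growth (plain arithmetic, nothing runtime-specific):

```csharp
using System;

// 2^N combinations of "inline / don't inline" across N top-level call sites
// quickly outgrows anything an exhaustive offline evaluation could cover.
for (int n = 10; n <= 40; n += 10)
    Console.WriteLine($"{n} call sites -> {Math.Pow(2, n):N0} combinations");

// Output:
// 10 call sites -> 1,024 combinations
// 20 call sites -> 1,048,576 combinations
// 30 call sites -> 1,073,741,824 combinations
// 40 call sites -> 1,099,511,627,776 combinations
```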
Mono's inliner is quite conservative (see dotnet/runtime#34286), so we have to work around some limitations, e.g.:

```csharp
static ReadOnlySpan<byte> Arr => new byte[] {1, 2, 3, 4};

// access example:
byte GetByte(int i) => Arr[i];
```

#### Current codegen for `GetByte` (with dotnet/runtime#37458 included)

```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0: 48 83 ec 18             sub    $0x18,%rsp
   4: c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
   8: c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
   d: 48 b8 50 22 73 d2 6e    movabs $0x556ed2732250,%rax
  14: 55 00 00
  17: 48 89 e7                mov    %rsp,%rdi
  1a: ff 10                   callq  *(%rax)        ;;;; get_Arr() is not inlined!
  1c: 83 7c 24 08 00          cmpl   $0x0,0x8(%rsp)
  21: 74 0b                   je     2e <gram_GetByte__int_+0x2e>
  23: 48 8b 04 24             mov    (%rsp),%rax
  27: 8a 00                   mov    (%rax),%al
  29: 48 83 c4 18             add    $0x18,%rsp
  2d: c3                      retq
  2e: 48 b8 68 22 73 d2 6e    movabs $0x556ed2732268,%rax
  35: 55 00 00
  38: bf 02 01 00 00          mov    $0x102,%edi
  3d: ff 10                   callq  *(%rax)
```

As you can see, `get_Arr()` is not inlined. That happened because its calculated size is 60, and the major contributor is `ReadOnlySpan<>.ctor`, which is marked with `[AggressiveInlining]`: Mono inlined that `.ctor` into `get_Arr`, so the total cost for `get_Arr` increased to 60.

#### Codegen with this PR:

```asm
0000000000000000 <gram_GetByte__int_>:
<BB>:1
   0: 48 b8 00 6e 1f 91 11    movabs $0x5611911f6e00,%rax
   7: 56 00 00
   a: 8a 00                   mov    (%rax),%al
   c: c3                      retq
```

Thus, the fix helped to inline more and even reduced function size. There are other places where we append method call costs to `inline_costs`, but I decided to start with this one (constructors only, and only for LLVM) to minimize possible regressions. Possible regressions match current CoreCLR behavior; bad case: [sharplab.io](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYm5JE1IMhNAN4MLNCxYDaAWRgYAFhAAmASXGSAFA+dvPMUkAeTE+CC5cFgBBAHNY2FxcXgUYdy5JXi4s2IBKAF0rax09A2IUBgAhXli/F1dvXKKLU2brC2YATm8AIgAJHtyNanb2rt6YQeHR4vJunskptrG53sWh5dn5iCWRmb1ujb2Z8Z6AdV39g96do6vTqEv90/Xp59We1yeTj4BCb4YAF9NMcmLpmGUKtVao56mcoNgxGIZI02q1Qe0APSYqo1OpuRoMXi4IkZLIwVwMPAMGAIZFgDAUhhwBgQADWmws0PxDTu1mB1DaJQhTAqABVVBhUaD0VdsQx3FT8AwuBAMDS6TAGTlcTD/JSMBAGMAYKTMlwmVlDQwJbgpU0MdZ5cAOOveG6SSzVer2ZzdTz4YjkVBGm8LALAUA===)
@jandupej - something for 9.0
RyuJIT's inlining heuristics (at least some of them) can be re-used in Mono, since our inliner is quite conservative and only takes IL size into account (20 bytes for the JIT, 30 bytes for the LLVM backend).
RyuJIT also estimates native code size and is able to inline methods above that threshold. It uses so-called observations and increases/decreases a benefit multiplier based on them (as well as on the estimated performance and size impact), e.g.:
We can start with some simple heuristics, e.g. "Inline candidate has an arg that feeds a constant test", to inline methods above our IL size limit. For example, the IL for
`System.IntPtr:op_Inequality(long,long):bool`:
RyuJIT's log:
Mono's JIT (without LLVM), meanwhile, simply refuses to inline it based on `IntPtr.op_Inequality`'s IL size (here). So even a simple "candidate has a const arg that feeds a conditional" multiplier could help it jump over the limit and produce faster and smaller codegen for this case; see the sketch below.

/cc @BrzVlad @vargaz @lewurm @marek-safar
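A minimal sketch of that kind of observation-driven multiplier, assuming invented constants and a hypothetical `ShouldInline` helper (this is not Mono or RyuJIT code):

```csharp
// Hypothetical sketch of an observation-driven inline decision, loosely modeled
// on RyuJIT's benefit multiplier. The constants and the ShouldInline signature
// are invented for illustration only.
public static class InlineHeuristicSketch
{
    private const int BaseIlSizeLimit = 20; // Mono JIT's current IL-size cutoff

    public static bool ShouldInline(
        int calleeIlSize,
        bool argFeedsConstantTest,   // observation: a caller arg feeds a constant test
        bool calleeReturnsConstant)  // observation: callee likely folds to a constant
    {
        double benefitMultiplier = 1.0;
        if (argFeedsConstantTest) benefitMultiplier += 1.5;  // branch likely folds away
        if (calleeReturnsConstant) benefitMultiplier += 1.0; // call likely disappears

        // A higher multiplier lets a candidate "jump over" the raw IL-size limit.
        return calleeIlSize <= BaseIlSizeLimit * benefitMultiplier;
    }
}
```

With the "arg feeds a constant test" observation set, a candidate somewhat above the 20-byte base limit (such as `IntPtr.op_Inequality` here) would clear the adjusted threshold, which is the effect described above.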
I recently tried to implement my own RyuJIT heuristic just to learn how it works: #33338 (comment)