[mono][interp] Remove no_inlining functionality for dead bblocks #110468

Merged — 1 commit into dotnet:main on Dec 10, 2024

Conversation

@BrzVlad (Member) commented Dec 6, 2024

Many methods in the BCL, especially hardware-intrinsics (hwintrins) related ones, contain a lot of code that is detected as dead during compilation. On Mono, inlining happens during IL import, while most optimizations run as later passes. This exposed an issue where inlining produced a lot of dead-code bloat, with the later optimization passes then running over it.

A simple solution to this problem was tracking a jump count for each bblock (#97514), initialized when bblocks are first created, before the IL import stage. A small set of IL-import-level optimizations was then added to reduce the number of jump targets of each bblock. While importing IL, if we reached a bblock with 0 jump targets, we would disable inlining into it in order to reduce code bloat; disabling code emission altogether was too challenging. Another limitation of this approach was that we would fail to detect dead code that was part of a loop. The results were nevertheless good, reducing memory usage in `System.Numerics.Tensor.Tests` from 6 GB to 600 MB.
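
For illustration, here is a minimal C sketch of that jump-count mechanism. The struct, field and function names are invented for the example and do not correspond to the actual mono interpreter sources:

```c
#include <glib.h>

/* Minimal sketch of the per-bblock jump-count idea from #97514.
 * All names here are illustrative, not the real interpreter ones. */
typedef struct {
	int il_offset;
	int jump_targets;     /* number of jumps still known to target this bblock */
	gboolean no_inlining; /* set once the bblock is considered dead */
} SketchBasicBlock;

/* Called by an IL-import-level optimization when a branch to `bb` is
 * eliminated, e.g. because its condition folded to a constant. */
static void
sketch_remove_jump_target (SketchBasicBlock *bb)
{
	g_assert (bb->jump_targets > 0);
	if (--bb->jump_targets == 0) {
		/* Nothing targets this bblock anymore. Code is still emitted for it
		 * (suppressing emission entirely was too hard), but inlining into it
		 * is disabled to limit dead-code bloat. */
		bb->no_inlining = TRUE;
	}
}
```

During IL import, the inliner would then refuse to inline a callee into any bblock whose `no_inlining` flag was set.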

For an unrelated issue, the order in which we generate bblocks was redesigned to account for bblock stack state initialization in unusual control-flow scenarios (#108731). This was achieved by deferring IL import into bblocks that had not yet been reached from other live bblocks. A side effect is that we no longer generate any code at all in unreachable bblocks, completely superseding the previous approach while addressing both of its problems: failing to detect dead code inside loops and generating IR for dead IL. In the previously mentioned test suite, this further reduced memory usage to 300 MB.
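
The sketch below, again with invented names rather than the actual interpreter ones, shows why the reachability-driven import order makes the jump-count tracking unnecessary: bblocks are imported from a worklist seeded with the entry block, so anything never reached simply gets no IR at all:

```c
#include <glib.h>

/* Sketch of the deferred-import idea from #108731, using invented names:
 * IL is imported only for bblocks reached from already-live bblocks. */
typedef struct SketchBB {
	int il_offset;
	gboolean reached;
	int n_out;
	struct SketchBB **out_bbs; /* control-flow successors of this bblock */
} SketchBB;

/* Stand-in for the real per-bblock IL import; here it only reports which
 * bblocks actually get IR generated. */
static void
sketch_import_bblock_il (SketchBB *bb)
{
	g_print ("importing IL for bblock at offset 0x%x\n", bb->il_offset);
}

static void
sketch_import_reachable (SketchBB *entry)
{
	GQueue *worklist = g_queue_new ();
	entry->reached = TRUE;
	g_queue_push_tail (worklist, entry);

	while (!g_queue_is_empty (worklist)) {
		SketchBB *bb = g_queue_pop_head (worklist);
		sketch_import_bblock_il (bb);
		for (int i = 0; i < bb->n_out; i++) {
			if (!bb->out_bbs [i]->reached) {
				bb->out_bbs [i]->reached = TRUE;
				g_queue_push_tail (worklist, bb->out_bbs [i]);
			}
		}
	}
	g_queue_free (worklist);
	/* Bblocks never reached from the entry stay un-imported: no dead IR
	 * is generated and nothing is ever inlined into dead code. */
}
```

In the real interpreter the successors are discovered while the IL of a live bblock is being imported, but the effect is the same: dead bblocks are never imported, so no dead IR exists for later passes to churn through.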

Remnants of the now-unnecessary `no_inlining` approach still lingered in the code, disabling inlining in some reachable code. This triggered a significant performance regression, which this PR addresses:

dotnet/perf-autofiling-issues#45939
dotnet/perf-autofiling-issues#45894
dotnet/perf-autofiling-issues#45945

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

@kotlarmilos (Member) left a comment:

LGTM!

@BrzVlad merged commit ad30479 into dotnet:main on Dec 10, 2024
68 of 70 checks passed
hez2010 pushed a commit to hez2010/runtime that referenced this pull request Dec 14, 2024