-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[mono][interp] Remove no_inlining functionality for dead bblocks #110468
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Many methods in the BCL, especially hwintrins related, contain a lot of code that is detected as dead during compilation. On mono, inlining happens during IL import and a lot of optimizations are run as later passes. This exposed the issue where we have a lot of dead code bloat from inlining, with optimizations later running on it. A simple solution for this problem was tracking jump counts for each bblock (dotnet#97514), which are initialized when bblocks are first created, before IL import stage. Then a small set of IL import level optimizations were added, in order to reduce the jump targets of each bblock. As we were further importing IL, if we reached a bblock with 0 jump targets, we would disable inlining into it, in order to reduce code bloat. Disabling code emit altogether was too challenging. Another limitation of this approach was that we would fail to detect dead code if it was part of a loop. The results were good however, by reducing mem usage in `System.Numerics.Tensor.Tests` from 6GB to 600MB. For an unrelated issue, the order in which we generate bblocks was redesigned in order to account for bblock stack state initialization in weird control flow scenarios (dotnet#108731). This was achieved by deferring IL import into bblocks that were not yet reached from other live bblocks. A side effect of this is that we no longer generate code at all in unreachable bblocks, completely superseding the previous approach while addressing both the problems of inlining into loops or generating IR for dead IL. In the previously mentioned test suite, this further reduced the memory usage to 300MB. Remnants of the unnecessary `no_inlining` approach still lingered in the code, leading to disabling of inline optimization in some reachable code. This triggered a significant performance regression which this PR addresses.
BrzVlad
requested review from
kotlarmilos,
steveisok and
vitek-karas
as code owners
December 6, 2024 10:24
Tagging subscribers to this area: @BrzVlad, @kotlarmilos |
Open
3 tasks
This was referenced Dec 6, 2024
kotlarmilos
approved these changes
Dec 9, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
hez2010
pushed a commit
to hez2010/runtime
that referenced
this pull request
Dec 14, 2024
…net#110468) Many methods in the BCL, especially hwintrins related, contain a lot of code that is detected as dead during compilation. On mono, inlining happens during IL import and a lot of optimizations are run as later passes. This exposed the issue where we have a lot of dead code bloat from inlining, with optimizations later running on it. A simple solution for this problem was tracking jump counts for each bblock (dotnet#97514), which are initialized when bblocks are first created, before IL import stage. Then a small set of IL import level optimizations were added, in order to reduce the jump targets of each bblock. As we were further importing IL, if we reached a bblock with 0 jump targets, we would disable inlining into it, in order to reduce code bloat. Disabling code emit altogether was too challenging. Another limitation of this approach was that we would fail to detect dead code if it was part of a loop. The results were good however, by reducing mem usage in `System.Numerics.Tensor.Tests` from 6GB to 600MB. For an unrelated issue, the order in which we generate bblocks was redesigned in order to account for bblock stack state initialization in weird control flow scenarios (dotnet#108731). This was achieved by deferring IL import into bblocks that were not yet reached from other live bblocks. A side effect of this is that we no longer generate code at all in unreachable bblocks, completely superseding the previous approach while addressing both the problems of inlining into loops or generating IR for dead IL. In the previously mentioned test suite, this further reduced the memory usage to 300MB. Remnants of the unnecessary `no_inlining` approach still lingered in the code, leading to disabling of inline optimization in some reachable code. This triggered a significant performance regression which this PR addresses.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Many methods in the BCL, especially hwintrins related, contain a lot of code that is detected as dead during compilation. On mono, inlining happens during IL import and a lot of optimizations are run as later passes. This exposed the issue where we have a lot of dead code bloat from inlining, with optimizations later running on it.
A simple solution for this problem was tracking jump counts for each bblock (#97514), which are initialized when bblocks are first created, before IL import stage. Then a small set of IL import level optimizations were added, in order to reduce the jump targets of each bblock. As we were further importing IL, if we reached a bblock with 0 jump targets, we would disable inlining into it, in order to reduce code bloat. Disabling code emit altogether was too challenging. Another limitation of this approach was that we would fail to detect dead code if it was part of a loop. The results were good however, by reducing mem usage in
System.Numerics.Tensor.Tests
from 6GB to 600MB.For an unrelated issue, the order in which we generate bblocks was redesigned in order to account for bblock stack state initialization in weird control flow scenarios (#108731). This was achieved by deferring IL import into bblocks that were not yet reached from other live bblocks. A side effect of this is that we no longer generate code at all in unreachable bblocks, completely superseding the previous approach while addressing both the problems of inlining into loops or generating IR for dead IL. In the previously mentioned test suite, this further reduced the memory usage to 300MB.
Remnants of the unnecessary
no_inlining
approach still lingered in the code, leading to disabling of inline optimization in some reachable code. This triggered a significant performance regression which this PR addresses.dotnet/perf-autofiling-issues#45939
dotnet/perf-autofiling-issues#45894
dotnet/perf-autofiling-issues#45945