Array interface method devirtualization #62457
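Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

The jit cannot devirtualize interface calls made through arrays, because this aspect of arrays has a special implementation in the runtime. I noticed this a while back but never wrote up anything. I was reminded of this again by #62266.

Consider a simple program like:

using System;
using System.Collections.Generic;
class X {
    public static void Main() {
        IList<int> list = new int[10];
        for (int i = 0; i < 100; i++) {
            foreach (var _ in list) {
                // nothing
            }
        }
    }
}

Here all types are known and we should be able to de-abstract this and come close to the equivalent for loop.

Currently none of the interface calls devirtualize or inline, the method allocates, and the try/finally can't be optimized away, despite the empty Dispose from the enumerator.

Fixing all this requires resolving a number of issues:

1. resolveVirtualMethodHelper on the jit interface calls CanCastToInterface -- this bypasses special checks in CanCastTo that handle interface methods for arrays.
2. [...] SZArrayHelper.GetEnumerator<T> which is a generic method, not an instance method on a generic class. So the context handling in resolveVirtualMethodHelper needs to be updated here and also over in impDevirtualizeMethod -- currently we always assume class context.
3. [...] GetEnumerator call, but this happens "too late" to devirtualize the subsequent MoveNext and get_Current calls through the enumerator. The fix here is likely to mark these GetEnumerator methods as intrinsics and extend getSpecialIntrinsicExactReturnType to propagate the right type downstream during importation.
4. [...] GetEnumerator call is too big to inline by default. Likely we can just mark this with AggressiveInlining. But pragmatically we might want to boost inlining for methods that return a type (especially a sealed type) that is more derived than the declared return type. Doing so likely requires resolving more tokens during the pre-inline scan, but perhaps we're already doing this for newobj and it might not be too costly.
5. [...] MoveNext and Dispose as we see the improved type flowing out of the inlined method -- but not the call to get_Current -- not sure why just yet.

I will point out that the "equivalent" direct code (with loop non-empty to avoid it being optimized away entirely)

public static int M_enum()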
{
    int[] a = new int[1000];
    int result = 0;
    foreach (var v in a)
    {
        result ^= v;
    }
    return result;
}

is marginally less efficient than the for loop version:

public static int M()
{
    int[] a = new int[1000];
    int result = 0;
    for (int i = 0; i < a.Length; i++)
    {
        result ^= a[i];
    }
    return result;
}

In the more general case where we don't know the exact collection class, we might hope PGO would get us to an equivalent place. But here there are additional challenges.
This is perhaps the simplest case of de-abstraction. Other collection types have more complex enumerators.
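To make the interface calls discussed in the issue concrete, here is a minimal sketch of roughly what the C# compiler expands a foreach over an IList<int> into, next to the direct array loop the JIT would ideally reach. The class and method names (ForeachExpansion, SumViaInterface, SumDirect) are made up for illustration, and the expansion is a source-level approximation rather than actual compiler output.

```csharp
using System;
using System.Collections.Generic;

public static class ForeachExpansion
{
    // Approximately what "foreach (int v in list) sum += v;" turns into.
    // Every call on 'list' and 'e' below is an interface call.
    public static int SumViaInterface(IList<int> list)
    {
        int sum = 0;
        IEnumerator<int> e = list.GetEnumerator();   // IEnumerable<int>.GetEnumerator
        try
        {
            while (e.MoveNext())                     // IEnumerator.MoveNext
            {
                sum += e.Current;                    // IEnumerator<int>.get_Current
            }
        }
        finally
        {
            e?.Dispose();                            // IDisposable.Dispose
        }
        return sum;
    }

    // The "equivalent" loop when the collection is known to be an int[].
    public static int SumDirect(int[] a)
    {
        int sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            sum += a[i];
        }
        return sum;
    }
}
```

When the argument is known to be an int[], de-abstraction means getting from the first method to something close to the second: the four interface calls resolved and inlined, the enumerator no longer heap allocated, and the try/finally removed.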
Update jit and runtime to devirtualize interface calls on arrays. This resolves the first two issues noted in dotnet#62457.
Key loader log diff in cwt case #62497 (comment). In the BAD case the stub desc is created early during devirt.
Update jit and runtime to devirtualize interface calls on arrays over non-shared element types. Shared types are not handled. The resulting enumerator calls would not inline as they are on a different class. This addresses the first two issues noted in dotnet#62457.
I have a version that backs off from devirtualizing arrays over shared types. Still trying to decide if there's sufficient benefit in this overall. @jkotas is it feasible to change [...]

It would de-optimize it. The interface method calls would have to go through unboxing instantiating stubs.

Ok, thanks. Keeping it as a class, we'd need to rely on object stack allocation to drive promotion. But we can't rely on object stack allocation until we can devirtualize the subsequent calls, so that we know the enumerator can't escape. That is, with the current information flow in the JIT, we'd need to know what type of enumerator was going to be returned without having to inline [...]

And that would only work well when we directly devirtualize. If we also need guarded devirtualization to guess at the type, then the flow from the known and unknown enumerator types merges before the calls on the enumerator, and we can't conditionally disambiguate this in the GDVs that those calls inspire. So even if we know the type, it will again look like the enumerator object might escape.

Seems like too many missing pieces here and unknowable overall benefit. Going to shelve this for now.

Latest changes here: https://github.com/AndyAyersMS/runtime/tree/ArrayInterfaceDevirt
Update JIT and runtime to devirtualize interface calls on arrays over non-shared element types. Shared types are not (yet) handled. Add intrinsic and inlining attributes to key methods in the BCL. This allows the JIT to devirtualize and inline enumerator creation and devirtualize and inline all methods that access the enumerator. And this in turn allows the enumerator to be stack allocated. However, the enumerator fields are not (yet) physically promoted, because of an optimization in the BCL to return a static empty array enumerator. So the object being accessed later is ambiguous. Progress towards dotnet#62457.
I have revived this. Now that we have stack allocation and physical promotion, some of the issues noted above are more tractable. Latest changes here: https://github.com/AndyAyersMS/runtime/tree/ArrayInterfaceDevirtV2

This includes fixes for points (4) and (5) above: we special-case knowledge of the return type early on, so the subsequent calls devirtualize, and we mark them as aggressive inline candidates. With this the enumerator object is stack allocated(*) and all methods that touch its fields are devirtualized and inlined.

So things are set up to enable physical promotion, but it doesn't happen yet because of the empty array optimization done here:
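The shape of that optimization can be sketched as follows. This is an illustration under stated assumptions, not the BCL source: SketchArrayEnumerator<T> and Create are hypothetical names, and the fields (index, end index, array) are only modeled on the stores visible in the JIT listing in this comment.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for the runtime's array enumerator.
public sealed class SketchArrayEnumerator<T> : IEnumerator<T>
{
    // One shared instance is handed out for every empty array.
    public static readonly SketchArrayEnumerator<T> Empty =
        new SketchArrayEnumerator<T>(Array.Empty<T>(), 0);

    private readonly T[] _array;
    private int _index;
    private readonly int _endIndex;

    private SketchArrayEnumerator(T[] array, int endIndex)
    {
        _array = array;
        _index = -1;
        _endIndex = endIndex;
    }

    // Two different objects can flow out of this method: the shared static
    // instance, or a freshly allocated one. Callers see both definitions.
    public static SketchArrayEnumerator<T> Create(T[] array)
    {
        int length = array.Length;
        return length == 0 ? Empty : new SketchArrayEnumerator<T>(array, length);
    }

    public bool MoveNext()
    {
        int next = _index + 1;
        if ((uint)next < (uint)_endIndex)
        {
            _index = next;
            return true;
        }
        _index = _endIndex;
        return false;
    }

    public T Current
    {
        get
        {
            if ((uint)_index >= (uint)_endIndex)
            {
                throw new InvalidOperationException();
            }
            return _array[_index];
        }
    }

    object IEnumerator.Current => Current;

    public void Reset() => _index = -1;

    public void Dispose() { }
}
```

Only the freshly allocated object is a candidate for stack allocation, so after inlining Create the later field accesses may refer either to a stack local or to the static instance.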
This pattern causes the enumerator methods to see two reaching defs, one from the stack allocated case and another from the static. If the array length is known we might be able to propagate it to the conditional above and eliminate one case or the other, but if the length is not known (which will commonly be the case) this won't help.

Broadly speaking, if we can stack allocate then we'd be better off if the empty case wasn't handled like this, but it's not clear yet how to anticipate and/or undo that optimization...

(*) Unfortunately not in the example above: there the allocation site is in a loop, and we don't yet support stack allocation when the site is in a loop. So we're looking at the simpler case

public static void Main()
{
    IList<int> list = new int[10];
    foreach (var _ in list)
    {
        // nothing
    }
}

and with the changes above the JIT produces

; Assembly listing for method X:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; 3 inlinees with PGO data; 4 single block inlinees; 2 inlinees without PGO data
; Final local variable assignments
;
; V00 loc0 [V00,T02] ( 9, 47 ) byref -> rcx class-hnd exact single-def <System.SZGenericArrayEnumerator`1[int]>
; V01 OutArgs [V01 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; V02 tmp1 [V02,T05] ( 2, 2 ) byref -> rcx class-hnd "Inline return value spill temp" <System.Collections.Generic.IEnumerator`1[int]>
;* V03 tmp2 [V03 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "Inlining Arg" <int[]>
; V04 tmp3 [V04,T06] ( 2, 2 ) ref -> rax class-hnd exact single-def "Inline stloc first use temp" <int[]>
;* V05 tmp4 [V05,T07] ( 0, 0 ) int -> zero-ref single-def "Inline stloc first use temp"
;* V06 tmp5 [V06 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <System.SZGenericArrayEnumerator`1[int]>
;* V07 tmp6 [V07,T08] ( 0, 0 ) int -> zero-ref single-def
;* V08 tmp7 [V08 ] ( 0, 0 ) ubyte -> zero-ref "Inlining Arg"
;* V09 tmp8 [V09 ] ( 0, 0 ) ubyte -> zero-ref "Inlining Arg"
;* V10 tmp9 [V10 ] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
; V11 tmp10 [V11,T04] ( 3, 20 ) int -> rax "Inline stloc first use temp"
; V12 tmp11 [V12 ] ( 5, 5 ) struct (24) [rbp-0x18] do-not-enreg[XSF] must-init addr-exposed "stack allocated ref class temp" <System.SZGenericArrayEnumerator`1[int]>
; V13 tmp12 [V13,T00] ( 2, 64 ) ref -> rdx "arr expr"
; V14 tmp13 [V14,T01] ( 2, 64 ) int -> rax "index expr"
; V15 PSPSym [V15,T09] ( 1, 1 ) long -> [rbp-0x20] do-not-enreg[V] "PSPSym"
; V16 cse0 [V16,T03] ( 4, 24 ) int -> rax "CSE #02: aggressive"
;
; Lcl frame size = 64
G_M9330_IG01: ;; offset=0x0000
push rbp
sub rsp, 64
lea rbp, [rsp+0x40]
vxorps xmm4, xmm4, xmm4
vmovdqu xmmword ptr [rbp-0x18], xmm4
xor eax, eax
mov qword ptr [rbp-0x08], rax
mov qword ptr [rbp-0x20], rsp
;; size=29 bbWeight=1 PerfScore 6.33
G_M9330_IG02: ;; offset=0x001D
mov rcx, 0x7FFF9098E958 ; int[]
mov edx, 10
call CORINFO_HELP_NEWARR_1_VC
mov rcx, 0x7FFF91289818 ; System.SZGenericArrayEnumerator`1[int]
mov qword ptr [rbp-0x18], rcx
mov dword ptr [rbp-0x10], -1
mov dword ptr [rbp-0x0C], 10
mov gword ptr [rbp-0x08], rax
lea rcx, [rbp-0x18]
align [0 bytes for IG03]
;; size=56 bbWeight=1 PerfScore 6.25
G_M9330_IG03: ;; offset=0x0055
mov eax, dword ptr [rcx+0x08]
inc eax
cmp eax, dword ptr [rcx+0x0C]
jb SHORT G_M9330_IG05
;; size=10 bbWeight=8 PerfScore 50.00
G_M9330_IG04: ;; offset=0x005F
mov eax, dword ptr [rcx+0x0C]
mov dword ptr [rcx+0x08], eax
jmp SHORT G_M9330_IG09
;; size=8 bbWeight=1 PerfScore 5.00
G_M9330_IG05: ;; offset=0x0067
mov dword ptr [rcx+0x08], eax
mov eax, dword ptr [rcx+0x08]
cmp eax, dword ptr [rcx+0x0C]
jae SHORT G_M9330_IG08
;; size=11 bbWeight=4 PerfScore 28.00
G_M9330_IG06: ;; offset=0x0072
mov rdx, gword ptr [rcx+0x10]
cmp eax, dword ptr [rdx+0x08]
jae SHORT G_M9330_IG07
jmp SHORT G_M9330_IG03
;; size=11 bbWeight=16 PerfScore 128.00
G_M9330_IG07: ;; offset=0x007D
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
G_M9330_IG08: ;; offset=0x0083
mov ecx, eax
call [System.ThrowHelper:ThrowInvalidOperationException_EnumCurrent(int)]
int3
;; size=9 bbWeight=0 PerfScore 0.00
G_M9330_IG09: ;; offset=0x008C
add rsp, 64
pop rbp
ret
;; size=6 bbWeight=1 PerfScore 1.75
G_M9330_IG10: ;; offset=0x0092
push rbp
sub rsp, 48
mov rbp, qword ptr [rcx+0x20]
mov qword ptr [rsp+0x20], rbp
lea rbp, [rbp+0x40]
;; size=18 bbWeight=0 PerfScore 0.00
G_M9330_IG11: ;; offset=0x00A4
add rsp, 48
pop rbp
ret
;; size=6 bbWeight=0 PerfScore 0.00

Note that the empty array case is optimized away, but that happens too late for physical promotion, so all the enumerator accesses are stack accesses. There's also an empty fault handler... need to look into why that doesn't get optimized away.

Getting this case to work is just the first step -- after that we need to look at the cases where we have only [...]

Looks like the finally/fault has some residual IR which ends up getting optimized away by RBO. So perhaps it makes sense to rerun the EH opts after global opts?

I worked up some (arguably hacky) changes to flag the ALLOCOBJ in SZArray's GetEnumerator as part of an "empty static pattern" and used that to prune away the control flow when the JIT is able to stack allocate the enumerator object. However, even with this the JIT is unable to promote the enumerator fields, because the enumerator address [...]

So we either need to keep track of where definitions can happen (so local morph can know not to kill certain assertions when entering loops) or try to do a more aggressive address propagation beforehand.
Update JIT and runtime to devirtualize interface calls on arrays over non-shared element types. Shared types are not (yet) handled. Add intrinsic and inlining attributes to key methods in the BCL. This allows the JIT to devirtualize and inline enumerator creation and devirtualize and inline all methods that access the enumerator. And this in turn allows the enumerator to be stack allocated in some simple cases. However, the enumerator fields are not (yet) physically promoted, because of an optimization in the BCL to return a static empty array enumerator. So the object being accessed later is ambiguous. Also, since GDV resolves the virtual call twice and expects to get similar results both times, make this work for the array case by keeping track of the initial devirtualization inputs. Progress towards #62457.
In an example like

[MethodImpl(MethodImplOptions.NoInlining)]
IEnumerable<int> getQ() { ... }
int f3q()
{
    var e = getQ();
    var sum = 0;
    foreach (int i in e) sum += i;
    return sum;
}

Here the jit must rely on GDV to determine the type of the collection. Ideally we'd then clone the loop based on the enumerator type. But the loop created by the [...]

If we modify the "loop has a try" check to instead be that all loop blocks are in the same EH region (which seems more sensible), cloning fails because of the preheader. If we bypass that check, cloning puts blocks in the wrong EH region. But seemingly we can split the preheader edge and get a second preheader in the right region?
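As a source-level analogy for the outcome being aimed at here (one type test on the collection, then a loop specialized for that type), a minimal sketch follows. The names GuardedLoopSketch and Sum are hypothetical, and the JIT would do this on its IR by comparing the object's type handle, not by rewriting C#. The sketch also sidesteps the EH-region problem described above, since the fast path contains no try region at all.

```csharp
using System.Collections.Generic;

public static class GuardedLoopSketch
{
    public static int Sum(IEnumerable<int> e)
    {
        // Guard: a single exact-type test, standing in for the GDV check
        // on the collection.
        if (e.GetType() == typeof(int[]))
        {
            // Specialized loop: the enumerator is gone entirely.
            int[] a = (int[])e;
            int sum = 0;
            for (int i = 0; i < a.Length; i++)
            {
                sum += a[i];
            }
            return sum;
        }

        // Fallback: the general loop, still using interface calls and the
        // compiler-generated try/finally around the enumerator.
        int total = 0;
        foreach (int v in e)
        {
            total += v;
        }
        return total;
    }
}
```

Calling Sum with an int[] takes the guarded path, and calling it with a List<int> takes the fallback. Both return the same result, which is the property loop cloning has to preserve.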
The jit cannot devirtualize interface calls made through arrays, because this aspect of arrays has a special implementation in the runtime. I noticed this a while back but never wrote up anything. I was reminded of this again by #62266.

Consider a simple program like:

Here all types are known and we should be able to de-abstract this and come close to the equivalent for loop.

Currently none of the interface calls devirtualize or inline, the method allocates, and the try/finally can't be optimized away, despite the empty Dispose from the enumerator.

Fixing all this requires resolving a number of issues:

1. resolveVirtualMethodHelper on the jit interface calls CanCastToInterface -- this bypasses special checks in CanCastTo that handle interface methods for arrays.
2. [...] SZArrayHelper.GetEnumerator<T> which is a generic method, not an instance method on a generic class. So the context handling in resolveVirtualMethodHelper needs to be updated here and also over in impDevirtualizeMethod -- currently we always assume class context.
3. [...] GetEnumerator call, but this happens "too late" to devirtualize the subsequent MoveNext and get_Current calls through the enumerator. The fix here is likely to mark these GetEnumerator methods as intrinsics and extend getSpecialIntrinsicExactReturnType to propagate the right type downstream during importation.
4. [...] GetEnumerator call is too big to inline by default. Likely we can just mark this with AggressiveInlining. But pragmatically we might want to boost inlining for methods that return a type (especially a sealed type) that is more derived than the declared return type. Doing so likely requires resolving more tokens during the pre-inline scan, but perhaps we're already doing this for newobj and it might not be too costly.
5. [...] MoveNext and Dispose as we see the improved type flowing out of the inlined method -- but not the call to get_Current -- not sure why just yet.

I will point out that the "equivalent" direct code (with loop non-empty to avoid it being optimized away entirely) is marginally less efficient than the for loop version:

In the more general case where we don't know the exact collection class, we might hope PGO would get us to an equivalent place. But here there are additional challenges.

[...] GetEnumerator GDV test and simplify that path further. So, in the "good" case we do one type test to see what kind of collection we have, and then branch to a specialized loop that knows exactly what kind of enumerator we have.

This is perhaps the simplest case of de-abstraction. Other collection types have more complex enumerators.
category:cq
theme:devirtualization
skill-level:expert
cost:small
impact:medium