Remove GT_ASG nodes #10873
Unfortunately getting rid of I'm going to take a closer look to see what needs to be done, if only for the sake of curiosity. I cleaned up some Doing it all at once may be difficult but it's not clear if that's feasible. For a while I thought that it may be good to do this first for SSA phases, but it's not so clear if it's actually simpler, because these still involve morph and the assertion propagation has shared code for local and global propagation. Another way to split the work may be scalar/struct or lclvar/indir. Dealing with scalars first may be easier because it avoids potential first-class struct complications. Dealing with indirs first may be easier because the JIT doesn't do a lot of work with indirs, compared to lclvars. There may be complications with lclvars written to via indirs. For some reason the JIT keeps checking for such cases, instead of simply converting And finally, I've made one boneheaded attempt to generate a |
@dotnet/jit-contrib |
There's been a general desire to import "rationalized" IR and therefore remove the "rationalizer" phase (after which there is no |
@mikedn thanks for opening this issue! The RyuJIT overview mentions the desire to move the Rationalized IR earlier in the JIT: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md#rationalization And actually, I think that eliminating COMMA nodes will be potentially more challenging, but I could be wrong. |
Thanks, I knew I saw some discussion somewhere but couldn't remember where, I thought it was in a PR/issue. That document also mentions
Hmm, I'd put that in a different bucket - side effects and ordering. I used to think that moving the whole JIT to LIR could be a good idea. Now I think that, while it may be better than the current implementation, it's not the best solution - LIR is good for some things and bad for others (try doing expression reassociation; you'll either battle with maintaining the LIR links or "sequence" the whole thing and take a throughput hit). So I have this vague idea of a model where only side effects are maintained in linear order while other nodes roam freely and are referenced only via side effect nodes. But that's another issue, and my current impression is that if we do any of this removing |
I was looking at lclvar nodes and noticed something that may explain how In general, it may be useful to revisit the lclvar node hierarchy as it may be possible to improve some SSA related stuff - either replace the SSA number with a pointer to the SSA def (problematic now because on 64 bit targets it increases the node size) or find room to store an additional SSA number for If something like |
Removing assignment (and
Despite the fact that the assignment is "indirect" the Ideally there should be no indirect store needed. But how to avoid it, add |
Assignment to array elements produces somewhat strange trees:
The LHS of the assignment isn't an IND node, it's a COMMA. The actual IND node is hidden behind the comma. To get rid of ASG we'll need to make it so that the COMMA tree produces the address (as a byref) and then ASG + IND can be easily changed to a STOREIND. We should probably ensure that the ASG LHS is always a "location" node (LCL_VAR, LCL_FLD, IND etc.) before attempting to remove ASG nodes. |
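A minimal sketch of that rewrite on a toy tree, assuming a heavily simplified IR (`Node`, `Oper`, `addressOf` and `asgToStoreInd` are all hypothetical names, not RyuJIT's `GenTree` API): the IND is pulled out from under the COMMA chain so that the COMMA produces the address, and the ASG becomes a STOREIND. Anything whose LHS isn't an indirection is left alone.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, heavily simplified IR; these are not RyuJIT's GenTree types.
enum class Oper { LclVar, Const, Add, BoundsCheck, Ind, Comma, Asg, StoreInd };

struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node {
    Oper oper;
    NodePtr op1;
    NodePtr op2;
    int value; // constant value or local number, depending on oper
};

NodePtr make(Oper oper, NodePtr op1 = nullptr, NodePtr op2 = nullptr, int value = 0) {
    return std::make_shared<Node>(Node{oper, op1, op2, value});
}

// If `loc` is IND(addr), or a COMMA chain ending in IND(addr), returns a tree
// that evaluates the same side effects and then produces addr; otherwise nullptr.
NodePtr addressOf(const NodePtr& loc) {
    if (loc->oper == Oper::Ind) {
        return loc->op1;
    }
    if (loc->oper == Oper::Comma) {
        NodePtr addr = addressOf(loc->op2);
        if (addr != nullptr) {
            return make(Oper::Comma, loc->op1, addr);
        }
    }
    return nullptr;
}

// ASG(loc, value) -> STOREIND(address of loc, value) when loc is an indirection.
NodePtr asgToStoreInd(const NodePtr& asg) {
    if (asg->oper != Oper::Asg) {
        return asg;
    }
    NodePtr addr = addressOf(asg->op1);
    return addr != nullptr ? make(Oper::StoreInd, addr, asg->op2) : asg;
}
```

The bounds check stays as the first operand of the COMMA, so its ordering relative to the store is unchanged.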
There are some COMMA patterns that aren't well handled by CSE, and in my (somewhat dated) experience of changing the block stores to assignments, I ran into some of those. When we transform these array element assignments we'll have to ensure that we preserve CSE's ability to reason about them. |
Speaking of block stores/assignments, assigning a struct to an array element generates a different tree. The LHS of the ASG is a BLK (good), the COMMA produces an address (good), the address is produced by taking the address of a struct IND (kind of weird), the IND doesn't have GTF_DONT_CSE set (scary):
I guess the only thing that prevents CSE from messing with the indir is the fact that it's a struct. Well, at least the assignment part is clear: it simply maps to a STORE_BLK. |
@CarolEidt What do you think about adding a Ideally we'd just add the block size to Potential alternatives:
|
I'm out of the office today and it's hard to look at the sources on my phone ;-) but I'd like to see us split off the variants of LclFld. One that represents an actual field, and from which one can recover the handle info (and therefore the size) and a variant that represents a field access for an opaque or possibly overlapping field access. I'm not crazy about If necessary I think I'd prefer to constrain size and offset over using a large node for the common case. |
Hmm, I know you mentioned this in the past but I don't quite understand what would be the advantage of having a variant of LclFld. To me it seems simpler to do something like:
|
Turns out that the offset is already constrained: https://github.com/dotnet/coreclr/blob/631407852a6332a659a6a57ede670d3e94c81efb/src/jit/morph.cpp#L13142-L13145 So an |
@mikedn - I like the idea of using an offset and an index into a "layout table" (I'm sure there are other places where we separately allocate the layout table, and could instead index into a common table). |
Oh, and to clarify - the reason that I thought it would be useful to have two variants of lclFld would be to distinguish the variant that needs to be handled conservatively (accessing a field through a different type and/or accessing a field of a struct with overlapping fields) from the one that is non-aliasing. While it may be moot for the front-end, where we rely on value numbers, it may be useful in reducing the places (e.g. in morph) where we treat all lclFlds conservatively, as well as downstream where we no longer have valid SSA/value numbers. |
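One way the split could be decided, as a sketch under stated assumptions (`StructDesc`, `FieldDesc`, `LclFldKind` and `classifyLclFld` are hypothetical): an access counts as a precise, non-aliasing field access only if the struct has no overlapping fields and the access matches one declared field exactly, both offset and type. Anything else, such as reading a field through a different type, gets the conservative variant.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of struct metadata; not the JIT's actual data structures.
enum class VarType { Int, Long, Float, Double };

struct FieldDesc {
    unsigned offset;
    VarType type;
};

struct StructDesc {
    std::vector<FieldDesc> fields;
    bool hasOverlappingFields; // e.g. an explicit-layout "union" struct
};

// Hypothetical LCL_FLD variants: a real field vs. an opaque access that must
// be treated conservatively.
enum class LclFldKind { Field, Opaque };

LclFldKind classifyLclFld(const StructDesc& s, unsigned offset, VarType type) {
    if (!s.hasOverlappingFields) {
        for (const FieldDesc& f : s.fields) {
            // Same offset *and* same type: this is exactly one declared field.
            if (f.offset == offset && f.type == type) {
                return LclFldKind::Field;
            }
        }
    }
    // Reinterpreting access, partial/straddling access, or overlapping fields.
    return LclFldKind::Opaque;
}
```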
Hmm, some of these cases could be handled in a different manner. For example, accessing a |
It looks to me like the best place to start is actually GT_BLK/OBJ, GT_OBJ in particular, because it has so much stuff in it that it ends up being a large node. Very early work, just enough to show the approach and survive some diffing (no diffs): mikedn/coreclr@086c7fa
Block size, class handle and gc pointer table are moved to a separate "layout" object and IMO it has a bunch of advantages:
The need for a hashtable might be a disadvantage, but then the number of struct types you encounter in a method is usually very small, so it shouldn't be a problem. And the hashtable lookup cost is likely to be smaller than the cost of the redundant VM calls we can avoid. Once this is done it should be pretty easy to extend it to |
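A sketch of the layout-table idea under stated assumptions (the names `ClassLayout`, `LayoutTable`, `QueryLayoutFn` and `LclFld`, and the 16-bit offset, are hypothetical, not the code in that commit): each struct type is queried from the VM once per method, interned in a hashtable, and nodes then carry only a small layout number.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the VM's opaque class handle.
using ClassHandle = const void*;

// Block size, class handle and GC pointer info, shared by all nodes that refer
// to the same struct type instead of being stored in each node.
struct ClassLayout {
    ClassHandle handle;
    unsigned size;
    std::vector<std::uint8_t> gcPtrs; // one entry per pointer-sized slot
};

// Hypothetical VM query; in the JIT this would be one or more VM calls.
using QueryLayoutFn = std::function<ClassLayout(ClassHandle)>;

// Per-method table: each struct type is queried once, then found by hash lookup.
class LayoutTable {
public:
    explicit LayoutTable(QueryLayoutFn query) : m_query(std::move(query)) {}

    unsigned getLayoutNum(ClassHandle handle) {
        auto it = m_nums.find(handle);
        if (it != m_nums.end()) {
            return it->second;
        }
        unsigned num = static_cast<unsigned>(m_layouts.size());
        m_layouts.push_back(m_query(handle));
        m_nums.emplace(handle, num);
        return num;
    }

    const ClassLayout& getLayout(unsigned num) const { return m_layouts[num]; }

private:
    QueryLayoutFn m_query;
    std::vector<ClassLayout> m_layouts;
    std::unordered_map<ClassHandle, unsigned> m_nums;
};

// A struct-typed LCL_FLD then needs only an offset and a layout number
// (a 16-bit offset is an assumption for illustration).
struct LclFld {
    unsigned lclNum;
    std::uint16_t offset;
    unsigned layoutNum;
};
```

Storing an index rather than a pointer keeps the node small on 64-bit targets, which is the same concern raised earlier about SSA def pointers.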
Yes, I think that, in general, using |
I'm more concerned about the backend. I'm not sure if and how we'll be able to handle something like BITCAST between |
Now, if we allow Looks like I'll end up trying to do this kind of stuff in |
@CarolEidt Since It's quite trivial. Saves ~1KB in corelib. The only problem will be adding support for long-double conversion on x86 (and ARM32 I guess). |
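For reference, BITCAST reinterprets a value's bits as another type of the same size. Unlike a conversion cast, no value changes. A portable C++ sketch of the long/double case follows. The function names are hypothetical, and `std::memcpy` stands in for `std::bit_cast` so that the code doesn't require C++20. .NET's `BitConverter.DoubleToInt64Bits` and `Int64BitsToDouble` have these semantics.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same bits, different type: what a BITCAST between long and double computes.
std::int64_t doubleToInt64Bits(double d) {
    static_assert(sizeof(double) == sizeof(std::int64_t), "size mismatch");
    std::int64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

double int64BitsToDouble(std::int64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

On x86 the 64-bit integer side lives in a register pair, which is presumably why the long/double case needs extra support there.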
Would a similar optimization around |
For what types? |
I wasn't thinking about hardcoding it to particular types, but rather seeing if a more generalized option would be possible (any two structs that are both blittable and the same size is likely a good basis). S.P.Corelib (and CoreFX) use
|
Well, with some exceptions, structs aren't enregistrable. Since they're already in memory, reinterpretation shouldn't have much impact on the generated code. That said, I don't remember looking at such cases, so perhaps there are some improvements that could be made.
Perhaps that was about primitives and not structs? The problem with primitives is that the current handling of reinterpretation (by using |
There is an interesting case that BITCAST (well, at least in its current form) can't handle and that came up in "real world" (Decimal) code:

```csharp
struct Pair { public int lo, hi; }
…
long l = *(long*)&pair.lo;
```

Today this forces the struct into memory, which is not ideal. The JIT uses a Trouble is, it's not that simple to decide when it's better to spill the struct to memory so you can simply load the

```asm
mov ecx, eax  ; eax contains lo
shl rdx, 32   ; edx contains hi
or  rcx, rdx  ; rcx contains the result
```

If the values aren't enregistered then the code gets a bit messier, as you may end up with 2 loads instead of the original single load:

```asm
mov ecx, dword ptr [lo]
mov edx, dword ptr [hi]
shl rdx, 32
or  rcx, rdx
```

instead of just

```asm
mov rcx, qword ptr [lo]
```

There will be trade-offs to be made. Now, all this is a bit orthogonal to this issue -

```csharp
*(long*)&pair.lo = l;
```

And end up with similar drawbacks:

```asm
mov dword ptr [lo], eax
shr rax, 32
mov dword ptr [hi], eax
```

instead of a single And it gets even more complicated if you have more than 2 fields (e.g. you could read an |
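A sketch of what the register-only sequences compute, assuming a little-endian target such as x64 and a `Pair` without padding (`combine` and `split` are hypothetical names). `combine` corresponds to the load `*(long*)&pair.lo`, and `split` corresponds to the store `*(long*)&pair.lo = l`. In both directions the struct never has to be in memory.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Pair { std::int32_t lo, hi; };

// Mirrors "mov ecx, eax / shl rdx, 32 / or rcx, rdx".
std::int64_t combine(std::int32_t lo, std::int32_t hi) {
    std::uint64_t result = static_cast<std::uint32_t>(lo); // zero-extend lo
    result |= static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32;
    return static_cast<std::int64_t>(result);
}

// Mirrors "mov [lo], eax / shr rax, 32 / mov [hi], eax".
Pair split(std::int64_t l) {
    std::uint64_t bits = static_cast<std::uint64_t>(l);
    Pair p;
    p.lo = static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
    p.hi = static_cast<std::int32_t>(static_cast<std::uint32_t>(bits >> 32));
    return p;
}
```

Whether this beats a single 64-bit memory access depends on whether `lo` and `hi` are already in registers, which is the trade-off described above.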
Going back to struct typed The main problem is where exactly to produce such struct typed |
The example from Decimal is probably optimal with direct memory accesses because the structs are always in memory anyway (often passed in |
Yes but the general case can be more complicated. You could use the And then there's also the pesky problem of the SYS V calling convention that stuffs 2
Actually a 4 Hmm, that's actually something I should pay attention to. If you have 4 |
And of course, this isn't actually SYS V specific. It happens on Windows as well, except that there it only affects structs of size <= 8 (e.g. on Linux The not so fortunate effects of the JIT's limitation in this area can be seen in the recently added

```asm
mov qword ptr [rsp+10H], rdx
mov eax, dword ptr [rsp+10H] ; GT_LCL_FLD
mov edx, dword ptr [rcx]
cmp eax, edx
jne SHORT G_M32203_IG04
mov eax, dword ptr [rsp+14H] ; GT_LCL_FLD
mov edx, dword ptr [rcx+4]
cmp eax, edx
```

The ideal codegen would be something like:

```asm
cmp dword ptr [rcx], edx
jne SHORT G_M32203_IG04
shr rdx, 32
cmp dword ptr [rcx+4], edx
```

@CarolEidt Do you know if there's any open issue tracking such problems? Should there be? This isn't directly related to |
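A sketch of the "ideal" sequence, assuming an 8-byte struct of two ints passed in a single 64-bit register (`packed`, with `lo` in bits 0-31) and compared field by field with a struct in memory. `equalsPacked` is a hypothetical name. The fields are extracted from the register directly, with no spill to the stack. The shift has to act on the full 64-bit value: on x86 a 32-bit shift count is masked to 5 bits, so shifting a 32-bit register by 32 does nothing.

```cpp
#include <cassert>
#include <cstdint>

struct Pair { std::int32_t lo, hi; };

bool equalsPacked(const Pair& mem, std::uint64_t packed) {
    // cmp dword ptr [rcx], edx: compare the low half of the register
    if (mem.lo != static_cast<std::int32_t>(static_cast<std::uint32_t>(packed))) {
        return false;
    }
    // shr rdx, 32: bring the high half down, then compare it
    packed >>= 32;
    return mem.hi == static_cast<std::int32_t>(static_cast<std::uint32_t>(packed));
}
```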
This came up occasionally in discussions, but AFAIK there's no issue associated with it, and IMO there should be one due to the significant (negative) impact `GT_ASG` nodes have on RyuJIT's IR. IMO it's one of the worst IR "features", if not plain and simple the worst. So far I've only encountered cons and no pros:

- `GT_LCL_VAR` on the RHS is a use but on the LHS is a def. The JIT uses various means to deal with this issue - `GTF_VAR_DEF`, `GTF_IND_ASG_LHS`, `gtGetParent()` etc.
- `GT_LCL_VAR` node on the LHS, from there you need to use `gtGetParent()` to get the `GT_ASG` node and then get the assignment's second operand. If you're lucky the LHS follows the assignment, otherwise `gtGetParent()` will need to traverse multiple nodes.
- `mov eax, ebx` is assignment, but that's only true if `eax` and `ebx` happen to be enregistered variables. Otherwise you'd be looking at `mov [eax], ebx`, which is more similar to a store.

Does anyone know of any advantages?
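A toy sketch contrasting the two shapes for the first two points (hypothetical types and functions; `GT_STORE_LCL_VAR` is the store-style node RyuJIT already uses after rationalization, but this code does not model it faithfully). With ASG, deciding whether a `LCL_VAR` is a def and finding the stored value both go through the parent. With a store node, the node itself is the def and the value is its operand.

```cpp
#include <cassert>

// Hypothetical, minimal IR; not RyuJIT's GenTree.
enum class Oper { LclVar, Const, Add, Asg, StoreLclVar };

struct Node {
    Oper oper;
    Node* op1;
    Node* op2;
    Node* parent; // needed by the ASG-form queries below
};

// ASG form: a LCL_VAR is a def only by virtue of being op1 of an ASG parent.
bool isDefAsgForm(const Node& lclVar) {
    return lclVar.oper == Oper::LclVar && lclVar.parent != nullptr &&
           lclVar.parent->oper == Oper::Asg && lclVar.parent->op1 == &lclVar;
}

// ASG form: the stored value is the def's sibling, reached via the parent.
const Node* storedValueAsgForm(const Node& lclVar) {
    return isDefAsgForm(lclVar) ? lclVar.parent->op2 : nullptr;
}

// Store form: the node is the def and op1 is the value; every LCL_VAR is a use.
bool isDefStoreForm(const Node& node) { return node.oper == Oper::StoreLclVar; }

const Node* storedValueStoreForm(const Node& node) {
    return isDefStoreForm(node) ? node.op1 : nullptr;
}
```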
AFAIR Phoenix did have `Assign`, but its IR was very different, looking more like assembly. I know next to nothing about LLVM, but I don't think it has assignment.

category:implementation
theme:ir
skill-level:expert
cost:extra-large