Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815
I couldn't figure out the best area label to add to this issue. If you have write permissions, please help me learn by adding exactly one area label.
@dotnet/jit-contrib
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
Evaluate to see how feasible it would be to combine loads of subsequent fields using ldp. Example:
class Body { public double x, y, z, vx, vy, vz, mass; }
...
foreach (var b in bodies) {
    b.x += dt * b.vx; b.y += dt * b.vy; b.z += dt * b.vz;
}
Below is the code generated for the loop that deals with multiplication of double values:
G_M56457_IG05: ;; offset=0114H
D37D7C43 ubfiz x3, x2, #3, #32
91004063 add x3, x3, #16
F8636803 ldr x3, [x0, x3]
FD400470 ldr d16, [x3,#8] ; <-- #1
FD401071 ldr d17, [x3,#32] ; <-- #2
1E710811 fmul d17, d0, d17
1E712A10 fadd d16, d16, d17
FD000470 str d16, [x3,#8] ; <-- #3
FD400870 ldr d16, [x3,#16] ; <-- #1
FD401471 ldr d17, [x3,#40] ; <-- #2
1E710811 fmul d17, d0, d17
1E712A10 fadd d16, d16, d17
FD000870 str d16, [x3,#16] ; <-- #3
FD400C70 ldr d16, [x3,#24]
FD401871 ldr d17, [x3,#48]
1E710811 fmul d17, d0, d17
1E712A10 fadd d16, d16, d17
FD000C70 str d16, [x3,#24]
11000442 add w2, w2, #1
6B02003F cmp w1, w2
54FFFD8C bgt G_M56457_IG05
Reference: https://godbolt.org/z/9jY5hYnoa
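To connect the C# fields to the offsets in the listing, the following C sketch mirrors Body as a struct. It assumes, based only on the offsets visible in the listing, one 8-byte word at offset 0 (standing in for the method-table pointer) followed by the doubles in declaration order; the names Body and advance are illustrative, and C layout rules are not the CLR's.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C mirror of the C# Body object as laid out in the listing:
   one 8-byte word at offset 0, then the doubles in declaration order
   (x at 8, y at 16, z at 24, vx at 32, vy at 40, vz at 48, mass at 56). */
typedef struct {
    uint64_t method_table; /* stand-in for the object's first word */
    double x, y, z, vx, vy, vz, mass;
} Body;

/* Same arithmetic as the C# foreach loop; bodies[i] plays the role of the
   object reference that `ldr x3, [x0, x3]` loads from the array. */
static void advance(Body **bodies, int count, double dt) {
    for (int i = 0; i < count; i++) {
        Body *b = bodies[i];
        /* x/y (offsets 8/16) and vx/vy (32/40) are adjacent 8-byte slots:
           these are the accesses marked #1, #2 and #3 in the listing. */
        b->x += dt * b->vx;
        b->y += dt * b->vy;
        b->z += dt * b->vz;
    }
}
```

Because each pair of fields sits in consecutive 8-byte slots off the same base register, a single ldp/stp with one offset can cover both.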
Un-assigning myself.
@BruceForstall, please move this to Future if we cannot accommodate it in .NET 7.
@jkotas Is this transformation safe to do under the memory model? I ask because we would be combining two …
@TIHan Yes, it is fine. From https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#order-of-memory-operations: "The effects of ordinary reads and writes can be reordered as long as that preserves single-thread consistency. Such reordering can happen both due to code generation strategy of the compiler or due to weak memory ordering in the hardware."
Might be worth calling this out more explicitly in the memory model doc? We explicitly call out coalescing adjacent reads/writes, and we state "single-thread consistency", but we never actually define what "single-thread consistency" means. Presumably this requires us to know, or be able to assume, that there is no aliasing between the read and the store? That is, I'd presume this would be invalid given two parameters … Does "consistency" then also require no side effects, and so the obvious things like …
cc @VSadov I believe that we are assuming the "single-thread consistency" definition from ECMA-335 I.12.6.4 here: "Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL. For this purpose only volatile operations (including volatile reads) constitute visible side-effects."
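The aliasing concern raised above can be illustrated with a small C sketch; the Pair type and both functions are hypothetical. Pairing the loads of a->x and a->y into one ldp means hoisting the a->y load above an intervening store through b, which preserves single-thread behaviour only when a and b cannot refer to the same object.

```c
/* Hypothetical two-field object; `a` and `b` play the role of two
   parameters that may or may not refer to the same object. */
typedef struct { double x, y; } Pair;

/* Program order: load a->x, store through b, then load a->y. */
static double in_program_order(Pair *a, Pair *b) {
    double x = a->x;
    b->y = 42.0;          /* also writes a->y when a == b */
    double y = a->y;
    return x + y;
}

/* What combining the two loads into one ldp would do: both loads happen
   before the store, i.e. the a->y load is hoisted above b->y = 42.0. */
static double with_loads_paired(Pair *a, Pair *b) {
    double x = a->x;
    double y = a->y;
    b->y = 42.0;
    return x + y;
}
```

With distinct objects both versions return 3.0 for inputs {1.0, 2.0}; with a == b, program order returns 43.0 while the paired version returns 3.0. So the combination is only legal when the store is known not to alias the hoisted load.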
Not sure I understand this last bit:
I would have assumed exceptions and stores are both counted as "visible" as well. For example, let's say you have a … However, if we change this to …
Re: using ldp instead of two ldr:
Yes. If two nonvolatile loads are adjacent, they can certainly be emitted as a single ldp. It is basically just a matter of encoding. The hardware would not guarantee the relative order of the loads anyway. This also applies when the loads are not adjacent but can be moved together without violating the memory model. I think the JIT internally uses a slightly different form of analysis, in terms of intervening accesses, but it is basically a superset of whether the operations can be moved together, just easier to compute in the JIT implementation.
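To make the "matter of encoding" point concrete, here is a sketch of an encoder for the 64-bit FP register form of LDP/STP with a signed offset. The field layout below is our reading of the Arm A64 LDP/STP (SIMD&FP) encoding and should be checked against the Arm Architecture Reference Manual; encode_d_pair is a hypothetical helper, not a JIT function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of LDP/STP for D registers, signed-offset form:
     31-30 opc=01 | 29-27 101 | 26 V=1 | 25-23 010 | 22 L | 21-15 imm7 |
     14-10 Rt2 | 9-5 Rn | 4-0 Rt
   imm7 is the byte offset divided by 8, so the offset must be a multiple
   of 8 in the range -512..504. Returns false if it cannot be encoded. */
static bool encode_d_pair(bool is_load, unsigned rt, unsigned rt2, unsigned rn,
                          int offset, uint32_t *out) {
    if (rt > 31 || rt2 > 31 || rn > 31) return false;
    if (offset % 8 != 0) return false;        /* scaled by the 8-byte size */
    int imm7 = offset / 8;
    if (imm7 < -64 || imm7 > 63) return false;
    uint32_t word = 0x6D000000u;              /* STP Dt1, Dt2, [Xn, #imm] */
    if (is_load) word |= 1u << 22;            /* L=1 turns it into LDP */
    word |= ((uint32_t)imm7 & 0x7Fu) << 15;   /* 7-bit two's complement */
    word |= (uint32_t)rt2 << 10;
    word |= (uint32_t)rn << 5;                /* 31 means SP in this field */
    word |= (uint32_t)rt;
    *out = word;
    return true;
}
```

Under this sketch, ldp d16, d17, [x3, #8] encodes as 0x6D40C470 and the matching stp as 0x6D00C470. The offsets in the listing (#8, #32) are well inside the imm7 range, so encoding is not the limiting factor.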
We have some existing optimizations where we coalesce reads from the same location.
Right, that is what the paragraph I quoted is trying to say: "side-effects and exceptions generated by a …"
Re: exceptions:
For two ordinary ldr reads, the order of the reads is not defined in terms of the values being read. In terms of exceptions, however, the reads happen in program order (speculative reads do not cause out-of-order faults). I believe ldp preserves the order of faults. I did not see anything in the spec like "if one read faults, then whether the other read happens is undefined". The pseudocode given for the LDP semantics seems to indicate that T1 is loaded before T2, and overall ldp looks like just a compact encoding of two subsequent reads. Perhaps it is worth reaching out to the Arm folks to clarify that, just in case? (Interestingly, the result is undefined if T1 and T2 are the same register; I assume that refers mostly to the fetched value, though.)
The effect of the ldp instruction has to happen fully or not at all. If it were not the case, page faults from instructions like …
Right. The pseudocode can also be read as if there are internal buffers that are filled by the reads, and the registers are then filled from those temporaries. I think such a "transactional" implementation would make more sense. For us, it means that if the second read may AV (access violation) while the first read would not, we can't combine the reads.
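The last rule can be phrased as a small legality check. This is a sketch under stated assumptions, not the JIT's logic: FieldLoad and pairing_is_fault_safe are hypothetical, and object_size is assumed to be the statically known size of the object the base refers to (for example 64 bytes for an object with an 8-byte first word and seven doubles, assuming no padding).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a field load: which object reference it
   uses (e.g. a register or SSA value number) and the bytes it reads. */
typedef struct {
    int base;
    size_t offset;
    size_t width;
} FieldLoad;

/* Sketch of "if the second read may AV while the first would not, we
   can't combine the reads", assuming `object_size` is known. */
static bool pairing_is_fault_safe(FieldLoad first, FieldLoad second,
                                  size_t object_size) {
    /* Same reference: if it is null, the first load faults anyway. */
    if (first.base != second.base) return false;
    /* Both ranges inside the object: if the first load succeeds on a valid
       reference, the second one cannot reach unmapped memory either. */
    return first.offset + first.width <= object_size &&
           second.offset + second.width <= object_size;
}
```

For the loop in this issue every paired access is a field of the same Body object, so this condition holds; it would fail, for instance, when the second access lies past the end of the object or goes through a different reference.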
Evaluate to see how feasible it would be to combine loads of subsequent fields using ldp instead of loading them separately at each use. This cannot be done as a peephole optimization; instead, an analysis is needed in earlier phases to look for consecutive field loads and, when found, combine them into a single load. In the code generated for the loop that deals with multiplication of double values, the accesses marked #1, #2 and #3 in the listing can be combined as follows:
#1 can be combined into ldp dX, dY, [x3, #8]
#2 can be combined into ldp dX, dY, [x3, #32]
#3 can be combined into stp dX, dY, [x3, #8]
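A rough sketch of such an analysis, phrased in terms of intervening accesses, is shown below. It is hypothetical and deliberately simplified: registers are abstracted away (as if it ran before register allocation), every access is 8 bytes, accesses through different bases are treated as possibly aliasing, and all accesses on one base are assumed to be fields of the same object, so fault safety is taken for granted.

```c
#include <stdbool.h>

/* Hypothetical model of the memory accesses in one loop iteration. */
typedef enum { MEM_LOAD, MEM_STORE } Kind;
typedef struct {
    Kind kind;
    int base;         /* which object reference is used, e.g. x3 */
    int offset;       /* byte offset from that reference */
    bool is_volatile;
} MemOp;

/* Conservative overlap test: different bases may alias. */
static bool may_overlap(const MemOp *a, const MemOp *b) {
    if (a->base != b->base) return true;
    return a->offset < b->offset + 8 && b->offset < a->offset + 8;
}

/* Can ops[i] and ops[j] (i < j) become one ldp/stp? A load pair is placed
   at i (ops[j] is hoisted); a store pair is placed at j (ops[i] is sunk). */
static bool can_pair(const MemOp *ops, int i, int j) {
    const MemOp *a = &ops[i], *b = &ops[j];
    if (a->kind != b->kind || a->base != b->base) return false;
    if (a->is_volatile || b->is_volatile) return false;
    if (b->offset - a->offset != 8 && a->offset - b->offset != 8) return false;
    const MemOp *moved = (a->kind == MEM_LOAD) ? b : a;
    for (int k = i + 1; k < j; k++) {
        const MemOp *m = &ops[k];
        if (m->is_volatile) return false;  /* stay away from volatiles */
        /* Two loads may pass each other; anything involving a store must
           not touch the bytes of the access being moved. */
        if ((moved->kind == MEM_STORE || m->kind == MEM_STORE) &&
            may_overlap(moved, m))
            return false;
    }
    return true;
}

/* Greedy scan: pair each access with the first legal later partner.
   pair_of[i] receives the partner index or -1; returns the pair count. */
static int find_pairs(const MemOp *ops, int n, int *pair_of) {
    int pairs = 0;
    for (int i = 0; i < n; i++) pair_of[i] = -1;
    for (int i = 0; i < n; i++) {
        if (pair_of[i] >= 0) continue;
        for (int j = i + 1; j < n; j++) {
            if (pair_of[j] < 0 && can_pair(ops, i, j)) {
                pair_of[i] = j;
                pair_of[j] = i;
                pairs++;
                break;
            }
        }
    }
    return pairs;
}
```

Fed the nine accesses of the listing in order (loads at 8, 32, store at 8, loads at 16, 40, store at 16, loads at 24, 48, store at 24), this finds exactly the #1, #2 and #3 pairs and leaves the z/vz accesses unpaired, since their neighbours are already taken. A store through a second, possibly aliasing base between two loads blocks the pairing.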
Reference: https://godbolt.org/z/9jY5hYnoa
category:implementation
theme:codegen
skill-level:intermediate
cost:medium
impact:medium