Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815

kunalspathak · 2022-02-04T15:59:14Z

Evaluate to see how feasible it would be to combine loads of subsequent fields using ldp instead of loading them separately during the use. It cannot be considered as a peep-hole optimization, but an analysis is needed in earlier phases to check around for consecutive field loads and if found one, combine them to a single load.

class Body { public double x, y, z, vx, vy, vz, mass; }
...
foreach (var b in bodies) {
    b.x += dt * b.vx; b.y += dt * b.vy; b.z += dt * b.vz;
}

Below code is generated for the loop that deals with multiplication of double.

#1 can be combined into ldp dX, dY, [x3, #8]
#2 can be combined into ldp dX, dY, [x3, #32]
#3 can be combined into stp dX, dY, [x3, #8]

G_M56457_IG05:              ;; offset=0114H
        D37D7C43          ubfiz   x3, x2, #3, #32
        91004063          add     x3, x3, #16
        F8636803          ldr     x3, [x0, x3]
        FD400470          ldr     d16, [x3,#8]   ; <-- #1
        FD401071          ldr     d17, [x3,#32]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000470          str     d16, [x3,#8]   ; <-- #3
        FD400870          ldr     d16, [x3,#16]  ; <-- #1
        FD401471          ldr     d17, [x3,#40]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000870          str     d16, [x3,#16]  ; <-- #3
        FD400C70          ldr     d16, [x3,#24]
        FD401871          ldr     d17, [x3,#48]
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000C70          str     d16, [x3,#24]
        11000442          add     w2, w2, #1
        6B02003F          cmp     w1, w2
        54FFFD8C          bgt     G_M56457_IG05

Reference: https://godbolt.org/z/9jY5hYnoa

category:implementation
theme:codegen
skill-level:intermediate
cost:medium
impact:medium

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2022-02-04T15:59:17Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

kunalspathak · 2022-02-04T15:59:30Z

@dotnet/jit-contrib

ghost · 2022-02-04T19:55:55Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Evaluate to see how feasible it would be to combine loads of subsequent fields using ldp instead of loading them separately during the use. It cannot be considered as a peep-hole optimization, but an analysis is needed in earlier phases to check around for consecutive field loads and if found one, combine them to a single load.

class Body { public double x, y, z, vx, vy, vz, mass; }
...
foreach (var b in bodies) {
    b.x += dt * b.vx; b.y += dt * b.vy; b.z += dt * b.vz;
}

Below code is generated for the loop that deals with multiplication of double.

#1 can be combined into ldp dX, dY, [x3, #8]
#2 can be combined into ldp dX, dY, [x3, #32]
#3 can be combined into stp dX, dY, [x3, #8]

G_M56457_IG05:              ;; offset=0114H
        D37D7C43          ubfiz   x3, x2, #3, #32
        91004063          add     x3, x3, #16
        F8636803          ldr     x3, [x0, x3]
        FD400470          ldr     d16, [x3,#8]   ; <-- #1
        FD401071          ldr     d17, [x3,#32]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000470          str     d16, [x3,#8]   ; <-- #3
        FD400870          ldr     d16, [x3,#16]  ; <-- #1
        FD401471          ldr     d17, [x3,#40]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000870          str     d16, [x3,#16]  ; <-- #3
        FD400C70          ldr     d16, [x3,#24]
        FD401871          ldr     d17, [x3,#48]
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000C70          str     d16, [x3,#24]
        11000442          add     w2, w2, #1
        6B02003F          cmp     w1, w2
        54FFFD8C          bgt     G_M56457_IG05

Reference: https://godbolt.org/z/9jY5hYnoa

Author:	kunalspathak
Assignees:	echesakovMSFT
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

echesakov · 2022-03-15T22:05:47Z

Un-assigning myself
cc @BruceForstall

JulieLeeMSFT · 2022-05-20T20:08:40Z

@BruceForstall please move to future if we cannot accomodate this in .NET 7.

TIHan · 2023-03-21T19:56:48Z

@jkotas Is this transformation safe to do under the memory model? I ask because we would be combining two ldrs together when they both have a store, str, in-between.

EgorBo · 2023-03-21T20:06:42Z

@TIHan ldp is practically the same as two subsequent ldr the only real difference that it save 4 bytes of instructions in codegen (might make some hot loop body smaller, etc)

ldp has 2x smaller throughput than ldr

jkotas · 2023-03-21T20:08:57Z

Is this transformation safe to do under the memory model?

Yes, it is fine. From https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#order-of-memory-operations: The effects of ordinary reads and writes can be reordered as long as that preserves single-thread consistency. Such reordering can happen both due to code generation strategy of the compiler or due to weak memory ordering in the hardware.

tannergooding · 2023-03-21T20:34:35Z

The effects of ordinary reads and writes can be reordered as long as that preserves single-thread consistency

Might be worth calling this out more explicitly in the memory model doc?

We explicitly call out coalescing adjacent reads/writes and we state "single-thread consistency", but never actually define what "single-thread consistency" means.

Presumably this requires us to know or be able to assume that there is no aliasing between the read/store? That is, I'd presume this would be invalid given two parameters ref src and ref dst, since these could alias and therefore the store could modify what was loaded from src. I'd' guess the same would be true for ref T or T* in general in relation to two field reads, since those "could" alias the fields.

Does "consistency" then also require no 'side-effects' and so the obvious things like volatile or barriers would need to block the optimization. But so would things like division, indexing (with unhoisted bounds checks), or dereferences (with unhoisted null checks)?

jkotas · 2023-03-21T20:56:02Z

Might be worth calling this out more explicitly in the memory model doc?
never actually define what "single-thread consistency" means.

cc @VSadov

I believe that we are assuming the "single-thread consistency" definition from ECMA 335 I.12.6.4 here: "Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL. For this purpose only volatile operations (including volatile reads) constitute visible side-effects."

tannergooding · 2023-03-21T21:15:14Z

I believe that we are assuming the "single-thread consistency" definition from ECMA 335 I.12.6.4 here...

Not sure I understand this last bit:

For this purpose only volatile operations (including volatile reads) constitute visible side-effects."

I would have assumed exceptions and stores are both counted as "visible" as well.

For example, lets say you have a store classA.fieldX, read classA.fieldZ, store classA.fieldY where X/Y are sequential fields. In this case, it's "safe" since we know that the first store classA.fieldX will throw the NRE if classA is null. So it's safe to coalesce store.classA.fieldY alongside it (it can't become null and no single thread aliases, so it can't side effect outside its own store occurring).

However, if we change this to store classA.fieldX, read classB.fieldZ, store classA.fieldY, then the read classB.fieldZ can throw. I would assume that coalescing in this case is invalid since the caller could have a catch (NullReferenceException) handler and then do something based on the state of classA.fieldY. Therefore, moving it before the potentially side effecting read classB.fieldZ would have to be considered "inconsistent" with regards to single thread consistency, no?

VSadov · 2023-03-21T22:07:09Z

Re: using ldp instead of two ldr

Yes. If two nonvolatile loads are adjacent they can certainly be emitted as a single ldp. It is basically just a matter of encoding. Hardware would not guarantee the relative order of the loads anyways.

This is also applicable in cases where the loads are not adjacent, but can be moved together without violating memory model.

I think internally JIT uses slightly different form of analysis - in terms of intervening accesses, but it is basically a superset of whether the operations can be moved together, just easier to compute in the JIT implementation.

VSadov · 2023-03-21T22:14:49Z

We have some existing optimizations where we coalesce reads from the same location.
I think folding two ldr into one ldp would be valid under the same conditions/analysis.

jkotas · 2023-03-22T01:11:54Z

I would have assumed exceptions and stores are both counted as "visible" as well.

RIght, it is what the paragraph that I quoted is trying to say. "side-effects and exceptions generated by a
thread are visible in the order specified by the CIL. For this purpose only volatile operations
(including volatile reads) constitute visible side-effects."

VSadov · 2023-03-22T03:37:01Z

Re: exceptions

For two ordinary ldr reads, the order of reads is not defined - that is in terms of the values that are being read. In terms of exceptions the reads happen in program order (speculative reads do not cause out of order faults).

I believe ldp preserves the order of faults. I did not see in the spec something like "if one read faults, then whether the other read happens is undefined".
If that was the case, I think we might need to think harder about when this optimization is valid.

The pseudocode given for LDP semantics seems to indicate that T1 is loaded before T2 and overall ldp does look like just a compact form of encoding two subsequent reads.

Perhaps it is worth reaching to ARM folks to clarify that, just in case?

(interestingly the result is undefined if T1 and T2 are the same register, I assume it refers mostly to the fetched value though)

jkotas · 2023-03-22T05:25:28Z

The effect of the ldp instruction has to happen fully or not at all. If it was not the case, page faults from instructions like ldp r1, r2, [r1] would be impossible to handle.

VSadov · 2023-03-22T19:11:52Z

Right. The pseudocode can also be read as if there are internal buffers that are filled by the reads and then registers are filled from the temps. I think such "transactional" implementation would make more sense.

For us it means that if the second read may AV while the first read would not, we can't combine the reads.
However, since this is considered specifically for field reads, I think it cannot happen. In type-safe code at least.

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 4, 2022

kunalspathak assigned echesakov Feb 4, 2022

kunalspathak mentioned this issue Feb 4, 2022

Improving ARM64 Performance in .NET 7.0 #64820

Closed

32 tasks

jeffhandley added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 4, 2022

BruceForstall added this to the 7.0.0 milestone Feb 8, 2022

BruceForstall added arch-arm64 and removed untriaged New issue has not been triaged by the area owner labels Feb 8, 2022

echesakov removed their assignment Mar 15, 2022

BruceForstall added the optimization label Mar 16, 2022

JulieLeeMSFT assigned BruceForstall May 20, 2022

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label May 20, 2022

BruceForstall modified the milestones: 7.0.0, 8.0.0 May 24, 2022

BruceForstall removed their assignment May 24, 2022

kunalspathak mentioned this issue Oct 13, 2022

Improving Arm64 Performance in .NET 8.0 #77010

Closed

28 tasks

TIHan self-assigned this Mar 17, 2023

JulieLeeMSFT added the Priority:3 Work that is nice to have label Mar 20, 2023

TIHan modified the milestones: 8.0.0, Future Mar 27, 2023

kunalspathak mentioned this issue Sep 29, 2023

JIT: Reorder indirs on arm64 to make them amenable to ldp optimization #92768

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 6, 2023

BruceForstall assigned jakobbotsch and unassigned TIHan Oct 6, 2023

jakobbotsch modified the milestones: Future, 9.0.0 Oct 6, 2023

jakobbotsch mentioned this issue Oct 6, 2023

JIT: Consider reordering loads on ARM64 to make them amenable to ldp optimization #92756

Closed

jakobbotsch closed this as completed in #92768 Oct 6, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Oct 6, 2023

ghost locked as resolved and limited conversation to collaborators Nov 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815

Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815

kunalspathak commented Feb 4, 2022 •

edited by BruceForstall

Loading

dotnet-issue-labeler bot commented Feb 4, 2022

kunalspathak commented Feb 4, 2022

ghost commented Feb 4, 2022

echesakov commented Mar 15, 2022

JulieLeeMSFT commented May 20, 2022

TIHan commented Mar 21, 2023 •

edited

Loading

EgorBo commented Mar 21, 2023 •

edited

Loading

jkotas commented Mar 21, 2023

tannergooding commented Mar 21, 2023 •

edited

Loading

jkotas commented Mar 21, 2023

tannergooding commented Mar 21, 2023 •

edited

Loading

VSadov commented Mar 21, 2023

VSadov commented Mar 21, 2023

jkotas commented Mar 22, 2023

VSadov commented Mar 22, 2023

jkotas commented Mar 22, 2023

VSadov commented Mar 22, 2023

Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815

Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815

Comments

kunalspathak commented Feb 4, 2022 • edited by BruceForstall Loading

dotnet-issue-labeler bot commented Feb 4, 2022

kunalspathak commented Feb 4, 2022

ghost commented Feb 4, 2022

echesakov commented Mar 15, 2022

JulieLeeMSFT commented May 20, 2022

TIHan commented Mar 21, 2023 • edited Loading

EgorBo commented Mar 21, 2023 • edited Loading

jkotas commented Mar 21, 2023

tannergooding commented Mar 21, 2023 • edited Loading

jkotas commented Mar 21, 2023

tannergooding commented Mar 21, 2023 • edited Loading

VSadov commented Mar 21, 2023

VSadov commented Mar 21, 2023

jkotas commented Mar 22, 2023

VSadov commented Mar 22, 2023

jkotas commented Mar 22, 2023

VSadov commented Mar 22, 2023

kunalspathak commented Feb 4, 2022 •

edited by BruceForstall

Loading

TIHan commented Mar 21, 2023 •

edited

Loading

EgorBo commented Mar 21, 2023 •

edited

Loading

tannergooding commented Mar 21, 2023 •

edited

Loading

tannergooding commented Mar 21, 2023 •

edited

Loading