Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815

Closed
Tracked by #77010
kunalspathak opened this issue Feb 4, 2022 · 17 comments · Fixed by #92768
Closed
Tracked by #77010
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration optimization Priority:3 Work that is nice to have
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Feb 4, 2022

Evaluate to see how feasible it would be to combine loads of subsequent fields using ldp instead of loading them separately during the use. It cannot be considered as a peep-hole optimization, but an analysis is needed in earlier phases to check around for consecutive field loads and if found one, combine them to a single load.

class Body { public double x, y, z, vx, vy, vz, mass; }
...
foreach (var b in bodies) {
    b.x += dt * b.vx; b.y += dt * b.vy; b.z += dt * b.vz;
}

Below code is generated for the loop that deals with multiplication of double.

  • #1 can be combined into ldp dX, dY, [x3, #8]
  • #2 can be combined into ldp dX, dY, [x3, #32]
  • #3 can be combined into stp dX, dY, [x3, #8]
G_M56457_IG05:              ;; offset=0114H
        D37D7C43          ubfiz   x3, x2, #3, #32
        91004063          add     x3, x3, #16
        F8636803          ldr     x3, [x0, x3]
        FD400470          ldr     d16, [x3,#8]   ; <-- #1
        FD401071          ldr     d17, [x3,#32]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000470          str     d16, [x3,#8]   ; <-- #3
        FD400870          ldr     d16, [x3,#16]  ; <-- #1
        FD401471          ldr     d17, [x3,#40]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000870          str     d16, [x3,#16]  ; <-- #3
        FD400C70          ldr     d16, [x3,#24]
        FD401871          ldr     d17, [x3,#48]
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000C70          str     d16, [x3,#24]
        11000442          add     w2, w2, #1
        6B02003F          cmp     w1, w2
        54FFFD8C          bgt     G_M56457_IG05

Reference: https://godbolt.org/z/9jY5hYnoa

category:implementation
theme:codegen
skill-level:intermediate
cost:medium
impact:medium

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 4, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib

@jeffhandley jeffhandley added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 4, 2022
@ghost
Copy link

ghost commented Feb 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Evaluate to see how feasible it would be to combine loads of subsequent fields using ldp instead of loading them separately during the use. It cannot be considered as a peep-hole optimization, but an analysis is needed in earlier phases to check around for consecutive field loads and if found one, combine them to a single load.

class Body { public double x, y, z, vx, vy, vz, mass; }
...
foreach (var b in bodies) {
    b.x += dt * b.vx; b.y += dt * b.vy; b.z += dt * b.vz;
}

Below code is generated for the loop that deals with multiplication of double.

  • #1 can be combined into ldp dX, dY, [x3, #8]
  • #2 can be combined into ldp dX, dY, [x3, #32]
  • #3 can be combined into stp dX, dY, [x3, #8]
G_M56457_IG05:              ;; offset=0114H
        D37D7C43          ubfiz   x3, x2, #3, #32
        91004063          add     x3, x3, #16
        F8636803          ldr     x3, [x0, x3]
        FD400470          ldr     d16, [x3,#8]   ; <-- #1
        FD401071          ldr     d17, [x3,#32]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000470          str     d16, [x3,#8]   ; <-- #3
        FD400870          ldr     d16, [x3,#16]  ; <-- #1
        FD401471          ldr     d17, [x3,#40]  ; <-- #2
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000870          str     d16, [x3,#16]  ; <-- #3
        FD400C70          ldr     d16, [x3,#24]
        FD401871          ldr     d17, [x3,#48]
        1E710811          fmul    d17, d0, d17
        1E712A10          fadd    d16, d16, d17
        FD000C70          str     d16, [x3,#24]
        11000442          add     w2, w2, #1
        6B02003F          cmp     w1, w2
        54FFFD8C          bgt     G_M56457_IG05

Reference: https://godbolt.org/z/9jY5hYnoa

Author: kunalspathak
Assignees: echesakovMSFT
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@BruceForstall BruceForstall added this to the 7.0.0 milestone Feb 8, 2022
@BruceForstall BruceForstall added arch-arm64 and removed untriaged New issue has not been triaged by the area owner labels Feb 8, 2022
@echesakov
Copy link
Contributor

Un-assigning myself
cc @BruceForstall

@echesakov echesakov removed their assignment Mar 15, 2022
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label May 20, 2022
@JulieLeeMSFT
Copy link
Member

@BruceForstall please move to future if we cannot accomodate this in .NET 7.

@BruceForstall BruceForstall modified the milestones: 7.0.0, 8.0.0 May 24, 2022
@BruceForstall BruceForstall removed their assignment May 24, 2022
@TIHan TIHan self-assigned this Mar 17, 2023
@JulieLeeMSFT JulieLeeMSFT added the Priority:3 Work that is nice to have label Mar 20, 2023
@TIHan
Copy link
Contributor

TIHan commented Mar 21, 2023

@jkotas Is this transformation safe to do under the memory model? I ask because we would be combining two ldrs together when they both have a store, str, in-between.

@EgorBo
Copy link
Member

EgorBo commented Mar 21, 2023

@TIHan ldp is practically the same as two subsequent ldr the only real difference that it save 4 bytes of instructions in codegen (might make some hot loop body smaller, etc)

ldp has 2x smaller throughput than ldr

@jkotas
Copy link
Member

jkotas commented Mar 21, 2023

Is this transformation safe to do under the memory model?

Yes, it is fine. From https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#order-of-memory-operations: The effects of ordinary reads and writes can be reordered as long as that preserves single-thread consistency. Such reordering can happen both due to code generation strategy of the compiler or due to weak memory ordering in the hardware.

@tannergooding
Copy link
Member

tannergooding commented Mar 21, 2023

The effects of ordinary reads and writes can be reordered as long as that preserves single-thread consistency

Might be worth calling this out more explicitly in the memory model doc?

We explicitly call out coalescing adjacent reads/writes and we state "single-thread consistency", but never actually define what "single-thread consistency" means.

Presumably this requires us to know or be able to assume that there is no aliasing between the read/store? That is, I'd presume this would be invalid given two parameters ref src and ref dst, since these could alias and therefore the store could modify what was loaded from src. I'd' guess the same would be true for ref T or T* in general in relation to two field reads, since those "could" alias the fields.

Does "consistency" then also require no 'side-effects' and so the obvious things like volatile or barriers would need to block the optimization. But so would things like division, indexing (with unhoisted bounds checks), or dereferences (with unhoisted null checks)?

@jkotas
Copy link
Member

jkotas commented Mar 21, 2023

Might be worth calling this out more explicitly in the memory model doc?
never actually define what "single-thread consistency" means.

cc @VSadov

I believe that we are assuming the "single-thread consistency" definition from ECMA 335 I.12.6.4 here: "Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL. For this purpose only volatile operations (including volatile reads) constitute visible side-effects."

@tannergooding
Copy link
Member

tannergooding commented Mar 21, 2023

I believe that we are assuming the "single-thread consistency" definition from ECMA 335 I.12.6.4 here...

Not sure I understand this last bit:

For this purpose only volatile operations (including volatile reads) constitute visible side-effects."

I would have assumed exceptions and stores are both counted as "visible" as well.

For example, lets say you have a store classA.fieldX, read classA.fieldZ, store classA.fieldY where X/Y are sequential fields. In this case, it's "safe" since we know that the first store classA.fieldX will throw the NRE if classA is null. So it's safe to coalesce store.classA.fieldY alongside it (it can't become null and no single thread aliases, so it can't side effect outside its own store occurring).

However, if we change this to store classA.fieldX, read classB.fieldZ, store classA.fieldY, then the read classB.fieldZ can throw. I would assume that coalescing in this case is invalid since the caller could have a catch (NullReferenceException) handler and then do something based on the state of classA.fieldY. Therefore, moving it before the potentially side effecting read classB.fieldZ would have to be considered "inconsistent" with regards to single thread consistency, no?

@VSadov
Copy link
Member

VSadov commented Mar 21, 2023

Re: using ldp instead of two ldr

Yes. If two nonvolatile loads are adjacent they can certainly be emitted as a single ldp. It is basically just a matter of encoding. Hardware would not guarantee the relative order of the loads anyways.

This is also applicable in cases where the loads are not adjacent, but can be moved together without violating memory model.

I think internally JIT uses slightly different form of analysis - in terms of intervening accesses, but it is basically a superset of whether the operations can be moved together, just easier to compute in the JIT implementation.

@VSadov
Copy link
Member

VSadov commented Mar 21, 2023

We have some existing optimizations where we coalesce reads from the same location.
I think folding two ldr into one ldp would be valid under the same conditions/analysis.

@jkotas
Copy link
Member

jkotas commented Mar 22, 2023

I would have assumed exceptions and stores are both counted as "visible" as well.

RIght, it is what the paragraph that I quoted is trying to say. "side-effects and exceptions generated by a
thread are visible in the order specified by the CIL. For this purpose only volatile operations
(including volatile reads) constitute visible side-effects."

@VSadov
Copy link
Member

VSadov commented Mar 22, 2023

Re: exceptions

For two ordinary ldr reads, the order of reads is not defined - that is in terms of the values that are being read. In terms of exceptions the reads happen in program order (speculative reads do not cause out of order faults).

I believe ldp preserves the order of faults. I did not see in the spec something like "if one read faults, then whether the other read happens is undefined".
If that was the case, I think we might need to think harder about when this optimization is valid.

The pseudocode given for LDP semantics seems to indicate that T1 is loaded before T2 and overall ldp does look like just a compact form of encoding two subsequent reads.

Perhaps it is worth reaching to ARM folks to clarify that, just in case?

(interestingly the result is undefined if T1 and T2 are the same register, I assume it refers mostly to the fetched value though)

@jkotas
Copy link
Member

jkotas commented Mar 22, 2023

The effect of the ldp instruction has to happen fully or not at all. If it was not the case, page faults from instructions like ldp r1, r2, [r1] would be impossible to handle.

@VSadov
Copy link
Member

VSadov commented Mar 22, 2023

Right. The pseudocode can also be read as if there are internal buffers that are filled by the reads and then registers are filled from the temps. I think such "transactional" implementation would make more sense.

For us it means that if the second read may AV while the first read would not, we can't combine the reads.
However, since this is considered specifically for field reads, I think it cannot happen. In type-safe code at least.

@TIHan TIHan modified the milestones: 8.0.0, Future Mar 27, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 6, 2023
@BruceForstall BruceForstall assigned jakobbotsch and unassigned TIHan Oct 6, 2023
@jakobbotsch jakobbotsch modified the milestones: Future, 9.0.0 Oct 6, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Oct 6, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Nov 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration optimization Priority:3 Work that is nice to have
Projects
None yet
Development

Successfully merging a pull request may close this issue.