Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622

Closed
BruceForstall opened this issue Apr 29, 2020 · 4 comments
Closed
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@BruceForstall
Copy link
Member

BruceForstall commented Apr 29, 2020

The arm64 generated code for Guid::op_Equality() could be better by (1) incorporating the fp address calculation into the ldr addressing modes, and (2) not using stack at all.

The code:

public static bool operator ==(Guid a, Guid b) =>
    a._a == b._a &&
        Unsafe.Add(ref a._a, 1) == Unsafe.Add(ref b._a, 1) &&
        Unsafe.Add(ref a._a, 2) == Unsafe.Add(ref b._a, 2) &&
        Unsafe.Add(ref a._a, 3) == Unsafe.Add(ref b._a, 3);

This code itself is weird, comparing 4 int values instead of comparing field-by-field of one int, two short, and eight byte. It should compare 2 long on 64-bit.

x64 code is pretty direct translation of this C# code.

arm64 first pushes the 2 16-byte struct-in-register-pair arguments to stack, then reloads each 4-byte element one at a time to compare. The base address of the stack locals are computed over and over, instead of being folded into the subsequent addressing modes that add the offset.

x64 assembly
G_M24558_IG01:
                        ;; bbWeight=1    PerfScore 0.00
G_M24558_IG02:
       8B01                 mov      eax, dword ptr [rcx]
       3B02                 cmp      eax, dword ptr [rdx]
       751D                 jne      SHORT G_M24558_IG05
                        ;; bbWeight=1    PerfScore 5.00
G_M24558_IG03:
       8B4104               mov      eax, dword ptr [rcx+4]
       3B4204               cmp      eax, dword ptr [rdx+4]
       7515                 jne      SHORT G_M24558_IG05
       8B4108               mov      eax, dword ptr [rcx+8]
       3B4208               cmp      eax, dword ptr [rdx+8]
       750D                 jne      SHORT G_M24558_IG05
       8B410C               mov      eax, dword ptr [rcx+12]
       3B420C               cmp      eax, dword ptr [rdx+12]
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
                        ;; bbWeight=0.50 PerfScore 7.63
G_M24558_IG04:
       C3                   ret
                        ;; bbWeight=0.50 PerfScore 0.50
G_M24558_IG05:
       33C0                 xor      eax, eax
                        ;; bbWeight=0.50 PerfScore 0.13
G_M24558_IG06:
       C3                   ret
arm64 assembly
G_M24558_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        F90013A0          str     x0, [fp,#32]
        F90017A1          str     x1, [fp,#40]
        F9000BA2          str     x2, [fp,#16]
        F9000FA3          str     x3, [fp,#24]
                        ;; bbWeight=1    PerfScore 5.50
G_M24558_IG02:
        B94023A0          ldr     w0, [fp,#32]
        B94013A1          ldr     w1, [fp,#16]
        6B01001F          cmp     w0, w1
        540002A1          bne     G_M24558_IG05
                        ;; bbWeight=1    PerfScore 5.50
G_M24558_IG03:
        910083A0          add     x0, fp, #32
        B9400400          ldr     w0, [x0,#4]
        910043A1          add     x1, fp, #16
        B9400421          ldr     w1, [x1,#4]
        6B01001F          cmp     w0, w1
        540001E1          bne     G_M24558_IG05
        910083A0          add     x0, fp, #32
        B9400800          ldr     w0, [x0,#8]
        910043A1          add     x1, fp, #16
        B9400821          ldr     w1, [x1,#8]
        6B01001F          cmp     w0, w1
        54000121          bne     G_M24558_IG05
        910083A0          add     x0, fp, #32
        B9400C00          ldr     w0, [x0,#12]
        910043A1          add     x1, fp, #16
        B9400C21          ldr     w1, [x1,#12]
        6B01001F          cmp     w0, w1
        9A9F17E0          cset    x0, eq
                        ;; bbWeight=0.50 PerfScore 12.50
G_M24558_IG04:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                        ;; bbWeight=0.50 PerfScore 1.00
G_M24558_IG05:
        52800000          mov     w0, #0
                        ;; bbWeight=0.50 PerfScore 0.25
G_M24558_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
Possible arm64 assembly after fixing address calculations
G_M24558_IG01:
        stp     fp, lr, [sp,#-48]!
        mov     fp, sp
        str     x0, [fp,#32]
        str     x1, [fp,#40]
        str     x2, [fp,#16]
        str     x3, [fp,#24]
G_M24558_IG02:
        ldr     w0, [fp,#32]
        ldr     w1, [fp,#16]
        cmp     w0, w1
        bne     G_M24558_IG05
G_M24558_IG03:
        ldr     w0, [fp,#36]
        ldr     w1, [fp,#20]
        cmp     w0, w1
        bne     G_M24558_IG05
        ldr     w0, [fp,#40]
        ldr     w1, [fp,#24]
        cmp     w0, w1
        bne     G_M24558_IG05
        ldr     w0, [fp,#44]
        ldr     w1, [fp,#28]
        cmp     w0, w1
        cset    x0, eq
G_M24558_IG04:
        ldp     fp, lr, [sp],#48
        ret     lr
G_M24558_IG05:
        mov     w0, #0
G_M24558_IG06:
        ldp     fp, lr, [sp],#48
        ret     lr

The JIT shouldn't need to put the argument structs on the stack at all. In which case we could generate code like the following (also assuming we can compare full registers, and not 4 bytes at a time).

Possible arm64 assembly fully optimized
G_M24558_IG01:
        stp     fp, lr, [sp,#-16]!
        mov     fp, sp
G_M24558_IG02:
        cmp     x0, x2
        bne     G_M24558_IG05
        cmp     x1, x3
        cset    x0, eq
G_M24558_IG04:
        ldp     fp, lr, [sp],#16
        ret     lr
G_M24558_IG05:
        mov     w0, #0
G_M24558_IG06:
        ldp     fp, lr, [sp],#16
        ret     lr

category:cq
theme:optimization
skill-level:intermediate
cost:medium

@BruceForstall BruceForstall added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization labels Apr 29, 2020
@BruceForstall BruceForstall added this to the 5.0 milestone Apr 29, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2020
@BruceForstall
Copy link
Member Author

cc @kunalspathak

@BruceForstall
Copy link
Member Author

Related: #35635

@BruceForstall
Copy link
Member Author

Changing the Guid:op_Equality implementation to:

public static bool operator ==(Guid a, Guid b) =>
    Unsafe.As<int, long>(ref a._a) == Unsafe.As<int, long>(ref b._a) &&
        Unsafe.As<byte, long>(ref a._d) == Unsafe.As<byte, long>(ref b._d);

fixes the odd address base recalculations and only leaves the unnecessary stack usage.

arm64 assembly with new Guid implementation
G_M51749_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        F90013A0          str     x0, [fp,#32]
        F90017A1          str     x1, [fp,#40]
        F9000BA2          str     x2, [fp,#16]
        F9000FA3          str     x3, [fp,#24]
                        ;; bbWeight=1    PerfScore 5.50
G_M51749_IG02:
        F94013A0          ldr     x0, [fp,#32]
        F9400BA1          ldr     x1, [fp,#16]
        EB01001F          cmp     x0, x1
        540000E1          bne     G_M51749_IG05
                        ;; bbWeight=1    PerfScore 5.50
G_M51749_IG03:
        F94017A0          ldr     x0, [fp,#40]
        F9400FA1          ldr     x1, [fp,#24]
        EB01001F          cmp     x0, x1
        9A9F17E0          cset    x0, eq
                        ;; bbWeight=0.50 PerfScore 2.50
G_M51749_IG04:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                        ;; bbWeight=0.50 PerfScore 1.00
G_M51749_IG05:
        52800000          mov     w0, #0
                        ;; bbWeight=0.50 PerfScore 0.25
G_M51749_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr

BruceForstall added a commit to BruceForstall/runtime that referenced this issue Apr 30, 2020
Current code does four 32-bit comparisons. Instead,
do two 64-bit comparisons. On x86, the JIT-generated
code is slightly different, but equally fast.

This will be even better on arm64 which passes everything
in 64-bit registers, after dotnet#35622 is addressed in the JIT.

Perf results:

x64:

|            Method | Tool |       Mean |     Error |    StdDev |     Median |        Min |        Max | Ratio | RatioSD |
|------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:|
|        EqualsSame | base |   2.322 ns | 0.0210 ns | 0.0187 ns |   2.324 ns |   2.294 ns |   2.351 ns |  1.00 |    0.00 |
|        EqualsSame | diff |   1.547 ns | 0.0092 ns | 0.0071 ns |   1.547 ns |   1.535 ns |   1.559 ns |  0.67 |    0.01 |
|    EqualsOperator | base |   2.890 ns | 0.3896 ns | 0.4169 ns |   3.030 ns |   1.722 ns |   3.074 ns |  1.00 |    0.00 |
|    EqualsOperator | diff |   1.346 ns | 0.0160 ns | 0.0150 ns |   1.346 ns |   1.331 ns |   1.380 ns |  0.49 |    0.12 |
| NotEqualsOperator | base |   1.738 ns | 0.0306 ns | 0.0255 ns |   1.730 ns |   1.712 ns |   1.805 ns |  1.00 |    0.00 |
| NotEqualsOperator | diff |   1.401 ns | 0.0425 ns | 0.0355 ns |   1.389 ns |   1.360 ns |   1.476 ns |  0.81 |    0.02 |

x86:

|            Method | Tool |       Mean |     Error |    StdDev |     Median |        Min |        Max | Ratio | RatioSD |
|------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:|
|        EqualsSame | base |   3.164 ns | 0.0234 ns | 0.0208 ns |   3.159 ns |   3.136 ns |   3.203 ns |  1.00 |    0.00 |
|        EqualsSame | diff |   3.079 ns | 0.0327 ns | 0.0306 ns |   3.074 ns |   3.041 ns |   3.146 ns |  0.97 |    0.01 |
|    EqualsOperator | base |   2.736 ns | 0.0252 ns | 0.0236 ns |   2.726 ns |   2.710 ns |   2.783 ns |  1.00 |    0.00 |
|    EqualsOperator | diff |   2.613 ns | 0.0262 ns | 0.0245 ns |   2.600 ns |   2.589 ns |   2.662 ns |  0.95 |    0.01 |
| NotEqualsOperator | base |   2.708 ns | 0.0096 ns | 0.0080 ns |   2.705 ns |   2.699 ns |   2.723 ns |  1.00 |    0.00 |
| NotEqualsOperator | diff |   2.573 ns | 0.0666 ns | 0.0591 ns |   2.552 ns |   2.526 ns |   2.709 ns |  0.95 |    0.02 |
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label May 4, 2020
@BruceForstall BruceForstall modified the milestones: 5.0, Future May 5, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 14, 2020
@BruceForstall BruceForstall modified the milestones: Future, 6.0.0 Nov 14, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@BruceForstall BruceForstall removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 8, 2021
@BruceForstall BruceForstall modified the milestones: 6.0.0, Future Apr 8, 2021
@kunalspathak
Copy link
Member

The latest code is much better:

; Assembly listing for method System.Guid:EqualsCore(byref,byref):bool
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )   byref  ->   x1         single-def
;* V02 loc0         [V02    ] (  0,  0   )   byref  ->  zero-ref
;* V03 loc1         [V03    ] (  0,  0   )   byref  ->  zero-ref
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;  V05 rat0         [V05,T02] (  3,  6   )  simd16  ->  d16         HFA(simd16)  "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 0

G_M26697_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
                                                ;; size=8 bbWeight=1    PerfScore 1.50
G_M26697_IG02:
            ld1     {v16.16b}, [x0]
            ld1     {v17.16b}, [x1]
            cmeq    v16.16b, v16.16b, v17.16b
            uminp   v16.4s, v16.4s, v16.4s
            umov    x0, v16.d[0]
            cmn     x0, #1
            cset    x0, eq
                                                ;; size=28 bbWeight=1    PerfScore 10.00
G_M26697_IG03:
            ldp     fp, lr, [sp], #0x10
            ret     lr

@ghost ghost locked as resolved and limited conversation to collaborators Oct 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

4 participants