[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622

BruceForstall · 2020-04-29T19:50:01Z

The arm64 generated code for Guid::op_Equality() could be better by (1) incorporating the fp address calculation into the ldr addressing modes, and (2) not using stack at all.

The code:

public static bool operator ==(Guid a, Guid b) =>
    a._a == b._a &&
        Unsafe.Add(ref a._a, 1) == Unsafe.Add(ref b._a, 1) &&
        Unsafe.Add(ref a._a, 2) == Unsafe.Add(ref b._a, 2) &&
        Unsafe.Add(ref a._a, 3) == Unsafe.Add(ref b._a, 3);

This code itself is weird, comparing 4 int values instead of comparing field-by-field of one int, two short, and eight byte. It should compare 2 long on 64-bit.

x64 code is pretty direct translation of this C# code.

arm64 first pushes the 2 16-byte struct-in-register-pair arguments to stack, then reloads each 4-byte element one at a time to compare. The base address of the stack locals are computed over and over, instead of being folded into the subsequent addressing modes that add the offset.

x64 assembly

G_M24558_IG01:
                        ;; bbWeight=1    PerfScore 0.00
G_M24558_IG02:
       8B01                 mov      eax, dword ptr [rcx]
       3B02                 cmp      eax, dword ptr [rdx]
       751D                 jne      SHORT G_M24558_IG05
                        ;; bbWeight=1    PerfScore 5.00
G_M24558_IG03:
       8B4104               mov      eax, dword ptr [rcx+4]
       3B4204               cmp      eax, dword ptr [rdx+4]
       7515                 jne      SHORT G_M24558_IG05
       8B4108               mov      eax, dword ptr [rcx+8]
       3B4208               cmp      eax, dword ptr [rdx+8]
       750D                 jne      SHORT G_M24558_IG05
       8B410C               mov      eax, dword ptr [rcx+12]
       3B420C               cmp      eax, dword ptr [rdx+12]
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
                        ;; bbWeight=0.50 PerfScore 7.63
G_M24558_IG04:
       C3                   ret
                        ;; bbWeight=0.50 PerfScore 0.50
G_M24558_IG05:
       33C0                 xor      eax, eax
                        ;; bbWeight=0.50 PerfScore 0.13
G_M24558_IG06:
       C3                   ret

arm64 assembly

G_M24558_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        F90013A0          str     x0, [fp,#32]
        F90017A1          str     x1, [fp,#40]
        F9000BA2          str     x2, [fp,#16]
        F9000FA3          str     x3, [fp,#24]
                        ;; bbWeight=1    PerfScore 5.50
G_M24558_IG02:
        B94023A0          ldr     w0, [fp,#32]
        B94013A1          ldr     w1, [fp,#16]
        6B01001F          cmp     w0, w1
        540002A1          bne     G_M24558_IG05
                        ;; bbWeight=1    PerfScore 5.50
G_M24558_IG03:
        910083A0          add     x0, fp, #32
        B9400400          ldr     w0, [x0,#4]
        910043A1          add     x1, fp, #16
        B9400421          ldr     w1, [x1,#4]
        6B01001F          cmp     w0, w1
        540001E1          bne     G_M24558_IG05
        910083A0          add     x0, fp, #32
        B9400800          ldr     w0, [x0,#8]
        910043A1          add     x1, fp, #16
        B9400821          ldr     w1, [x1,#8]
        6B01001F          cmp     w0, w1
        54000121          bne     G_M24558_IG05
        910083A0          add     x0, fp, #32
        B9400C00          ldr     w0, [x0,#12]
        910043A1          add     x1, fp, #16
        B9400C21          ldr     w1, [x1,#12]
        6B01001F          cmp     w0, w1
        9A9F17E0          cset    x0, eq
                        ;; bbWeight=0.50 PerfScore 12.50
G_M24558_IG04:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                        ;; bbWeight=0.50 PerfScore 1.00
G_M24558_IG05:
        52800000          mov     w0, #0
                        ;; bbWeight=0.50 PerfScore 0.25
G_M24558_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr

Possible arm64 assembly after fixing address calculations

G_M24558_IG01:
        stp     fp, lr, [sp,#-48]!
        mov     fp, sp
        str     x0, [fp,#32]
        str     x1, [fp,#40]
        str     x2, [fp,#16]
        str     x3, [fp,#24]
G_M24558_IG02:
        ldr     w0, [fp,#32]
        ldr     w1, [fp,#16]
        cmp     w0, w1
        bne     G_M24558_IG05
G_M24558_IG03:
        ldr     w0, [fp,#36]
        ldr     w1, [fp,#20]
        cmp     w0, w1
        bne     G_M24558_IG05
        ldr     w0, [fp,#40]
        ldr     w1, [fp,#24]
        cmp     w0, w1
        bne     G_M24558_IG05
        ldr     w0, [fp,#44]
        ldr     w1, [fp,#28]
        cmp     w0, w1
        cset    x0, eq
G_M24558_IG04:
        ldp     fp, lr, [sp],#48
        ret     lr
G_M24558_IG05:
        mov     w0, #0
G_M24558_IG06:
        ldp     fp, lr, [sp],#48
        ret     lr

The JIT shouldn't need to put the argument structs on the stack at all. In which case we could generate code like the following (also assuming we can compare full registers, and not 4 bytes at a time).

Possible arm64 assembly fully optimized

G_M24558_IG01:
        stp     fp, lr, [sp,#-16]!
        mov     fp, sp
G_M24558_IG02:
        cmp     x0, x2
        bne     G_M24558_IG05
        cmp     x1, x3
        cset    x0, eq
G_M24558_IG04:
        ldp     fp, lr, [sp],#16
        ret     lr
G_M24558_IG05:
        mov     w0, #0
G_M24558_IG06:
        ldp     fp, lr, [sp],#16
        ret     lr

category:cq
theme:optimization
skill-level:intermediate
cost:medium

The text was updated successfully, but these errors were encountered:

BruceForstall · 2020-04-29T19:50:13Z

cc @kunalspathak

BruceForstall · 2020-04-30T00:42:19Z

Related: #35635

BruceForstall · 2020-04-30T00:45:42Z

Changing the Guid:op_Equality implementation to:

public static bool operator ==(Guid a, Guid b) =>
    Unsafe.As<int, long>(ref a._a) == Unsafe.As<int, long>(ref b._a) &&
        Unsafe.As<byte, long>(ref a._d) == Unsafe.As<byte, long>(ref b._d);

fixes the odd address base recalculations and only leaves the unnecessary stack usage.

arm64 assembly with new Guid implementation

G_M51749_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        F90013A0          str     x0, [fp,#32]
        F90017A1          str     x1, [fp,#40]
        F9000BA2          str     x2, [fp,#16]
        F9000FA3          str     x3, [fp,#24]
                        ;; bbWeight=1    PerfScore 5.50
G_M51749_IG02:
        F94013A0          ldr     x0, [fp,#32]
        F9400BA1          ldr     x1, [fp,#16]
        EB01001F          cmp     x0, x1
        540000E1          bne     G_M51749_IG05
                        ;; bbWeight=1    PerfScore 5.50
G_M51749_IG03:
        F94017A0          ldr     x0, [fp,#40]
        F9400FA1          ldr     x1, [fp,#24]
        EB01001F          cmp     x0, x1
        9A9F17E0          cset    x0, eq
                        ;; bbWeight=0.50 PerfScore 2.50
G_M51749_IG04:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                        ;; bbWeight=0.50 PerfScore 1.00
G_M51749_IG05:
        52800000          mov     w0, #0
                        ;; bbWeight=0.50 PerfScore 0.25
G_M51749_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr

Current code does four 32-bit comparisons. Instead, do two 64-bit comparisons. On x86, the JIT-generated code is slightly different, but equally fast. This will be even better on arm64 which passes everything in 64-bit registers, after dotnet#35622 is addressed in the JIT. Perf results: x64: | Method | Tool | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | |------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:| | EqualsSame | base | 2.322 ns | 0.0210 ns | 0.0187 ns | 2.324 ns | 2.294 ns | 2.351 ns | 1.00 | 0.00 | | EqualsSame | diff | 1.547 ns | 0.0092 ns | 0.0071 ns | 1.547 ns | 1.535 ns | 1.559 ns | 0.67 | 0.01 | | EqualsOperator | base | 2.890 ns | 0.3896 ns | 0.4169 ns | 3.030 ns | 1.722 ns | 3.074 ns | 1.00 | 0.00 | | EqualsOperator | diff | 1.346 ns | 0.0160 ns | 0.0150 ns | 1.346 ns | 1.331 ns | 1.380 ns | 0.49 | 0.12 | | NotEqualsOperator | base | 1.738 ns | 0.0306 ns | 0.0255 ns | 1.730 ns | 1.712 ns | 1.805 ns | 1.00 | 0.00 | | NotEqualsOperator | diff | 1.401 ns | 0.0425 ns | 0.0355 ns | 1.389 ns | 1.360 ns | 1.476 ns | 0.81 | 0.02 | x86: | Method | Tool | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | |------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:| | EqualsSame | base | 3.164 ns | 0.0234 ns | 0.0208 ns | 3.159 ns | 3.136 ns | 3.203 ns | 1.00 | 0.00 | | EqualsSame | diff | 3.079 ns | 0.0327 ns | 0.0306 ns | 3.074 ns | 3.041 ns | 3.146 ns | 0.97 | 0.01 | | EqualsOperator | base | 2.736 ns | 0.0252 ns | 0.0236 ns | 2.726 ns | 2.710 ns | 2.783 ns | 1.00 | 0.00 | | EqualsOperator | diff | 2.613 ns | 0.0262 ns | 0.0245 ns | 2.600 ns | 2.589 ns | 2.662 ns | 0.95 | 0.01 | | NotEqualsOperator | base | 2.708 ns | 0.0096 ns | 0.0080 ns | 2.705 ns | 2.699 ns | 2.723 ns | 1.00 | 0.00 | | NotEqualsOperator | diff | 2.573 ns | 0.0666 ns | 0.0591 ns | 2.552 ns | 2.526 ns | 2.709 ns | 0.95 | 0.02 |

kunalspathak · 2022-09-30T15:52:51Z

The latest code is much better:

; Assembly listing for method System.Guid:EqualsCore(byref,byref):bool
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )   byref  ->   x1         single-def
;* V02 loc0         [V02    ] (  0,  0   )   byref  ->  zero-ref
;* V03 loc1         [V03    ] (  0,  0   )   byref  ->  zero-ref
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;  V05 rat0         [V05,T02] (  3,  6   )  simd16  ->  d16         HFA(simd16)  "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 0

G_M26697_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
                                                ;; size=8 bbWeight=1    PerfScore 1.50
G_M26697_IG02:
            ld1     {v16.16b}, [x0]
            ld1     {v17.16b}, [x1]
            cmeq    v16.16b, v16.16b, v17.16b
            uminp   v16.4s, v16.4s, v16.4s
            umov    x0, v16.d[0]
            cmn     x0, #1
            cset    x0, eq
                                                ;; size=28 bbWeight=1    PerfScore 10.00
G_M26697_IG03:
            ldp     fp, lr, [sp], #0x10
            ret     lr

BruceForstall added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization labels Apr 29, 2020

BruceForstall added this to the 5.0 milestone Apr 29, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2020

BruceForstall mentioned this issue Apr 30, 2020

Improve Guid equality checks on 64-bit platforms #35654

Closed

BruceForstall removed the untriaged New issue has not been triaged by the area owner label May 4, 2020

kunalspathak mentioned this issue May 5, 2020

Improving ARM64 Performance in .NET 5.0 – Closing the gap with x64 #35853

Closed

myblindy mentioned this issue May 5, 2020

Add an (U)Int128 type #20640

Closed

BruceForstall modified the milestones: 5.0, Future May 5, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 14, 2020

BruceForstall modified the milestones: Future, 6.0.0 Nov 14, 2020

JulieLeeMSFT assigned BruceForstall Mar 23, 2021

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021

BruceForstall removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 8, 2021

BruceForstall modified the milestones: 6.0.0, Future Apr 8, 2021

kunalspathak closed this as completed Sep 30, 2022

ghost locked as resolved and limited conversation to collaborators Oct 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622

[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622

BruceForstall commented Apr 29, 2020 •

edited

Loading

BruceForstall commented Apr 29, 2020

BruceForstall commented Apr 30, 2020

BruceForstall commented Apr 30, 2020

kunalspathak commented Sep 30, 2022

[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622

[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622

Comments

BruceForstall commented Apr 29, 2020 • edited Loading

BruceForstall commented Apr 29, 2020

BruceForstall commented Apr 30, 2020

BruceForstall commented Apr 30, 2020

kunalspathak commented Sep 30, 2022

BruceForstall commented Apr 29, 2020 •

edited

Loading