[Arm64] addressing mode inefficiencies in Guid:op_Equality(Guid,Guid):bool #35622
Labels: arch-arm64, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization
BruceForstall added the arch-arm64, area-CodeGen-coreclr, and optimization labels on Apr 29, 2020
Dotnet-GitSync-Bot added the untriaged label (New issue has not been triaged by the area owner) on Apr 29, 2020
Related: #35635
Changing the Guid:op_Equality implementation to:
fixes the odd address base recalculations and only leaves the unnecessary stack usage.
arm64 assembly with new Guid implementation:
G_M51749_IG01:
A9BD7BFD stp fp, lr, [sp,#-48]!
910003FD mov fp, sp
F90013A0 str x0, [fp,#32]
F90017A1 str x1, [fp,#40]
F9000BA2 str x2, [fp,#16]
F9000FA3 str x3, [fp,#24]
;; bbWeight=1 PerfScore 5.50
G_M51749_IG02:
F94013A0 ldr x0, [fp,#32]
F9400BA1 ldr x1, [fp,#16]
EB01001F cmp x0, x1
540000E1 bne G_M51749_IG05
;; bbWeight=1 PerfScore 5.50
G_M51749_IG03:
F94017A0 ldr x0, [fp,#40]
F9400FA1 ldr x1, [fp,#24]
EB01001F cmp x0, x1
9A9F17E0 cset x0, eq
;; bbWeight=0.50 PerfScore 2.50
G_M51749_IG04:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
;; bbWeight=0.50 PerfScore 1.00
G_M51749_IG05:
52800000 mov w0, #0
;; bbWeight=0.50 PerfScore 0.25
G_M51749_IG06:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
BruceForstall added a commit to BruceForstall/runtime that referenced this issue on Apr 30, 2020
Current code does four 32-bit comparisons. Instead, do two 64-bit comparisons. On x86, the JIT-generated code is slightly different, but equally fast. This will be even better on arm64 which passes everything in 64-bit registers, after dotnet#35622 is addressed in the JIT.

Perf results:

x64:

| Method | Tool | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
|------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:|
| EqualsSame | base | 2.322 ns | 0.0210 ns | 0.0187 ns | 2.324 ns | 2.294 ns | 2.351 ns | 1.00 | 0.00 |
| EqualsSame | diff | 1.547 ns | 0.0092 ns | 0.0071 ns | 1.547 ns | 1.535 ns | 1.559 ns | 0.67 | 0.01 |
| EqualsOperator | base | 2.890 ns | 0.3896 ns | 0.4169 ns | 3.030 ns | 1.722 ns | 3.074 ns | 1.00 | 0.00 |
| EqualsOperator | diff | 1.346 ns | 0.0160 ns | 0.0150 ns | 1.346 ns | 1.331 ns | 1.380 ns | 0.49 | 0.12 |
| NotEqualsOperator | base | 1.738 ns | 0.0306 ns | 0.0255 ns | 1.730 ns | 1.712 ns | 1.805 ns | 1.00 | 0.00 |
| NotEqualsOperator | diff | 1.401 ns | 0.0425 ns | 0.0355 ns | 1.389 ns | 1.360 ns | 1.476 ns | 0.81 | 0.02 |

x86:

| Method | Tool | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
|------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:|
| EqualsSame | base | 3.164 ns | 0.0234 ns | 0.0208 ns | 3.159 ns | 3.136 ns | 3.203 ns | 1.00 | 0.00 |
| EqualsSame | diff | 3.079 ns | 0.0327 ns | 0.0306 ns | 3.074 ns | 3.041 ns | 3.146 ns | 0.97 | 0.01 |
| EqualsOperator | base | 2.736 ns | 0.0252 ns | 0.0236 ns | 2.726 ns | 2.710 ns | 2.783 ns | 1.00 | 0.00 |
| EqualsOperator | diff | 2.613 ns | 0.0262 ns | 0.0245 ns | 2.600 ns | 2.589 ns | 2.662 ns | 0.95 | 0.01 |
| NotEqualsOperator | base | 2.708 ns | 0.0096 ns | 0.0080 ns | 2.705 ns | 2.699 ns | 2.723 ns | 1.00 | 0.00 |
| NotEqualsOperator | diff | 2.573 ns | 0.0666 ns | 0.0591 ns | 2.552 ns | 2.526 ns | 2.709 ns | 0.95 | 0.02 |
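The commit message describes replacing four 32-bit comparisons with two 64-bit comparisons. The following is only a rough C model of that idea, not the actual C# change: the `guid_model` struct and the `guid_equals_2x64` function are hypothetical names, and the field layout (one 32-bit int, two 16-bit shorts, eight bytes) is assumed from the issue description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the Guid layout: int, two shorts, eight bytes. */
typedef struct {
    int32_t a;
    int16_t b;
    int16_t c;
    uint8_t d[8];
} guid_model;

static_assert(sizeof(guid_model) == 16, "model must be exactly 16 bytes");

/* Compare the 16 bytes as two 64-bit halves instead of four 32-bit chunks. */
bool guid_equals_2x64(const guid_model *x, const guid_model *y)
{
    uint64_t xs[2], ys[2];
    memcpy(xs, x, sizeof xs); /* memcpy sidesteps aliasing and alignment issues */
    memcpy(ys, y, sizeof ys);
    return xs[0] == ys[0] && xs[1] == ys[1];
}
```

On a 64-bit target each half fits in one general-purpose register, which matches the two `cmp x0, x1` instructions in the arm64 listing above.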
BruceForstall removed the untriaged label (New issue has not been triaged by the area owner) on May 4, 2020
Closed
JulieLeeMSFT added the needs-further-triage label (Issue has been initially triaged, but needs deeper consideration or reconsideration) on Mar 23, 2021
BruceForstall removed the needs-further-triage label (Issue has been initially triaged, but needs deeper consideration or reconsideration) on Apr 8, 2021
The latest code is much better:
; Assembly listing for method System.Guid:EqualsCore(byref,byref):bool
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def
; V01 arg1 [V01,T01] ( 3, 3 ) byref -> x1 single-def
;* V02 loc0 [V02 ] ( 0, 0 ) byref -> zero-ref
;* V03 loc1 [V03 ] ( 0, 0 ) byref -> zero-ref
;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
; V05 rat0 [V05,T02] ( 3, 6 ) simd16 -> d16 HFA(simd16) "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 0
G_M26697_IG01:
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M26697_IG02:
ld1 {v16.16b}, [x0]
ld1 {v17.16b}, [x1]
cmeq v16.16b, v16.16b, v17.16b
uminp v16.4s, v16.4s, v16.4s
umov x0, v16.d[0]
cmn x0, #1
cset x0, eq
;; size=28 bbWeight=1 PerfScore 10.00
G_M26697_IG03:
ldp fp, lr, [sp], #0x10
ret lr
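To make the SIMD sequence in the listing above easier to follow, here is a scalar C model of what the `cmeq` / `uminp` / `umov` / `cmn` / `cset` instructions compute. It is a sketch for illustration only: the function name `equals16_simd_model` is hypothetical, and it emulates the lanes with plain arrays rather than using NEON intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Scalar emulation of the 16-byte equality check in the listing. */
bool equals16_simd_model(const void *p, const void *q)
{
    uint8_t a[16], b[16], mask[16];
    memcpy(a, p, 16); /* ld1 {v16.16b}, [x0] */
    memcpy(b, q, 16); /* ld1 {v17.16b}, [x1] */

    /* cmeq v16.16b: each byte becomes 0xFF where equal, 0x00 where different. */
    for (int i = 0; i < 16; i++)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;

    /* uminp v16.4s with the same source twice: the low two result lanes
       are min(lane0, lane1) and min(lane2, lane3). */
    uint32_t lane[4];
    memcpy(lane, mask, sizeof lane);
    uint32_t lo = lane[0] < lane[1] ? lane[0] : lane[1];
    uint32_t hi = lane[2] < lane[3] ? lane[2] : lane[3];

    /* umov x0, v16.d[0]: take the low 64 bits of the reduced vector. */
    uint64_t x0 = ((uint64_t)hi << 32) | lo;

    /* cmn x0, #1 ; cset x0, eq: true only when x0 + 1 wraps to zero,
       i.e. x0 is all ones, i.e. every byte compared equal. */
    return x0 + 1 == 0;
}
```

The pairwise minimum acts as a reduction: any mismatching byte leaves a zero somewhere in a 32-bit lane, which makes that lane smaller than 0xFFFFFFFF and so survives the minimum.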
ghost locked as resolved and limited conversation to collaborators on Oct 30, 2022
The arm64 generated code for Guid::op_Equality() could be better by (1) incorporating the fp address calculation into the ldr addressing modes, and (2) not using stack at all.
The code:
This code itself is weird, comparing 4 int values instead of comparing field-by-field of one int, two short, and eight byte. It should compare 2 long on 64-bit.
x64 code is pretty direct translation of this C# code.
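As a rough illustration of the comparison described above, the following C sketch models a 16-byte value with one int, two shorts, and eight bytes, and compares it as four 32-bit chunks. This is not the C# source from the issue: `guid_model` and `guid_equals_4x32` are hypothetical names, and the layout is assumed from the field description in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the Guid layout: int, two shorts, eight bytes. */
typedef struct {
    int32_t a;
    int16_t b;
    int16_t c;
    uint8_t d[8];
} guid_model;

static_assert(sizeof(guid_model) == 16, "model must be exactly 16 bytes");

/* Compare the 16 bytes as four 32-bit chunks, ignoring field boundaries. */
bool guid_equals_4x32(const guid_model *x, const guid_model *y)
{
    uint32_t xs[4], ys[4];
    memcpy(xs, x, sizeof xs); /* memcpy sidesteps aliasing and alignment issues */
    memcpy(ys, y, sizeof ys);
    return xs[0] == ys[0] && xs[1] == ys[1]
        && xs[2] == ys[2] && xs[3] == ys[3];
}
```

Each chunk needs its own load and compare, which is why a 64-bit target does twice the work it would need with two 64-bit comparisons.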
arm64 first pushes the two 16-byte struct-in-register-pair arguments to the stack, then reloads each 4-byte element one at a time to compare. The base address of the stack locals is computed over and over, instead of being folded into the subsequent addressing modes that add the offset.
x64 assembly
arm64 assembly
Possible arm64 assembly after fixing address calculations
The JIT shouldn't need to put the argument structs on the stack at all, in which case we could generate code like the following (also assuming we can compare full registers, and not 4 bytes at a time).
Possible arm64 assembly fully optimized
category:cq
theme:optimization
skill-level:intermediate
cost:medium