-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ARM64] Possible perf regression: slicing #41704
Comments
@kunalspathak please look into this. CC @dotnet/jit-contrib |
Noting down some observations, not necessarily related to the regression. runtime/src/coreclr/src/jit/importer.cpp Line 1917 in 6072e4d
Another observation is in .NET 3.1, we didn't JIT SequenceEqual(), but I see that we JIT that method in .NET 5. It is contrary to the fact that it was marked with |
The benchmarks that regressed operates on The jit assembly for G_M18749_IG07:
add x0, fp, #32 // [V01 loc0]
mov w2, #5
bl ReadOnlyMemory`1:Slice(int):struct:this
str x0, [fp,#16] // [V02 loc1]
str x1, [fp,#24] // [V02 loc1+0x08]
mov x0, x20
ldr x2, [x21,#16]
cbnz x2, G_M18749_IG08 ; <----------- This condition checks if obj != null
movz x1, #0xd1ffab1e
movk x1, #0xd1ffab1e LSL #16
movk x1, #0xd1ffab1e LSL #32
bl CORINFO_HELP_RUNTIMEHANDLE_CLASS
mov x2, x0
G_M18749_IG08:
add x1, fp, #16 // [V02 loc1]
mov x0, x2
bl Slice`1:Consume(byref)
mov x0, x20
ldr x1, [x21,#24]
cbnz x1, G_M18749_IG09
movz x1, #0xd1ffab1e
movk x1, #0xd1ffab1e LSL #16
movk x1, #0xd1ffab1e LSL #32
bl CORINFO_HELP_RUNTIMEHANDLE_CLASS
mov x1, x0 For happy path (where G_M6359_IG14:
add x0, fp, #64 // [V01 loc0]
mov w2, #5
bl ReadOnlyMemory`1:Slice(int):ReadOnlyMemory`1:this
str x0, [fp,#48] // [V02 loc1]
str x1, [fp,#56] // [V02 loc1+0x08]
mov x0, x20
ldr x1, [x21,#24]
cbz x1, G_M6359_IG16 ; <----------- This condition now checks if obj == null
;; bbWeight=1 PerfScore 8.50
G_M6359_IG15:
str x1, [fp,#32] // [V14 tmp11]
b G_M6359_IG17
;; bbWeight=0.25 PerfScore 0.50
G_M6359_IG16:
movz x1, #0xd1ffab1e
movk x1, #0xd1ffab1e LSL #16
movk x1, #0xd1ffab1e LSL #32
bl CORINFO_HELP_RUNTIMEHANDLE_CLASS
str x0, [fp,#32] // [V14 tmp11]
;; bbWeight=0.25 PerfScore 0.88
G_M6359_IG17:
add x1, fp, #48 // [V02 loc1]
ldr x0, [fp,#32] // [V14 tmp11]
bl Slice`1:Consume(byref)
mov x0, x20
ldr x1, [x21,#32]
cbz x1, G_M6359_IG19 Now, in happy path, we check for The branches in Edit: In .NET 3.1, the condition would just have 1 arm and so it gets flipped, however in .NET 5, we have 2 arms on that condition, one coming from "expandable generic dictionaries" work that we did in .NET 5, so most like given that, bumping the non-call branch won't make a difference. .NET 3.1 tree:
.NET 5.0 tree
|
Here is my analysis for benchmarks in Below is the assembly code for Consume(memory.Slice(Size / 2)); // private const int Size = 10; In .NET 3.1, here is disassembly. G_M45388_IG06:
B94037B3 ldr w19, [fp,#52] // [V01 loc0+0x0c]
7100167F cmp w19, #5
540005A3 blo G_M45388_IG12
G_M45388_IG07:
F94017B4 ldr x20, [fp,#40] // [V01 loc0]
AA1403E0 mov x0, x20
51001675 sub w21, w19, #5
2A1503E1 mov w1, w21
528000A2 mov w2, #5
F9000FA0 str x0, [fp,#24] // [V21 tmp18] ; this._object = _object
B90023A2 str w2, [fp,#32] // [V22 tmp19] ; this._index = _index
B90027A1 str w1, [fp,#36] // [V23 tmp20] ; this._length = _length
910063A0 add x0, fp, #24 // [V02 loc1]
94000000 bl Slice`1:Consume(byref)
7100167F cmp w19, #5
54000483 blo G_M45388_IG13 In .NET 5, here is disassembly. There are more memory access in G_M6359_IG05:
B9402FA0 ldr w0, [fp,#44] // [V23 tmp20]
7100141F cmp w0, #5
540006C3 blo G_M6359_IG11
;; bbWeight=0.50 PerfScore 1.75
G_M6359_IG06:
B9402BA0 ldr w0, [fp,#40] // [V22 tmp19]
11001400 add w0, w0, #5 ; <-- In .NET 3.1, we constant prop value of _index and turns this to " = 5"
B9402FA1 ldr w1, [fp,#44] // [V23 tmp20] ; <-- could have skipped because we already load it in IG05, the way it is skipped in .NET 3.1
51001421 sub w1, w1, #5
F94013A2 ldr x2, [fp,#32] // [V21 tmp18]
F9000BA2 str x2, [fp,#16] // [V24 tmp21] ; this._object = _object
B9001BA0 str w0, [fp,#24] // [V25 tmp22] ; this._index = _index
B9001FA1 str w1, [fp,#28] // [V26 tmp23] ; this._length = _length
910043A0 add x0, fp, #16 // [V02 loc1]
94000000 bl Slice`1:Consume(byref)
B9402FA0 ldr w0, [fp,#44] // [V23 tmp20] ; <-- could have been skipped
7100141F cmp w0, #5
54000523 blo G_M6359_IG11 I believe some of them are happening because we fail to constant propagate the value of .NET 3.1 dump:
.NET 5.0 dump:
I am still trying to see why we fail to detect |
@CarolEidt pointed out that in .NET 3.1, |
Also, it turns out that the fix for |
Here are the benchmarks that regressed (taken from the description but filtered just the ones that regressed)
|
Actionable item: Need to double check the numbers if we did anything in .NET 6.0 to address #41704 (comment). |
Here is my analysis after comparing the assembly of .NET 6 vs. .NET 3.1. Here is the diff for No zero register used mov x1, #0
str x1, [fp,#48] // [V133 tmp130]
str w1, [fp,#56] // [V134 tmp131]
str w1, [fp,#60] // [V135 tmp132]
b G_M31903_IG05 Update: This PR is out - #52269 Redundant ldr from same locationI noticed redundant blo G_M31903_IG88
ldr x19, [fp,#48] // [V133 tmp130]
ldr w1, [fp,#56] // [V134 tmp131]
add w20, w1, #5
ldr w1, [fp,#60] // [V135 tmp132]
sub w21, w1, #5
ldr x1, [fp,#48] // [V133 tmp130]
cbz x1, G_M31903_IG13 Update: Related issue: #6761 Repeated loading of
|
So most of the regression is happening due to CSE not working for add and sub. During value numbering, we name
The reason is what I mentioned in #41704 (comment). Since it is related to the struct multireg return, I would assign this to @sandreenko . |
Multi-def SSA is a large work item that won't happen in 6.0. Marking it as Future for now. |
After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found few regressions related to slicing.
It looks like these are ARM64 specific regressions, I was not able to reproduce it for ARM (the 32-bit variant).
Repro
@kunalspathak is there any chance you could take a look at the produced assembly code and verify if this is an actual regression in code gen or not?
category:cq
theme:ssa
skill-level:expert
cost:large
The text was updated successfully, but these errors were encountered: