Potentially bad loop invariant code motion for vectors #108092
My understanding is that you can't use volatile registers here since you have a call in the loop, so you will have to do the spill inside the loop.
Well, you're trading an instruction with Latency=3 (e.g. Tiger Lake) in a loop vs a single stack spill/reload. I am not so sure it's cheaper, especially for a long-running loop. cc @dotnet/jit-contrib. PS: if you check the codegen for Linux (SysV x64 ABI) it will emit what you expect, due to the lack of callee-saved float registers.
With PGO we GDV the call, but don't clone the loop because the GDV test is not loop invariant; the dispatch is via a static. Seems like a missed opportunity: if we cloned, the fast loop would have no synchronization points, and we could then CSE the GDV test and end up with a fast loop with no calls. However, we'd still have the slow loop with the call; this would likely lead to the same spill unless we're less aggressive with "cheap" LICM in the slow loop (which seems possible/reasonable). If we are too aggressive, this might make a case for thinking more seriously about shrink-wrapping (deferring prolog saves until "needed"). For FP/vector registers perhaps that is not too bad? I suspect the full example is different and we may in fact clone, so it would be good to study that more closely.
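For readers less familiar with the transformation being discussed, here's a rough Python model of GDV plus loop cloning (names and shapes are illustrative only; the real transformation happens on JIT IR, and the guard in question tests the comparer's concrete type):

```python
# Hypothetical model: guarded devirtualization (GDV) turns an interface call
# into a type test plus an inlined fast path; loop cloning hoists that test
# out of the loop so the fast copy of the loop contains no calls at all.

class DefaultComparer:
    def equals(self, a, b):
        return a == b

class OrdinalIgnoreCase:
    def equals(self, a, b):
        return a.lower() == b.lower()

def find_gdv_only(comparer, items, key):
    # Un-cloned shape: the GDV guard (and the potential call) sits inside
    # the loop, so every iteration pays for the test.
    for i, item in enumerate(items):
        if type(comparer) is DefaultComparer:     # GDV guard, per iteration
            match = item == key                   # inlined fast path
        else:
            match = comparer.equals(item, key)    # virtual call, slow path
        if match:
            return i
    return -1

def find_cloned(comparer, items, key):
    # Cloned shape: test the comparer type once, then run a call-free loop.
    if type(comparer) is DefaultComparer:         # guard hoisted / CSE'd
        for i, item in enumerate(items):
            if item == key:                       # fast clone: no calls
                return i
        return -1
    for i, item in enumerate(items):              # slow clone keeps the call
        if comparer.equals(item, key):
            return i
    return -1
```

The slow clone here is exactly where the spill concern remains: it still contains the call, so any value LICM'd out of it still wants a nonvolatile register.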
Here's the tier 0 / tier 1 disasmo output from the full example on .NET 8 with tiering + PGO enabled.
; Assembly listing for method DisasmHarness:TryGetValue():bool (Instrumented Tier0)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Instrumented Tier0 code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
push rbp
sub rsp, 48
lea rbp, [rsp+0x30]
xor eax, eax
mov qword ptr [rbp-0x08], rax
G_M000_IG02: ;; offset=0x0010
mov rcx, 0x1C6D4C01D60
mov rcx, gword ptr [rcx]
mov rdx, 0x1C6D4C01D68
mov rdx, gword ptr [rdx]
lea r8, [rbp-0x08]
cmp dword ptr [rcx], ecx
call [SimdDictionary.SimdDictionary`2[System.__Canon,long]:TryGetValue(System.__Canon,byref):bool:this]
nop
G_M000_IG03: ;; offset=0x0037
add rsp, 48
pop rbp
ret
; Total bytes of code 61
; Assembly listing for method DisasmHarness:TryGetValue():bool (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 5 inlinees with PGO data; 10 single block inlinees; 3 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp, 56
vzeroupper
vmovaps xmmword ptr [rsp+0x20], xmm6 ;; making space for the search vector that got LICM'd
G_M000_IG02: ;; offset=0x0019
mov rcx, 0x1C6D4C01D60
mov rbx, gword ptr [rcx]
mov rcx, 0x1C6D4C01D68
mov rsi, gword ptr [rcx]
mov rdi, gword ptr [rbx+0x08]
mov rcx, rdi
mov rdx, rsi
mov r11, 0x7FFA3D2F0030
call [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:GetHashCode(System.__Canon):int:this
mov r8, gword ptr [rbx+0x10]
test r8, r8
je G_M000_IG17
G_M000_IG03: ;; offset=0x0057
lea rbp, bword ptr [r8+0x10]
mov r14d, dword ptr [r8+0x08]
G_M000_IG04: ;; offset=0x005F
mov r8d, eax
imul r8, qword ptr [rbx+0x18] ;; fastmod to turn hash code into bucket index
shr r8, 32
inc r8
mov ecx, r14d
imul r8, rcx
shr r8, 32
sub r14d, r8d
dec r14d
movsxd r8, r8d
imul r15, r8, 240 ;; turning bucket index into address of first bucket
add r15, rbp
mov r13, r15
shr eax, 24 ;; compute hash suffix
movzx r8, al
mov ecx, 255 ;; cmov to turn hash suffix into 255 if it's 0
test r8d, r8d
cmovne ecx, r8d
vpbroadcastb xmm6, ecx ;; compute search vector from suffix (this was LICM'd)
G_M000_IG05: ;; offset=0x00A8
movzx r12, byte ptr [r15+0x0E] ;; start of loop body; read count byte from bucket
vpcmpeqb xmm0, xmm6, xmmword ptr [r15] ;; search bucket suffixes for suffix
vpmovmskb r8d, xmm0
tzcnt r8d, r8d ;; convert search result vector to index of first (if any) result
mov rcx, r15
sub r12d, r8d
test r12d, r12d
jg SHORT G_M000_IG10
G_M000_IG06: ;; offset=0x00C6
xor rbx, rbx
G_M000_IG07: ;; offset=0x00C8
test rbx, rbx
je SHORT G_M000_IG13
G_M000_IG08: ;; offset=0x00CD
xor eax, eax
test rbx, rbx
setne al
G_M000_IG09: ;; offset=0x00D5
vmovaps xmm6, xmmword ptr [rsp+0x20] ;; restore xmm6 due to LICM
add rsp, 56
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
ret
G_M000_IG10: ;; offset=0x00EC
movsxd r8, r8d
shl r8, 4
lea rbx, bword ptr [rcx+r8+0x10]
G_M000_IG11: ;; offset=0x00F8
mov r8, gword ptr [rbx]
mov rcx, rdi
mov rdx, rsi
mov r11, 0x7FFA3D2F0038 ;; the method code below appears to force use of xmm6, which makes sense
call [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:Equals(System.__Canon,System.__Canon):bool:this
test eax, eax
je SHORT G_M000_IG18
G_M000_IG12: ;; offset=0x0112
jmp SHORT G_M000_IG07
G_M000_IG13: ;; offset=0x0114
cmp byte ptr [r15+0x0F], 0
je SHORT G_M000_IG16
test r14d, r14d
jg SHORT G_M000_IG14
mov r15, rbp
mov r14d, 0x7FFFFFFF
jmp SHORT G_M000_IG15
G_M000_IG14: ;; offset=0x012B
add r15, 240
dec r14d
G_M000_IG15: ;; offset=0x0135
cmp r15, r13
jne G_M000_IG05
G_M000_IG16: ;; offset=0x013E
xor rbx, rbx
jmp SHORT G_M000_IG08
G_M000_IG17: ;; offset=0x0142
xor rbp, rbp
xor r14d, r14d
jmp G_M000_IG04
G_M000_IG18: ;; offset=0x014C
dec r12d
je G_M000_IG06
add rbx, 16
jmp SHORT G_M000_IG11
; Total bytes of code 347
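The "fastmod" sequence in G_M000_IG04 above (imul / shr / inc / imul / shr) has the same shape as .NET's HashHelpers.FastMod, i.e. Lemire's trick for replacing an integer division with two multiplications. A Python model of what those five instructions appear to compute (assuming the qword loaded from [rbx+0x18] is the precomputed multiplier and the bucket count is the divisor):

```python
MASK64 = (1 << 64) - 1

def get_fastmod_multiplier(divisor: int) -> int:
    # Precomputed once per table (here, presumably what [rbx+0x18] holds).
    return (MASK64 // divisor) + 1

def fast_mod(value: int, divisor: int, multiplier: int) -> int:
    # value % divisor without a div instruction, mirroring the asm:
    #   imul r8, multiplier  -> 64-bit wrapping multiply
    #   shr  r8, 32          -> high half of that product
    #   inc  r8
    #   imul r8, divisor
    #   shr  r8, 32          -> the remainder
    low = (multiplier * value) & MASK64
    return (((low >> 32) + 1) * divisor) >> 32
```

This is exact for 32-bit values when the divisor fits in 31 bits, which is why the JIT can expand it inline with no division at all.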
Still surprised we don't clone for the inner loop GDV. Also don't like seeing the loop split by the epilog. This may be fixed in 9...
I'll run some tests with 9 soon, and also run some tests on ARM64 to see how our codegen looks there. |
@kg, please get back to us on how it looks with 9.
I've been unable to get the LICM to happen inside a BDN benchmark harness with DisassemblyDiagnoser, and I couldn't figure out how to get disasmo working with the .NET 9 preview. Is there an alternate method I should be using to get accurate disassembly from the 9 preview to compare with 8?

Regardless, it looks like 9's codegen for this scenario might be a little better, as the performance is improved. However, I can't get disassembly captured for the 9 preview unless I set AggressiveInlining on all of the code, which means the lookups get woven into the benchmark loop and the result integrity check, which may be what's throwing things off. It's also possible PGO/tiering is the issue here, and the codegen is being influenced by the fact that the benchmark always finds the key it's looking for.

Attaching a ZIP that contains the DisassemblyDiagnoser output. The code size is much smaller than I expected, so I'm not sure what's going on with that either. In disasmo for .NET 8 I get 562 bytes of generated code, vs the 472 reported here.
Re: the loop split, it looks like this still happens in 9. I tried changing the disassembly harness so that the TryGetValue call alternates between succeeding and failing, and that seems to have moved the epilog further down towards the end of the method on .NET 8, so I think it comes down to tiering/PGO having previously seen that the path taken through the method was always the same (since I was always looking up the same key in the disasmo harness, unlike the BDN benchmarks). I'll note that once I changed it to call TryGetValue twice per iteration, it added a message like this to the disassembly. I don't know if it means anything or not, but I figured I'd cite it here: "edge weights are invalid" seems weird.
The "edge weights are invalid" is a .NET 8 and earlier thing. Until recently the JIT would try to derive flow graph edge weights from block weights and (if things were sufficiently messy) might fail to find a suitable set of weights, and so declare them invalid. This is gone in .NET 9.

It's possible that the PGO data here is a bit thin. Since your benchmark method has a loop it may inspire OSR, and depending on how BDN stages things, the run may not get too many cycles in Tier0+instr and/or not see profile data for inlinees. The OSR+PGO interaction is not as smooth as we'd like. If you're on Windows you can (as admin) run BDN with
There is a little yellow
It all works fine for me on older .NET versions. I don't use the IDE for running perf tests, so I don't know if the dependency warning is normal or not (I suspect it might be "normal"/OK, as the diagnostics library is only used by the projects BDN creates, not by the benchmark project itself). Maybe @adamsitnik can help?
There are multiple public types in that library: https://github.com/dotnet/BenchmarkDotNet/blob/346bbab62a508fbce8179965ba05452e7a361367/src/BenchmarkDotNet.Diagnostics.Windows/EtwProfiler.cs#L22
Most likely the packages have not been restored properly. Could you please run
@kg, please try this and let us know what you find out.
With the restore fixed I'm able to apply
I'm not sure how to interpret this, but I don't see any mention of OSR. Unless the unmappable addresses are OSR code?
If things in BDN are working properly you won't see OSR methods during the measurement intervals, and your data above shows just that. The unmappable addresses are most likely indirection stubs or VSD or something similar that the runtime doesn't describe via ETL. So all the key methods being measured are Tier1. That is good. But the question is whether the earlier runs of methods during BDN's warmup are generating sufficiently good profile data for the JIT.
Sorting all this out is a bit painful, because you now need to look across the multiple versions of potentially multiple methods to get the full story. I would probably start by collecting an SPMI trace of the benchmark on a release corerun with a checked JIT, and then generating a JIT dump of the key methods above from that via SPMI replay. (You can also generate the dump directly while running BDN, but there's a chance this throws off the timing, since all the dump logging is slow.) I will try to set aside some time soon to dig in. FYI @dotnet/jit-contrib in case anyone else is up for taking a look.
FWIW, I did some more testing/measurements and I think I'm convinced that this particular LICM decision is not, on average, a bad one. It's just unfortunate that it results in a stack spill, when without the LICM I could potentially do some sort of dance where the value gets pre-generated in a volatile register and then generated again after any method calls trample the contents of the volatile vector register. I may do some tests later to see if that's possible with the current state of the JIT. I'm not sure how common this scenario of "it would be nice to keep this data around, but a stack spill isn't necessary, since I can regenerate it more cheaply" is.

To try and re-capture things and summarize: at present I'm keeping the search vector around and discarding the suffix+hashcode, so the vector has to live in a nonvolatile vector register. If I were to retain the suffix in a scalar register (I tried this in the past because in some cases I have a scalar register free for it) and re-generate the search vector on every use, then I wouldn't need it to live in a nonvolatile vector register, and in some cases that's what the JIT would do (put it in a volatile one). But the LICM forces the search vector into a nonvolatile register and causes a stack spill.

Maybe I should re-phrase the issue title to something like "Loop Invariant Code Motion for vectors can force use of non-volatile registers for temporaries"? I'm not sure that's accurate, though.
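To make the "regenerate instead of hoist" idea concrete, here's a rough Python model of the probe as I read the disassembly above (the bucket layout, names, and the 0-to-255 remapping are inferred from the asm, not taken from the library's actual code). The point is that the search vector is a pure function of a one-byte suffix, so a caller that kept the suffix in a scalar register could rebuild the vector with a single vpbroadcastb after each call, instead of keeping it live in a nonvolatile xmm register across calls:

```python
def hash_suffix(hash32: int) -> int:
    # shr eax, 24 / cmovne: take the hash's top byte, remapping 0 to 255
    # (presumably so a 0 byte can mean "empty slot" in the bucket).
    suffix = (hash32 >> 24) & 0xFF
    return suffix if suffix != 0 else 255

def make_search_vector(suffix: int) -> bytes:
    # vpbroadcastb xmm6, ecx: 16 copies of the suffix byte. This is the
    # value that got LICM'd into nonvolatile xmm6; it is one cheap
    # instruction to regenerate from the scalar suffix.
    return bytes([suffix]) * 16

def find_suffix(bucket_suffixes: bytes, search: bytes) -> int:
    # vpcmpeqb + vpmovmskb + tzcnt: index of the first byte of the bucket
    # that matches the search vector, or 32 (tzcnt of a zero mask) if none.
    mask = 0
    for i in range(16):
        if bucket_suffixes[i] == search[i]:
            mask |= 1 << i
    if mask == 0:
        return 32
    return (mask & -mask).bit_length() - 1  # index of lowest set bit
```

A real probe would then compare actual keys starting at that index; this only models the SIMD filter step that motivates keeping the search vector alive.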
CC @jakobbotsch for this loop-invariant-related issue.
Description
If you have a loop that does `Vector128.Create(scalar)` each iteration on purpose, RyuJIT can LICM the vector constant out of the loop. This sounds great in theory, but if you look at the generated code, this can move the cached vpbroadcastb result into xmm6 or a similar nonvolatile register, which can force your method to spill the previous value of xmm6 to the stack on entry and restore it on exit. In my real-world scenario, without this LICM the stack isn't used other than some regular register pushes at entry.

Simple method calls or method-call-free loop bodies don't seem to trigger this; RyuJIT will happily use the volatile xmm registers instead and everything is good. An IEqualityComparer.Equals call inside the loop body is enough to force use of xmm6, though.
Reproduction Steps
This demonstrates the LICM in a toy scenario. For a real-world scenario, https://github.com/kg/SimdDictionary/blob/e42d42f7158d2552c97e1a517b82d4bb32750456/DisasmHarness/DisasmHarness.cs#L14 when disasmo'd will demonstrate it.
Expected behavior
Neither of the `vpbroadcastb`s should be LICM'd out of the loop if they end up in a nonvolatile register, IMO, since doing that forces a stack spill at entry and a stack restore at exit. LICMing them into volatile registers feels fine.

Actual behavior
The provably-constant `vpbroadcastb` is LICM'd out of the loop, which is correct but potentially deleterious for performance.

Regression?
Don't know. Can't test, since my real use case won't run on .NET 7.
Known Workarounds
EDIT: Changing the type of the `scalar` parameter to `in byte` prevents the broadcast from being LICM'd. Putting in a Thread.MemoryBarrier doesn't stop the broadcast from moving out of the loop (past the barrier), which is a little surprising, but probably completely safe. You can also block LICM by falsely taking the address of the local containing the scalar, i.e.

EDIT 2: Inserting a `Sse2.MemoryFence()` right before the broadcast operation stops it from being LICM'd, but it's pretty clear that it isn't a good solution.

EDIT 3: A `Sse.Prefetch0(...)` also works as a LICM barrier, but seems to generate extra wasted code to calculate the address to prefetch for some reason.
Microsoft Visual Studio Community 2022 (64-bit) - Current
Version 17.11.3
Other information
This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. From what I know, though, vpbroadcastb has extremely low latency/cost, so in this case it's potentially worse. The code I'm optimizing typically runs its loop body 0-1 times, so the LICM causes a performance hit.