Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potentially bad loop invariant code motion for vectors #108092

Open
kg opened this issue Sep 21, 2024 · 19 comments
Open

Potentially bad loop invariant code motion for vectors #108092

kg opened this issue Sep 21, 2024 · 19 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kg
Copy link
Member

kg commented Sep 21, 2024

Description

If you have a loop that does Vector128.Create(scalar) each iteration on purpose, RyuJIT can LICM the vector constant out of the loop. This sounds great in theory, but if you look at the generated code, this can move the cached vpbroadcastb result into xmm6 or a similar nonvolatile register, which can force your method to spill the previous value of xmm6 to the stack on entry and restore it on exit. In my real-world scenario, without this LICM the stack isn't used other than some regular register pushes at entry.

Simple method calls or method-call-free loop bodies don't seem to trigger this, RyuJIT will happily use the volatile xmm registers instead and everything is good. An IEqualityComparer.Equals call inside the loop body is enough to force use of xmm6, though.

Reproduction Steps

This demonstrates the LICM in a toy scenario. For a real-world scenario, https://github.com/kg/SimdDictionary/blob/e42d42f7158d2552c97e1a517b82d4bb32750456/DisasmHarness/DisasmHarness.cs#L14 when disasmo'd will demonstrate it.

    public static IEqualityComparer<byte> Comparer = EqualityComparer<byte>.Default;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void VectorLicm (byte scalar) {
        int result = 0;
        for (int i = 0; i < 256; i++) {
            var mask = Vector128.Equals(Vector128.Create(scalar), Vector128.Create(unchecked((byte)i)));
            if (!Comparer.Equals(mask.ToScalar(), 0))
                result++;
        }
    }

Expected behavior

Neither of the vpbroadcastb's should be LICM'd out of the loop if they end up in a nonvolatile register IMO, since doing that forces a stack spill at entry and a stack restore at exit. LICMing them into volatile registers feels fine.

Actual behavior

The provably-constant vpbroadcastb is LICM'd out of the loop, which is correct but potentially deleterious for performance

; Assembly listing for method DisasmHarness:VectorLicm(ubyte) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 48
       vzeroupper 
       vmovaps  xmmword ptr [rsp+0x20], xmm6
 
G_M000_IG02:
       xor      ebx, ebx
       xor      esi, esi
       movzx    rdx, cl
       vpbroadcastb  xmm6, edx
       mov      rdi, 0xD1FFAB1E
 
G_M000_IG03:
       movzx    rdx, sil
       vpbroadcastb  xmm0, edx
       vpcmpeqb xmm0, xmm6, xmm0
       vmovd    edx, xmm0
       movzx    rdx, dl
       mov      rcx, gword ptr [rdi]
       mov      r11, 0xD1FFAB1E
       xor      r8d, r8d
       call     [r11]System.Collections.Generic.IEqualityComparer`1[ubyte]:Equals(ubyte,ubyte):bool:this
       test     eax, eax
       jne      SHORT G_M000_IG05
 
G_M000_IG04:
       inc      ebx
 
G_M000_IG05:
       inc      esi
       cmp      esi, 256
       jl       SHORT G_M000_IG03
 
G_M000_IG06:
       vmovaps  xmm6, xmmword ptr [rsp+0x20]
       add      rsp, 48
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
 
; Total bytes of code 109

Regression?

Don't know. Can't test since my real use case won't work on NET7.

Known Workarounds

EDIT: Changing the type of the scalar parameter to in byte prevents the broadcast from being LICM'd. Putting a Thread.MemoryBarrier doesn't stop the broadcast from moving out of the loop (past the barrier), which is a little surprising, but probably completely safe. You can also block LICM by falsely taking the address of the local containing the scalar, i.e.

...
ref var temp = ref scalar;
...
return temp;

EDIT 2: Inserting a Sse2.MemoryFence() right before the broadcast operation stops it from being LICM'd, but it's pretty clear that it isn't a good solution.

EDIT 3: A Sse.Prefetch0(...) also works as a LICM barrier but seems to generate extra wasted code to calculate the address to prefetch for some reason.

Configuration

Microsoft Visual Studio Community 2022 (64-bit) - Current
Version 17.11.3

PS C:\Users\kg> dotnet --info
.NET SDK:
 Version:           8.0.400
 Commit:            36fe6dda56
 Workload version:  8.0.400-manifests.6c274a57
 MSBuild version:   17.11.3+0c8610977

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22631
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\8.0.400\

Other information

This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. vpbroadcastb has extremely low latency/cost from what I know though, so in this case it's potentially worse. The code I'm optimizing typically runs its loop body 0-1 times, so the LICM causes a performance hit.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 21, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 21, 2024
@EgorBo
Copy link
Member

EgorBo commented Sep 21, 2024

LICMing them into volatile registers feels fine.

My understanding that you can't use volatile registers here since you have a call in the loop, so you will have to do spill inside the loop.

This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. vpbroadcastb has extremely low latency/cost from what I know though, so in this case it's potentially worse.

Well, you're trading an instruction with Latency=3 (e.g. Tiger Lake) in a loop vs a single stack spill/reload, I am not so sure it's cheaper, especially, for a long-running loop. cc @dotnet/jit-contrib

PS: if you check codegen for Linux (SysV 64 ABI) it will emit what you expect due to lack of callee-saved float regs

@AndyAyersMS
Copy link
Member

With PGO we GVD the call, but don't clone the loop because the GCV test is not loop invariant, the dispatch is via a static. Seems like a missed opportunity, if we cloned, the fast loop would have no synchronization points and we could then CSE the GDV and end up with a fast loop with no calls.

However we'd still have the slow loop with the call; this would likely lead to the same spill unless we're less aggressive with "cheap" LICM in the slow loop (which seems possible/reasonable). If we are too aggressive this might make a case for thinking more seriously about shrink-wrapping (deferring prolog saves until "needed"). For FP/Vector perhaps it is not too bad?

I suspect the full example is different and we may in fact clone, so would be good to study that more closely.

@kg
Copy link
Member Author

kg commented Sep 21, 2024

Here's the tier 0 / tier 1 disasmo from the full example on NET8 with tiering + PGO enabled. In this case, it's SimdDictionary<string, long>. Some various optimizations I did have managed to get the stack spills out of the loop itself, so the only real use of the stack is the xmm6 save/restore upon entry and exit.

; Assembly listing for method DisasmHarness:TryGetValue():bool (Instrumented Tier0)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Instrumented Tier0 code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       xor      eax, eax
       mov      qword ptr [rbp-0x08], rax
 
G_M000_IG02:                ;; offset=0x0010
       mov      rcx, 0x1C6D4C01D60
       mov      rcx, gword ptr [rcx]
       mov      rdx, 0x1C6D4C01D68
       mov      rdx, gword ptr [rdx]
       lea      r8, [rbp-0x08]
       cmp      dword ptr [rcx], ecx
       call     [SimdDictionary.SimdDictionary`2[System.__Canon,long]:TryGetValue(System.__Canon,byref):bool:this]
       nop      
 
G_M000_IG03:                ;; offset=0x0037
       add      rsp, 48
       pop      rbp
       ret      
 
; Total bytes of code 61

; Assembly listing for method DisasmHarness:TryGetValue():bool (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 5 inlinees with PGO data; 10 single block inlinees; 3 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 56
       vzeroupper 
       vmovaps  xmmword ptr [rsp+0x20], xmm6 ;; making space for the search vector that got LICM'd
 
G_M000_IG02:                ;; offset=0x0019
       mov      rcx, 0x1C6D4C01D60
       mov      rbx, gword ptr [rcx]
       mov      rcx, 0x1C6D4C01D68
       mov      rsi, gword ptr [rcx]
       mov      rdi, gword ptr [rbx+0x08]
       mov      rcx, rdi
       mov      rdx, rsi
       mov      r11, 0x7FFA3D2F0030
       call     [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:GetHashCode(System.__Canon):int:this
       mov      r8, gword ptr [rbx+0x10]
       test     r8, r8
       je       G_M000_IG17
 
G_M000_IG03:                ;; offset=0x0057
       lea      rbp, bword ptr [r8+0x10]
       mov      r14d, dword ptr [r8+0x08]
 
G_M000_IG04:                ;; offset=0x005F
       mov      r8d, eax
       imul     r8, qword ptr [rbx+0x18] ;; fastmod to turn hash code into bucket index
       shr      r8, 32
       inc      r8
       mov      ecx, r14d
       imul     r8, rcx
       shr      r8, 32
       sub      r14d, r8d
       dec      r14d
       movsxd   r8, r8d
       imul     r15, r8, 240 ;; turning bucket index into address of first bucket
       add      r15, rbp
       mov      r13, r15
       shr      eax, 24 ;; compute hash suffix
       movzx    r8, al
       mov      ecx, 255 ;; cmov to turn hash suffix into 255 if it's 0
       test     r8d, r8d
       cmovne   ecx, r8d
       vpbroadcastb  xmm6, ecx ;; compute search vector from suffix (this was LICM'd)
 
G_M000_IG05:                ;; offset=0x00A8
       movzx    r12, byte  ptr [r15+0x0E] ;; start of loop body; read count byte from bucket
       vpcmpeqb xmm0, xmm6, xmmword ptr [r15] ;; search bucket suffixes for suffix
       vpmovmskb r8d, xmm0
       tzcnt    r8d, r8d ;; convert search result vector to index of first (if any) result
       mov      rcx, r15
       sub      r12d, r8d
       test     r12d, r12d
       jg       SHORT G_M000_IG10
 
G_M000_IG06:                ;; offset=0x00C6
       xor      rbx, rbx
 
G_M000_IG07:                ;; offset=0x00C8
       test     rbx, rbx
       je       SHORT G_M000_IG13
 
G_M000_IG08:                ;; offset=0x00CD
       xor      eax, eax
       test     rbx, rbx
       setne    al
 
G_M000_IG09:                ;; offset=0x00D5
       vmovaps  xmm6, xmmword ptr [rsp+0x20] ;; restore xmm6 due to LICM
       add      rsp, 56
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       ret      
 
G_M000_IG10:                ;; offset=0x00EC
       movsxd   r8, r8d
       shl      r8, 4
       lea      rbx, bword ptr [rcx+r8+0x10]
 
G_M000_IG11:                ;; offset=0x00F8
       mov      r8, gword ptr [rbx]
       mov      rcx, rdi
       mov      rdx, rsi
       mov      r11, 0x7FFA3D2F0038 ;; the method code below appears to force use of xmm6, which makes sense
       call     [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:Equals(System.__Canon,System.__Canon):bool:this
       test     eax, eax
       je       SHORT G_M000_IG18
 
G_M000_IG12:                ;; offset=0x0112
       jmp      SHORT G_M000_IG07
 
G_M000_IG13:                ;; offset=0x0114
       cmp      byte  ptr [r15+0x0F], 0
       je       SHORT G_M000_IG16
       test     r14d, r14d
       jg       SHORT G_M000_IG14
       mov      r15, rbp
       mov      r14d, 0x7FFFFFFF
       jmp      SHORT G_M000_IG15
 
G_M000_IG14:                ;; offset=0x012B
       add      r15, 240
       dec      r14d
 
G_M000_IG15:                ;; offset=0x0135
       cmp      r15, r13
       jne      G_M000_IG05
 
G_M000_IG16:                ;; offset=0x013E
       xor      rbx, rbx
       jmp      SHORT G_M000_IG08
 
G_M000_IG17:                ;; offset=0x0142
       xor      rbp, rbp
       xor      r14d, r14d
       jmp      G_M000_IG04
 
G_M000_IG18:                ;; offset=0x014C
       dec      r12d
       je       G_M000_IG06
       add      rbx, 16
       jmp      SHORT G_M000_IG11
 
; Total bytes of code 347

@AndyAyersMS
Copy link
Member

Still surprised we don't clone for the inner loop GDV.

Also don't like seeing the loop split by the epilog. This may be fixed in 9...

@kg
Copy link
Member Author

kg commented Sep 22, 2024

I'll run some tests with 9 soon, and also run some tests on ARM64 to see how our codegen looks there.

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 23, 2024
@JulieLeeMSFT
Copy link
Member

@kg, please get back to us how it looks with 9.

I'll run some tests with 9 soon, and also run some tests on ARM64 to see how our codegen looks there.

@kg
Copy link
Member Author

kg commented Sep 23, 2024

I've been unable to get the LICM to happen inside a BDN benchmark harness w/DisassemblyDiagnoser, and I couldn't figure out how to get disasmo working with the .NET 9 preview. Is there an alternate method I should be using to get accurate disassembly from the 9 preview to compare with 8?

Regardless, it looks like 9's codegen for this scenario might be a little better, as the performance is improved. However, I can't get disassembly captured for the 9 preview unless I set AggressiveInlining for all of the code, which means the lookups get woven in to the benchmark loop and the result integrity check, which may be what's throwing things off. It's also possible PGO/tiering is the issue here, and the codegen is being influenced by the fact that the benchmark always finds the key it's looking for.

Attaching a ZIP that contains the DisassemblyDiagnoser output for a <string, long> lookup w/default comparer on 8.0 and 9.0p1.
Benchmarks.StringLookup-report.zip

The code size is much smaller than I expected, so I'm not sure what's going on with that either. In disasmo for net8 I get 562 bytes of generated code, vs the 472 reported here.

Method Runtime Mean Error StdDev Ratio RatioSD Code Size
FindExistingSIMD .NET 8.0 511.6 μs 9.33 μs 8.73 μs 1.00 0.02 472 B
FindExistingSIMD .NET 9.0 487.8 μs 8.96 μs 8.38 μs 0.95 0.02 467 B

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 23, 2024
@kg
Copy link
Member Author

kg commented Sep 24, 2024

Still surprised we don't clone for the inner loop GDV.

Also don't like seeing the loop split by the epilog. This may be fixed in 9...

Re: the loop split, it looks like this still happens in 9. I tried changing the disassembly harness so that the TryGetValue call alternates between succeeding and failing, and that seems to have moved the epilog further down towards the end of the method in NET8, so I think it comes down to the tiering/PGO having previously seen that the path taken through the method was always the same (since I was always looking up the same key in the disasmo harness, unlike the BDN benchmarks.)

I'll note that once I changed it to call TryGetValue twice per iteration, it added a message like this to the disassembly. I don't know if it means anything or not but I figured I'd cite it here. "edge weights are invalid" seems weird.

; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 2110
; 10 inlinees with PGO data; 20 single block inlinees; 6 inlinees without PGO data

@AndyAyersMS
Copy link
Member

The "edge weights are invalid" is a .NET 8 and earlier thing. Until recently the JIT would try and derive flow graph edge weights from block weights and (if things were sufficiently messy) might fail to find a suitable set of weights and so declare them invalid. This is gone in .NET 9.

It's possible that the PGO data here is a bit thin. Since your benchmark method has a loop it may inspire OSR and depending on how BDN stages things the run may not get too many cycles in Tier0+instr and/or not see profile data for inlinees. OSR+PGO interaction is not as smooth as we'd like.

If you're on windows you can (as admin) run BDN with -p ETW (assuming you're using BenchmarkSwitcher) and pass the resulting file to jitutils/instructionsretiredeexplorer with the -benchmark option to see if any BDN measurement intervals involved OSR code.

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 25, 2024
@kg
Copy link
Member Author

kg commented Oct 10, 2024

The "edge weights are invalid" is a .NET 8 and earlier thing. Until recently the JIT would try and derive flow graph edge weights from block weights and (if things were sufficiently messy) might fail to find a suitable set of weights and so declare them invalid. This is gone in .NET 9.

It's possible that the PGO data here is a bit thin. Since your benchmark method has a loop it may inspire OSR and depending on how BDN stages things the run may not get too many cycles in Tier0+instr and/or not see profile data for inlinees. OSR+PGO interaction is not as smooth as we'd like.

If you're on windows you can (as admin) run BDN with -p ETW (assuming you're using BenchmarkSwitcher) and pass the resulting file to jitutils/instructionsretiredeexplorer with the -benchmark option to see if any BDN measurement intervals involved OSR code.

-p ETW gives me an error about the IProfiler diagnoser, I've installed BenchmarkDotNet.Diagnostics.Windows but it's not working. It suggests 'consuming one of its public types' but none of the types defined in that assembly are available even though I added a dependency. Is this something that doesn't work for NET8 targets and I need to retarget to the NET9 SDK for it to work?

There is a little yellow ⚠️ on the Diagnostics dependency in Packages but there are no errors or warnings in the build or package manager logs.

@dotnet-policy-service dotnet-policy-service bot removed needs-author-action An issue or pull request that requires more info or actions from the author. no-recent-activity labels Oct 10, 2024
@AndyAyersMS
Copy link
Member

It all works fine for me on older .NET versions. I don't use the IDE for running perf tests so don't know if the dependency warning is normal or not (suspect it might "normal"/ok as the diagnostics is only used by the projects BDN creates, not by the benchmark project itself).

Maybe @adamsitnik can help?

@adamsitnik
Copy link
Member

It suggests 'consuming one of its public types' but none of the types defined in that assembly are available even though I added a dependency.

There are multiple public types in that library: https://github.com/dotnet/BenchmarkDotNet/blob/346bbab62a508fbce8179965ba05452e7a361367/src/BenchmarkDotNet.Diagnostics.Windows/EtwProfiler.cs#L22

There is a little yellow ⚠️ on the Diagnostics dependency in Packages but there are no errors or warnings in the build or package manager logs.

most likely the packages has not been restored properly. Could you please run dotnet restore, then add [EtwProfier] attribute to your class with benchmarks and then run the app as admin in Release mode again?

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 29, 2024
@JulieLeeMSFT
Copy link
Member

Could you please run dotnet restore, then add [EtwProfier] attribute to your class with benchmarks and then run the app as admin in Release mode again?

@kg, please try this and let us know what you find out.

@kg
Copy link
Member Author

kg commented Oct 29, 2024

dotnet restore from powershell worked, so the problem there was that visual studio nuget restore is broken, at least on my work machine in 17.11.5. I had tried a few different methods to get it to restore previously, but not that.

With the restore fixed I'm able to apply EtwProfiler to the test suite and do -p ETW successfully in an administrator terminal. It looks like it generated an .etl file which I will dig into later today.

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 29, 2024
@kg
Copy link
Member Author

kg commented Oct 29, 2024

I'm not sure how to interpret this, but i don't see any mention of OSR. Unless the unmappable addresses are OSR code?

eh? unknown module ID 140707851616256
eh? unknown module ID 140707851616256
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventSourceAttribute.set_Name(class System.Object,pMT: 00007FF919A73218<class System.Object>)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Level(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Task(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,class System.Object,int*)
eh? reloading method [System.Private.CoreLib]dynamicClass.InvokeStub_EventAttribute.set_Opcode(class System.Object,pMT: 00007FF919A73218<class System.Object>)
Can't map address 7FF9199A4530 -- 3140000 counts
Can't map address 7FF9199A4548 -- 200000 counts
Can't map address 7FF9199A4550 -- 4210000 counts
Can't map address 7FF9199A455A -- 200000 counts
Can't map address 7FF9199A455D -- 200000 counts
Can't map address 7FF9199A456E -- 200000 counts
Samples for dotnet: 25553 events for Benchmark Intervals
Jitting           : 00.46% 8.8E+05  samples 1623 methods
  JitInterface    : 00.18% 3.5E+05  samples
Jit-generated code: 92.34% 1.76E+08 samples
  Jitted code     : 92.34% 1.76E+08 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 92.34% 1.76E+08 samples
  R2R code        : 00.00% 0        samples

04.73%   9.04E+06    ?        Unknown
66.23%   1.266E+08   Tier-1   [Benchmarks]StringLookup.FindExistingSIMD()
16.50%   3.153E+07   Tier-1   [System.Private.CoreLib]Marvin.ComputeHash32(unsigned int8&,unsigned int32,unsigned int32,unsigned int32)
05.83%   1.115E+07   Tier-1   [System.Private.CoreLib]OrdinalCaseSensitiveComparer.Equals(class System.String,class System.String)
02.84%   5.42E+06    Tier-1   [System.Private.CoreLib]OrdinalCaseSensitiveComparer.GetHashCode(class System.String)
01.12%   2.14E+06    native   IntcAudioBus.sys
00.93%   1.77E+06    Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)
00.74%   1.41E+06    native   ntoskrnl.exe
00.36%   6.8E+05     native   igdkmdn64.sys
00.28%   5.3E+05     native   clrjit.dll
00.10%   2E+05       native   coreclr.dll
00.06%   1.2E+05     native   ntdll.dll
00.06%   1.2E+05     native   Netwtw14.sys
00.06%   1.1E+05     native   dxgmms2.sys

Benchmark: found 31 intervals; mean interval 613.458ms

@AndyAyersMS
Copy link
Member

If things in BDN are working properly you won't see OSR methods during the measurement intervals, and your data above shows just that.

The unmappable addresses are most likely indirection stubs or VSD or something similar that the runtime doesn't describe via ETL.

So all the key methods being measured are Tier1. That is good. But the question is whether the earlier runs of methods during BDN's warmup are generating sufficiently good profile data for the JIT.

  • if they are, then the issue is somewhere in the JIT's handling of profile data, or the heuristics
  • if they aren't, then the issue is likely loss of instrumentation for some prejitted or OSR'd methods. We hope to fix this issue in .NET 10 (see Support profile instrumentation for inlinees #44372)

Sorting all this is a bit painful to answer because you now need to look across the multiple versions of potentially multiple methods to get the full story. I would probably start by collecting an SPMI trace of the benchmark on a release corerun with checked JIT and then generating a JIT dump of the key methods above based on that via SPMI replay (You can also generate the dump directly while running BDN but there's a chance this throws off the timing since all the dump logging is slow).

I will try and set some time aside soon to dig in.

FYI @dotnet/jit-contrib in case anyone else is up for taking a look.

@kg
Copy link
Member Author

kg commented Nov 4, 2024

FWIW, I did some more testing/measurements and I think I'm convinced that this particular LICM decision is not, on average, a bad one. It's just unfortunate that it results in a stack spill when without the LICM I could potentially do some sort of a dance where it gets pre-generated in a volatile register and then generated again after any method calls trample the contents of the volatile vector register. I may do some tests later to see if that's possible with the current state of the JIT.

I'm not sure how common this scenario of "it would be nice to keep this data around, but it's not necessary to do a stack spill, I can regenerate it more cheaply" is.

To try and re-capture things and summarize:
The dataflow in question is
key scalar arg -> hashcode scalar local -> 'suffix' scalar local -> 'search vector' vector local
so even if the search vector ends up in a volatile reg and is trampled, I could re-generate it from the suffix or from the hashcode.

At present I'm keeping the search vector around and discarding the suffix+hashcode, so the vector has to live in a nonvolatile vector register. If I were to retain the suffix in a scalar register (I tried this in the past because in some cases I have a scalar register free for it) and re-generate the search vector on every use, then I wouldn't need it to live in a nonvolatile vector register, and in some cases that's what the JIT would do (put it in a volatile one). But the LICM forces the search vector into a nonvolatile register and causes a stack spill.

Maybe I should re-phrase the issue title to something like "Loop Invariant Code Motion for vectors can force use of non-volatile registers for temporaries"? I'm not sure that's accurate though.

@AndyAyersMS
Copy link
Member

I'm not sure how common this scenario of "it would be nice to keep this data around, but it's not necessary to do a stack spill, I can regenerate it more cheaply" is.

This is what we call rematerialization, a long-standing wish for the allocator ... see #6264, #70182

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Nov 6, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Nov 6, 2024
@JulieLeeMSFT JulieLeeMSFT assigned AndyAyersMS and unassigned kg Nov 6, 2024
@JulieLeeMSFT
Copy link
Member

CC @jakobbotsch for loop invariant related issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

5 participants