Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Propagate LCL_ADDR nodes during local morph #102808

Merged
merged 11 commits into from
May 31, 2024

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented May 29, 2024

This changes local morph to run in RPO when optimizations are enabled. It adds infrastructure to track and propagate LCL_ADDR values assigned to locals during local morph. This allows us to avoid address exposure in cases where the destination local does not actually end up escaping in any way.

Example:

public struct Awaitable
{
    public int Opts;

    public Awaitable(bool value)
    {
        Opts = value ? 1 : 2;
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => new Awaitable(false).Opts;

Before:

G_M59043_IG01:  ;; offset=0x0000
       push     rax
						;; size=1 bbWeight=1 PerfScore 1.00

G_M59043_IG02:  ;; offset=0x0001
       xor      eax, eax
       mov      dword ptr [rsp], eax
       mov      dword ptr [rsp], 2
       mov      eax, dword ptr [rsp]
						;; size=15 bbWeight=1 PerfScore 3.25

G_M59043_IG03:  ;; offset=0x0010
       add      rsp, 8
       ret
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 21

After:

G_M59043_IG02:  ;; offset=0x0000
       mov      eax, 2
						;; size=5 bbWeight=1 PerfScore 0.25

G_M59043_IG03:  ;; offset=0x0005
       ret

Propagating the addresses works much like local assertion prop in morph does. Proving that the locals that were stored to do not escape afterwards is done with a simplistic approach: we check globally that no reads of the locals exists, and if so, we replace the LCL_ADDR stored to them by a constant 0. We leave it up to liveness to clean up the stores themselves.

If we were able to remove any LCL_ADDR in this way then we run an additional pass over the locals of the IR to compute the final set of exposed locals.

This could be more sophisticated, but in practice this handles the reported cases just fine.

Fix #87072
Fix #102273
Fix #102518

This is still not sufficient to handle #69254. To handle that we would need more support around tracking the values of struct fields, and handling of promoted fields. This PR currently does not handle promoted fields at all; we use lvHasLdAddrOp as a conservative approximation of address exposure on the destination locals, and promoted structs almost always have this set. If we were to handle promoted fields we would need some other way to determine that a destination holding a local address couldn't be exposed.

This changes local morph to run in RPO when optimizations are enabled.
It adds infrastructure to track and propagate LCL_ADDR values assigned
to locals (or struct fields) during local morph. This allows us to avoid
address exposure in cases where the destination local does not actually
end up escaping in any way.

Example:
```csharp
public struct Awaitable
{
    public int Opts;

    public Awaitable(bool value)
    {
        Opts = value ? 1 : 2;
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => new Awaitable(false).Opts;
```

Before:
```asm
G_M59043_IG01:  ;; offset=0x0000
       push     rax
						;; size=1 bbWeight=1 PerfScore 1.00

G_M59043_IG02:  ;; offset=0x0001
       xor      eax, eax
       mov      dword ptr [rsp], eax
       mov      dword ptr [rsp], 2
       mov      eax, dword ptr [rsp]
						;; size=15 bbWeight=1 PerfScore 3.25

G_M59043_IG03:  ;; offset=0x0010
       add      rsp, 8
       ret
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 21

```

After:
```asm
G_M59043_IG02:  ;; offset=0x0000
       mov      eax, 2
						;; size=5 bbWeight=1 PerfScore 0.25

G_M59043_IG03:  ;; offset=0x0005
       ret
```

Propagating the addresses works much like local assertion prop in morph
does. Proving that the locations that were stored to do not escape
afterwards is done with a simplistic approach: we check globally that no
reads of the location exists, and if so, we replace the `LCL_ADDR`
stored to them by a constant 0. We leave it up to liveness to clean up
the stores themselves.

This could be more sophisticated, but in practice this handles the
reported cases just fine.

If we were able to remove any `LCL_ADDR` in this way then we run an
additional pass over the locals of the IR to compute the final set of
exposed locals.

Fix dotnet#87072
Fix dotnet#102273
Fix dotnet#102518

This is still not sufficient to handle dotnet#69254. To handle that case we
need to handle transferring assertions for struct copies, and also
handle proving that specific struct fields containing local addresses do
not escape. It is probably doable, but for now I will leave it as future
work.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 29, 2024
@jakobbotsch
Copy link
Member Author

For obvious reasons we cannot use assertions about locals that themselves are exposed. For example:

int x = 1234;
int y = 5678;
int* px = &x;
int** ppx = &px;
*ppx = &y;
... *px ... // cannot be replaced with x

However, it's hard to detect this precisely, since the exposure could be happening in a block coming in from a backedge. It's not really sufficient to rely on the RPO to mark the LHS as "could have been exposed by now".

I'll have to switch to lvHasLdAddrOp. Hopefully that won't have much negative impact.
We could also alternatively do some dataflow, or build loops and look at the loops ahead of time.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch
Copy link
Member Author

Given the switch to lvHasLdAddrOp I decided to simplify things and just only focus on GT_LCL_VAR related to unpromoted locals. Practically all promoted structs end up with lvHasLdAddrOp set, and handling these or GT_LCL_FLD became no longer interesting. I also hit bugs around handling promoted locals correctly (getting the logic right around the parent locals or fields on writes/reads was non-trivial and costly for throughput).

Overall the simplification had no diffs, but it does mean that most of the logic that would allow us to resolve #69254 is now gone. I think to work towards something like that we would want to move old promotion to happen after local morph, or remove old promotion entirely.

@jakobbotsch
Copy link
Member Author

Current detailed throughput of benchmarks.run_pgo looks like:

Base: 113516714380, Diff: 113681788609, +0.1454%

159435077  : NA       : 33.11% : +0.1405% : public: void __cdecl LocalAddressVisitor::VisitBlock(struct BasicBlock *)                                                                                                                                                                                                                                               
22702317   : NA       : 4.71%  : +0.0200% : private: bool __cdecl Compiler::fgExposeUnpropagatedLocals(bool, class LocalEqualsLocalAddrAssertions *)                                                                                                                                                                                                                
19686816   : NA       : 4.09%  : +0.0173% : private: void __cdecl LocalAddressVisitor::HandleLocalAssertions(struct GenTreeLclVarCommon *, class LocalAddressVisitor::Value &)                                                                                                                                                                                      
15915961   : +2.80%   : 3.30%  : +0.0140% : public: enum Compiler::fgWalkResult __cdecl LocalAddressVisitor::PostOrderVisit(struct GenTree **, struct GenTree *)                                                                                                                                                                                                    
15758409   : NA       : 3.27%  : +0.0139% : private: void __cdecl LocalAddressVisitor::HandleLocalStoreAssertions(struct GenTreeLclVarCommon *, class LocalAddressVisitor::Value &)                                                                                                                                                                                 
10086932   : +12.12%  : 2.09%  : +0.0089% : private: void __cdecl SsaRenameState::Push(class SsaRenameState::Stack *, struct BasicBlock *, unsigned int)                                                                                                                                                                                                            
9465645    : +7.97%   : 1.97%  : +0.0083% : private: void __cdecl SsaBuilder::RenameVariables(void)                                                                                                                                                                                                                                                                 
7598598    : +0.31%   : 1.58%  : +0.0067% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                                                                                                                                                 
6707150    : +6.36%   : 1.39%  : +0.0059% : public: enum PhaseStatus __cdecl Compiler::fgValueNumber(void)                                                                                                                                                                                                                                                          
6453700    : +19.39%  : 1.34%  : +0.0057% : public: void __cdecl SsaRenameState::Push(struct BasicBlock *, unsigned int, unsigned int)                                                                                                                                                                                                                              
5635614    : +40.18%  : 1.17%  : +0.0050% : public: bool __cdecl Compiler::fgVarIsNeverZeroInitializedInProlog(unsigned int)                                                                                                                                                                                                                                        
4403012    : +0.45%   : 0.91%  : +0.0039% : memset                                                                                                                                                                                                                                                                                                                  
4387505    : +4.50%   : 0.91%  : +0.0039% : public: unsigned int __cdecl ValueNumStore::VNZeroForType(enum var_types)                                                                                                                                                                                                                                               
4356861    : +11.73%  : 0.90%  : +0.0038% : public: enum PhaseStatus __cdecl Compiler::optVNBasedDeadStoreRemoval(void)                                                                                                                                                                                                                                             
4339763    : NA       : 0.90%  : +0.0038% : private: void __cdecl LocalAddressVisitor::ProcessIndirection(struct GenTree **, class LocalAddressVisitor::Value &, struct GenTree *)                                                                                                                                                                                  
3719938    : NA       : 0.77%  : +0.0033% : public: __cdecl LocalEqualsLocalAddrAssertions::LocalEqualsLocalAddrAssertions(class Compiler *)                                                                                                                                                                                                                        
3087451    : +252.75% : 0.64%  : +0.0027% : private: unsigned int __cdecl ValueNumStore::VnForConst<unsigned __int64, class ValueNumStore::VNMap<unsigned __int64, struct JitLargePrimitiveKeyFuncs<unsigned __int64>>>(unsigned __int64, class ValueNumStore::VNMap<unsigned __int64, struct JitLargePrimitiveKeyFuncs<unsigned __int64>> *, enum var_types)       
2644605    : +7.16%   : 0.55%  : +0.0023% : private: void __cdecl SsaDefArray<class LclSsaVarDsc>::GrowArray(class CompAllocator)                                                                                                                                                                                                                                   
1091175    : +0.10%   : 0.23%  : +0.0010% : public: unsigned int __cdecl ValueNumStore::VNForFunc(enum var_types, enum VNFunc, unsigned int, unsigned int)                                                                                                                                                                                                          
1051327    : +259.48% : 0.22%  : +0.0009% : public: unsigned int __cdecl ValueNumStore::VNForByrefCon(unsigned __int64)                                                                                                                                                                                                                                             
1008619    : +14.73%  : 0.21%  : +0.0009% : private: void __cdecl LocalAddressVisitor::EscapeAddress(class LocalAddressVisitor::Value &, struct GenTree *)                                                                                                                                                                                                          
974021     : +1.68%   : 0.20%  : +0.0009% : public: enum Compiler::fgWalkResult __cdecl GenTreeVisitor<class LocalSequencer>::WalkTree(struct GenTree **, struct GenTree *)                                                                                                                                                                                         
973383     : +0.21%   : 0.20%  : +0.0009% : public: bool __cdecl ValueNumStore::VNMap<struct ValueNumStore::VNDefFuncApp<2>, struct ValueNumStore::VNDefFuncAppKeyFuncs<2>>::Set(struct ValueNumStore::VNDefFuncApp<2>, unsigned int)                                                                                                                               
875906     : +9.62%   : 0.18%  : +0.0008% : private: void __cdecl JitHashTable<__int64, struct JitLargePrimitiveKeyFuncs<__int64>, unsigned int, class CompAllocator, class JitHashTableBehavior>::Grow(void)                                                                                                                                                       
802452     : +0.32%   : 0.17%  : +0.0007% : public: unsigned int __cdecl ValueNumStore::VNForMapSelectWork(enum ValueNumKind, enum var_types, unsigned int, unsigned int, int *, bool *, class SmallValueNumSet &)                                                                                                                                                  
721893     : +0.50%   : 0.15%  : +0.0006% : public: void __cdecl LinearScan::resolveEdge(struct BasicBlock *, struct BasicBlock *, enum LinearScan::ResolveType, unsigned __int64 *const &, unsigned __int64)                                                                                                                                                       
706519     : +0.17%   : 0.15%  : +0.0006% : private: struct ValueNumStore::Chunk * __cdecl ValueNumStore::GetAllocChunk(enum var_types, enum ValueNumStore::ChunkExtraAttribs)                                                                                                                                                                                      
545372     : +1.87%   : 0.11%  : +0.0005% : public: __cdecl ValueNumStore::Chunk::Chunk(class CompAllocator, unsigned int *, enum var_types, enum ValueNumStore::ChunkExtraAttribs)                                                                                                                                                                                 
-485709    : -0.11%   : 0.10%  : -0.0004% : public: void __cdecl Compiler::fgComputeLifeLIR(unsigned __int64 *&, struct BasicBlock *, unsigned __int64 *const &)                                                                                                                                                                                                    
-490672    : -0.12%   : 0.10%  : -0.0004% : GenTreeVisitor<`Compiler::lvaMarkLocalVars'::`2'::MarkLocalVarsVisitor>::WalkTree                                                                                                                                                                                                                                       
-508593    : -0.10%   : 0.11%  : -0.0004% : public: struct GenTree * __cdecl Compiler::gtFoldExpr(struct GenTree *)                                                                                                                                                                                                                                                 
-528047    : -0.15%   : 0.11%  : -0.0005% : public: void __cdecl Compiler::fgComputeLife(unsigned __int64 *&, struct GenTree *, struct GenTree *, unsigned __int64 *const &, bool *)                                                                                                                                                                                
-542747    : -0.10%   : 0.11%  : -0.0005% : public: void __cdecl Compiler::fgPerNodeLocalVarLiveness(struct GenTree *)                                                                                                                                                                                                                                              
-563375    : -0.08%   : 0.12%  : -0.0005% : private: enum Compiler::fgWalkResult __cdecl Rationalizer::RewriteNode(struct GenTree **, class ArrayStack<struct GenTree *> &)                                                                                                                                                                                         
-568110    : -0.08%   : 0.12%  : -0.0005% : public: struct GenTree * __cdecl Compiler::optAssertionProp(unsigned __int64 *const &, struct GenTree *, struct Statement *, struct BasicBlock *)                                                                                                                                                                       
-593651    : -0.16%   : 0.12%  : -0.0005% : protected: void __cdecl Compiler::lvaMarkLclRefs(struct GenTree *, struct BasicBlock *, struct Statement *, bool)                                                                                                                                                                                                       
-606352    : -0.44%   : 0.13%  : -0.0005% : public: void __cdecl Compiler::fgValueNumberStore(struct GenTree *)                                                                                                                                                                                                                                                     
-651530    : -0.03%   : 0.14%  : -0.0006% : public: void __cdecl LinearScan::allocateRegisters(void)                                                                                                                                                                                                                                                                
-654030    : -0.08%   : 0.14%  : -0.0006% : GenTreeVisitor<`Compiler::fgSetTreeSeq'::`2'::SetTreeSeqVisitor>::WalkTree                                                                                                                                                                                                                                              
-663374    : -0.95%   : 0.14%  : -0.0006% : private: unsigned int __cdecl SsaBuilder::RenamePushDef(struct GenTree *, struct BasicBlock *, unsigned int, bool)                                                                                                                                                                                                      
-663801    : -0.19%   : 0.14%  : -0.0006% : public: struct GenTree * __cdecl Compiler::optVNBasedFoldConstExpr(struct BasicBlock *, struct GenTree *, struct GenTree *)                                                                                                                                                                                             
-678870    : -0.12%   : 0.14%  : -0.0006% : protected: static enum Compiler::fgWalkResult __cdecl Compiler::optVNAssertionPropCurStmtVisitor(struct GenTree **, struct Compiler::fgWalkData *)                                                                                                                                                                      
-701651    : -0.11%   : 0.15%  : -0.0006% : private: void __cdecl Compiler::fgMarkUseDef(struct GenTreeLclVarCommon *)                                                                                                                                                                                                                                              
-726730    : -0.23%   : 0.15%  : -0.0006% : public: void __cdecl LclVarDsc::incRefCnts(double, class Compiler *, enum RefCountState, bool)                                                                                                                                                                                                                          
-729229    : -0.10%   : 0.15%  : -0.0006% : public: void __cdecl Compiler::optAssertionGen(struct GenTree *)                                                                                                                                                                                                                                                        
-837951    : -0.11%   : 0.17%  : -0.0007% : public: void __cdecl Compiler::lvaComputeRefCounts(bool, bool)                                                                                                                                                                                                                                                          
-838520    : -0.11%   : 0.17%  : -0.0007% : public: struct GenTree * __cdecl Compiler::optAssertionProp_LclVar(unsigned __int64 *const &, struct GenTreeLclVarCommon *, struct Statement *)                                                                                                                                                                         
-861274    : -0.12%   : 0.18%  : -0.0008% : public: void __cdecl Compiler::fgValueNumberTree(struct GenTree *)                                                                                                                                                                                                                                                      
-953749    : -0.35%   : 0.20%  : -0.0008% : public: virtual enum PhaseStatus __cdecl LinearScan::doLinearScan(void)                                                                                                                                                                                                                                                 
-1062947   : -0.47%   : 0.22%  : -0.0009% : jitstd::`anonymous namespace'::insertion_sort<unsigned int *,LclVarDsc_BlendedCode_Less>                                                                                                                                                                                                                                
-1112593   : -0.11%   : 0.23%  : -0.0010% : public: struct GenTree * __cdecl Compiler::fgMorphTree(struct GenTree *, struct Compiler::MorphAddrContext *)                                                                                                                                                                                                           
-1665862   : -0.10%   : 0.35%  : -0.0015% : public: unsigned int __cdecl Compiler::gtSetEvalOrder(struct GenTree *)                                                                                                                                                                                                                                                 
-1818435   : -0.12%   : 0.38%  : -0.0016% : private: struct GenTree * __cdecl Compiler::fgMorphSmpOp(struct GenTree *, struct Compiler::MorphAddrContext *, bool *)                                                                                                                                                                                                 
-2073391   : -0.53%   : 0.43%  : -0.0018% : jitstd::`anonymous namespace'::quick_sort<unsigned int *,LclVarDsc_BlendedCode_Less>                                                                                                                                                                                                                                    
-6607294   : -0.67%   : 1.37%  : -0.0058% : public: bool __cdecl Compiler::optCopyProp(struct BasicBlock *, struct Statement *, struct GenTreeLclVarCommon *, unsigned int, class JitHashTable<unsigned int, struct JitSmallPrimitiveKeyFuncs<unsigned int>, class ArrayStack<class Compiler::CopyPropSsaDef> *, class CompAllocator, class JitHashTableBehavior> *)
-102865416 : -85.74%  : 21.36% : -0.0906% : private: enum PhaseStatus __cdecl Compiler::fgMarkAddressExposedLocals(void)                                                                                                                                                                                                                                            

Nothing obvious to optimize pops out at me here, so this is probably as good as we're going to get.

@jakobbotsch
Copy link
Member Author

@EgorBot

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;

public class MyBench
{
    [Benchmark]
    public Matrix4x4 AddBenchmark() => Matrix4x4.Add(Matrix4x4.Identity, Matrix4x4.Identity);
}

@EgorBot
Copy link

EgorBot commented May 30, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-BPATGK : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-VTVYIS : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Toolchain Mean Error Ratio
AddBenchmark Main 2.2304 ns 0.0006 ns 1.00
AddBenchmark PR 0.4717 ns 0.0002 ns 0.21

BDN_Artifacts.zip

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Comment on lines 1200 to 1202
if (varDsc->lvIsParam || m_pCompiler->info.compInitMem || varDsc->lvMustInit ||
(varTypeIsGC(varDsc) && !varDsc->lvHasExplicitInit) ||
VarSetOps::IsMember(m_pCompiler, m_pCompiler->fgFirstBB->bbLiveIn, varDsc->lvVarIndex))
Copy link
Member Author

@jakobbotsch jakobbotsch May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hitting an assertion in SSA where optRemoveRedundantZeroInits was removing a zero init based on the knowledge that it will be inited in the prolog, while the logic in SSA and VN disagreed about this fact (and thus didn't add any initial SSA number/VN). The problem seems to be that this condition of fgVarNeedsExplicitZeroInit is not reflected by SSA or VN in any way:

if ((varDsc->lvType == TYP_STRUCT) && varDsc->HasGCPtr())
{
ClassLayout* layout = varDsc->GetLayout();
if (layout->GetSlotCount() == layout->GetGCPtrCount())
{
return false;
}
// Below conditions guarantee block initialization, which will initialize
// all struct fields. If the logic for block initialization in CodeGen::genCheckUseBlockInit()
// changes, these conditions need to be updated.
#ifdef TARGET_64BIT
#if defined(TARGET_AMD64)
// We can clear using aligned SIMD so the threshold is lower,
// and clears in order which is better for auto-prefetching
if (roundUp(varDsc->lvSize(), TARGET_POINTER_SIZE) / sizeof(int) > 4)
#else // !defined(TARGET_AMD64)
if (roundUp(varDsc->lvSize(), TARGET_POINTER_SIZE) / sizeof(int) > 8)
#endif
#else
if (roundUp(varDsc->lvSize(), TARGET_POINTER_SIZE) / sizeof(int) > 4)
#endif
{
return false;
}
}

The zero init that optRemoveRedundantZeroInits removes looks like:

*************** In optRemoveRedundantZeroInits()

removing useless STMT00018 ( INL04 @ 0x000[E-] ... ??? ) <- INL03 @ ??? <- INLRT @ 0x013[E-]
N002 (  5,  6) [000067] DA---------                           STORE_LCL_FLD ref   (P) V03 tmp3         [+0]
                                                                ref    field V03._source (fldOffset=0x0) -> V12 tmp12        
N001 (  1,  1) [000057] -----------                         └──▌  CNS_INT   ref    null
 from BB01

However, SSA/VN do not add an initial value for V12 and we hit an SSA assert on a subsequent use of V12.

The fix here uses the knowledge that all TYP_REF locals without explicit inits are guaranteed to be initialized in the prolog, so they have an initial value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, there is also a work item here to ensure that local morph folds struct fields in post order as well to avoid DNER in a case like the above. The store ought to be folded to be to V12 directly. I will look into that as a follow-up.

@jakobbotsch jakobbotsch marked this pull request as ready for review May 30, 2024 19:17
@jakobbotsch
Copy link
Member Author

The libraries-jitstress failures look to be known ones, or failures in MinOpts that this PR doesn't affect. I'll double check the Fuzzlyn one tomorrow, but it does not look related either (seems to be some intermittent crash).

cc @dotnet/jit-contrib PTAL @AndyAyersMS

Diffs. Between 0.15% to 0.3% TP cost, mostly tending towards the lower side. Some slight MinOpts cost from the checks around whether we should be maintaining assertions -- seems to be more impactful on Linux for some reason. On Windows the MinOpts cost looks small. I could get rid of this with some templates if we think that's beneficial. Given the small Windows cost I guess it's not really intrinsic but rather a choice in codegen strategy from Clang.

Code size regressions generally seem to come from cases where we end up with some more initialization in the prolog, or where we miss containment opportunities due to locals no longer being always on stack. Perf score seems to heavily favor the improvements (the regressions are rare and when they happen are smaller than the corresponding improvements).

There's follow-up work here to avoid DNER in a bunch of cases; currently if we propagate the address of a struct field we won't always fold it directly to access the struct field, resulting in DNER. Not sure how prevalent the problem is, but I noticed it in some cases when looking into some of the improvements/regressions.

@jakobbotsch jakobbotsch requested a review from AndyAyersMS May 30, 2024 19:25
Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

{
if (m_assertions.Height() >= 64)
{
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any data on how often we drop assertions here? Might be worth noting this in the dump or something similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these contexts:

[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\coreclr_tests.run.windows.x64.checked.mch # 66062 : IL size 617
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\coreclr_tests.run.windows.x64.checked.mch # 66094 : IL size 617
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\coreclr_tests.run.windows.x64.checked.mch # 30938 : IL size 1066
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\benchmarks.run.windows.x64.checked.mch # 9469 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests.run.windows.x64.Release.mch # 396680 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests.run.windows.x64.Release.mch # 399807 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168705 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168329 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168914 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 167285 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168991 : IL size 2145
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\benchmarks.run.windows.x64.checked.mch # 7002 : IL size 2445
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 167037 : IL size 2445
[10:52:21]   C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 169019 : IL size 2853

I'll add some JITDUMP. Also, I realized this code ends up adding to the map even if we ran out of assertion space. I'll fix that as well.

@hez2010
Copy link
Contributor

hez2010 commented May 31, 2024

Can you check if this enables promotion of fields of stack-allocated objects in #11192?

@jakobbotsch
Copy link
Member Author

Can you check if this enables promotion of fields of stack-allocated objects in #11192?

Looks like it does, at least with the example there.

    public int PointClassTest()
    {
        var p1 = new PointClass(4, 5);
        var p2 = new PointClass(3, 7);
        var result = AddClass(p1, p2);
        return result.X + result.Y;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    PointClass AddClass(PointClass x, PointClass y)
    {
        return new PointClass(x.X + y.X, x.Y + y.Y);
    }

    record class PointClass(int X, int Y);

Before:

; Assembly listing for method C:PointClassTest():int:this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 13 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def <C>
;* V01 loc0         [V01,T04] (  0,  0   )    long  ->  zero-ref    class-hnd exact <C+PointClass>
;* V02 loc1         [V02,T05] (  0,  0   )    long  ->  zero-ref    class-hnd exact <C+PointClass>
;* V03 loc2         [V03,T06] (  0,  0   )    long  ->  zero-ref    class-hnd exact <C+PointClass>
;# V04 OutArgs      [V04    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1         [V05    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V07 tmp3         [V07    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V08 tmp4         [V08    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;  V09 tmp5         [V09,T01] (  2,  4   )     int  ->  rcx         "impAppendStmt"
;  V10 tmp6         [V10,T02] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;* V11 tmp7         [V11    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <C+PointClass>
;  V12 tmp8         [V12,T03] (  2,  4   )     int  ->  rdx         "Inlining Arg"
;  V13 tmp9         [V13    ] (  6,  6   )  struct (16) [rsp+0x28]  do-not-enreg[XSF] addr-exposed "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;  V14 tmp10        [V14    ] (  6,  6   )  struct (16) [rsp+0x18]  do-not-enreg[XSF] addr-exposed "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;  V15 tmp11        [V15    ] (  6,  6   )  struct (16) [rsp+0x08]  do-not-enreg[XSF] addr-exposed "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;  V16 cse0         [V16,T00] (  4,  4   )    long  ->  rax         "CSE #01: aggressive"
;
; Lcl frame size = 56

G_M39843_IG01:  ;; offset=0x0000
       sub      rsp, 56
						;; size=4 bbWeight=1 PerfScore 0.25
G_M39843_IG02:  ;; offset=0x0004
       vxorps   xmm0, xmm0, xmm0
       vmovups  xmmword ptr [rsp+0x28], xmm0
       mov      rax, 0x7FFB59900F30      ; C+PointClass
       mov      qword ptr [rsp+0x28], rax
       mov      dword ptr [rsp+0x30], 4
       mov      dword ptr [rsp+0x34], 5
       vmovups  xmmword ptr [rsp+0x18], xmm0
       mov      qword ptr [rsp+0x18], rax
       mov      dword ptr [rsp+0x20], 3
       mov      dword ptr [rsp+0x24], 7
       mov      ecx, dword ptr [rsp+0x30]
       add      ecx, dword ptr [rsp+0x20]
       mov      edx, dword ptr [rsp+0x34]
       vmovups  xmmword ptr [rsp+0x08], xmm0
       mov      qword ptr [rsp+0x08], rax
       add      edx, dword ptr [rsp+0x24]
       mov      dword ptr [rsp+0x10], ecx
       mov      dword ptr [rsp+0x14], edx
       mov      eax, dword ptr [rsp+0x10]
       add      eax, dword ptr [rsp+0x14]
						;; size=111 bbWeight=1 PerfScore 21.58
G_M39843_IG03:  ;; offset=0x0073
       add      rsp, 56
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 120, prolog size 4, PerfScore 23.08, instruction count 23, allocated bytes for code 120 (MethodHash=bb85645c) for method C:PointClassTest():int:this (FullOpts)
; ============================================================

This PR with DOTNET_JitObjectStackAllocation=1:

; Assembly listing for method C:PointClassTest():int:this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 13 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def <C>
;* V01 loc0         [V01    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <C+PointClass>
;* V02 loc1         [V02    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <C+PointClass>
;* V03 loc2         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <C+PointClass>
;# V04 OutArgs      [V04    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1         [V05    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V07 tmp3         [V07    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V08 tmp4         [V08    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V09 tmp5         [V09    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V10 tmp6         [V10    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V11 tmp7         [V11    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V12 tmp8         [V12    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
;* V13 tmp9         [V13    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;* V14 tmp10        [V14    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;* V15 tmp11        [V15    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;* V16 tmp12        [V16    ] (  0,  0   )    long  ->  zero-ref    "V13.[000..008)"
;* V17 tmp13        [V17    ] (  0,  0   )     int  ->  zero-ref    "V13.[008..012)"
;* V18 tmp14        [V18    ] (  0,  0   )     int  ->  zero-ref    "V13.[012..016)"
;* V19 tmp15        [V19    ] (  0,  0   )    long  ->  zero-ref    "V14.[000..008)"
;* V20 tmp16        [V20    ] (  0,  0   )     int  ->  zero-ref    "V14.[008..012)"
;* V21 tmp17        [V21    ] (  0,  0   )     int  ->  zero-ref    "V14.[012..016)"
;* V22 tmp18        [V22    ] (  0,  0   )    long  ->  zero-ref    "V15.[000..008)"
;* V23 tmp19        [V23    ] (  0,  0   )     int  ->  zero-ref    "V15.[008..012)"
;* V24 tmp20        [V24    ] (  0,  0   )     int  ->  zero-ref    "V15.[012..016)"
;
; Lcl frame size = 0

G_M39843_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M39843_IG02:  ;; offset=0x0000
       mov      eax, 19
						;; size=5 bbWeight=1 PerfScore 0.25
G_M39843_IG03:  ;; offset=0x0005
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 6, prolog size 0, PerfScore 1.25, instruction count 2, allocated bytes for code 6 (MethodHash=bb85645c) for method C:PointClassTest():int:this (FullOpts)
; ============================================================

@hez2010
Copy link
Contributor

hez2010 commented May 31, 2024

Looks like it does, at least with the example there.

Great to hear that!
Then I think we can reevaluate #11192 after this PR, now that stack-allocated objects can be promoted.

@jakobbotsch
Copy link
Member Author

Then I think we can reevaluate #11192 after this PR, now that stack-allocated objects can be promoted.

Part of the problem with stack allocation of objects is figuring out how to avoid making all write barriers more costly. If we were to enable object stack allocation today it would turn write barriers that are unchecked into checked ones. More work to avoid this is needed.

@jakobbotsch jakobbotsch merged commit e867965 into dotnet:main May 31, 2024
111 of 113 checks passed
@jakobbotsch jakobbotsch deleted the propagate-local-addrs branch May 31, 2024 19:06
@LoopedBard3
Copy link
Member

Potential regressions:
Windows x64: dotnet/perf-autofiling-issues#35582

@github-actions github-actions bot locked and limited conversation to collaborators Jul 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
5 participants