JIT: Propagate LCL_ADDR nodes during local morph #102808
Conversation
This changes local morph to run in RPO when optimizations are enabled. It adds infrastructure to track and propagate LCL_ADDR values assigned to locals (or struct fields) during local morph. This allows us to avoid address exposure in cases where the destination local does not actually end up escaping in any way.

Example:

```csharp
public struct Awaitable
{
    public int Opts;

    public Awaitable(bool value)
    {
        Opts = value ? 1 : 2;
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => new Awaitable(false).Opts;
```

Before:

```asm
G_M59043_IG01:        ;; offset=0x0000
       push     rax
                      ;; size=1 bbWeight=1 PerfScore 1.00
G_M59043_IG02:        ;; offset=0x0001
       xor      eax, eax
       mov      dword ptr [rsp], eax
       mov      dword ptr [rsp], 2
       mov      eax, dword ptr [rsp]
                      ;; size=15 bbWeight=1 PerfScore 3.25
G_M59043_IG03:        ;; offset=0x0010
       add      rsp, 8
       ret
                      ;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 21
```

After:

```asm
G_M59043_IG02:        ;; offset=0x0000
       mov      eax, 2
                      ;; size=5 bbWeight=1 PerfScore 0.25
G_M59043_IG03:        ;; offset=0x0005
       ret
```

Propagating the addresses works much like local assertion prop in morph does. Proving that the locations that were stored to do not escape afterwards is done with a simplistic approach: we check globally that no reads of the location exist, and if so, we replace the `LCL_ADDR` stored to them by a constant 0. We leave it up to liveness to clean up the stores themselves. This could be more sophisticated, but in practice this handles the reported cases just fine.

If we were able to remove any `LCL_ADDR` in this way, we then run an additional pass over the locals of the IR to compute the final set of exposed locals.

Fix dotnet#87072
Fix dotnet#102273
Fix dotnet#102518

This is still not sufficient to handle dotnet#69254. To handle that case we need to handle transferring assertions for struct copies, and also handle proving that specific struct fields containing local addresses do not escape. It is probably doable, but for now I will leave it as future work.
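As an additional illustration of the pattern this enables (a hypothetical sketch of my own, not an example from the PR; it requires an unsafe compilation context), a local whose address is only stored to another local that never escapes no longer needs to be address exposed:

```csharp
static unsafe int ReadThroughLocalPointer()
{
    int x = 42;
    int* p = &x;   // a LCL_ADDR of x is stored to the local p
    return *p;     // p never escapes, so the read can become a direct read of x
}
```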
For obvious reasons we cannot use assertions about locals that themselves are exposed. For example:

```
int x = 1234;
int y = 5678;
int* px = &x;
int** ppx = &px;
*ppx = &y;
... *px ... // cannot be replaced with x
```

However, it's hard to detect this precisely, since the exposure could be happening in a block coming in from a backedge. It's not really sufficient to rely on the RPO to mark the LHS as "could have been exposed by now". I'll have to switch to
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Azure Pipelines successfully started running 3 pipeline(s).
Given the switch to the new approach, overall the simplification had no diffs, but it does mean that most of the logic that would allow us to resolve #69254 is now gone. I think to work towards something like that we would want to move old promotion to happen after local morph, or remove old promotion entirely.
Current detailed throughput (Base: 113516714380, Diff: 113681788609, +0.1454%):
159435077 : NA : 33.11% : +0.1405% : public: void __cdecl LocalAddressVisitor::VisitBlock(struct BasicBlock *)
22702317 : NA : 4.71% : +0.0200% : private: bool __cdecl Compiler::fgExposeUnpropagatedLocals(bool, class LocalEqualsLocalAddrAssertions *)
19686816 : NA : 4.09% : +0.0173% : private: void __cdecl LocalAddressVisitor::HandleLocalAssertions(struct GenTreeLclVarCommon *, class LocalAddressVisitor::Value &)
15915961 : +2.80% : 3.30% : +0.0140% : public: enum Compiler::fgWalkResult __cdecl LocalAddressVisitor::PostOrderVisit(struct GenTree **, struct GenTree *)
15758409 : NA : 3.27% : +0.0139% : private: void __cdecl LocalAddressVisitor::HandleLocalStoreAssertions(struct GenTreeLclVarCommon *, class LocalAddressVisitor::Value &)
10086932 : +12.12% : 2.09% : +0.0089% : private: void __cdecl SsaRenameState::Push(class SsaRenameState::Stack *, struct BasicBlock *, unsigned int)
9465645 : +7.97% : 1.97% : +0.0083% : private: void __cdecl SsaBuilder::RenameVariables(void)
7598598 : +0.31% : 1.58% : +0.0067% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
6707150 : +6.36% : 1.39% : +0.0059% : public: enum PhaseStatus __cdecl Compiler::fgValueNumber(void)
6453700 : +19.39% : 1.34% : +0.0057% : public: void __cdecl SsaRenameState::Push(struct BasicBlock *, unsigned int, unsigned int)
5635614 : +40.18% : 1.17% : +0.0050% : public: bool __cdecl Compiler::fgVarIsNeverZeroInitializedInProlog(unsigned int)
4403012 : +0.45% : 0.91% : +0.0039% : memset
4387505 : +4.50% : 0.91% : +0.0039% : public: unsigned int __cdecl ValueNumStore::VNZeroForType(enum var_types)
4356861 : +11.73% : 0.90% : +0.0038% : public: enum PhaseStatus __cdecl Compiler::optVNBasedDeadStoreRemoval(void)
4339763 : NA : 0.90% : +0.0038% : private: void __cdecl LocalAddressVisitor::ProcessIndirection(struct GenTree **, class LocalAddressVisitor::Value &, struct GenTree *)
3719938 : NA : 0.77% : +0.0033% : public: __cdecl LocalEqualsLocalAddrAssertions::LocalEqualsLocalAddrAssertions(class Compiler *)
3087451 : +252.75% : 0.64% : +0.0027% : private: unsigned int __cdecl ValueNumStore::VnForConst<unsigned __int64, class ValueNumStore::VNMap<unsigned __int64, struct JitLargePrimitiveKeyFuncs<unsigned __int64>>>(unsigned __int64, class ValueNumStore::VNMap<unsigned __int64, struct JitLargePrimitiveKeyFuncs<unsigned __int64>> *, enum var_types)
2644605 : +7.16% : 0.55% : +0.0023% : private: void __cdecl SsaDefArray<class LclSsaVarDsc>::GrowArray(class CompAllocator)
1091175 : +0.10% : 0.23% : +0.0010% : public: unsigned int __cdecl ValueNumStore::VNForFunc(enum var_types, enum VNFunc, unsigned int, unsigned int)
1051327 : +259.48% : 0.22% : +0.0009% : public: unsigned int __cdecl ValueNumStore::VNForByrefCon(unsigned __int64)
1008619 : +14.73% : 0.21% : +0.0009% : private: void __cdecl LocalAddressVisitor::EscapeAddress(class LocalAddressVisitor::Value &, struct GenTree *)
974021 : +1.68% : 0.20% : +0.0009% : public: enum Compiler::fgWalkResult __cdecl GenTreeVisitor<class LocalSequencer>::WalkTree(struct GenTree **, struct GenTree *)
973383 : +0.21% : 0.20% : +0.0009% : public: bool __cdecl ValueNumStore::VNMap<struct ValueNumStore::VNDefFuncApp<2>, struct ValueNumStore::VNDefFuncAppKeyFuncs<2>>::Set(struct ValueNumStore::VNDefFuncApp<2>, unsigned int)
875906 : +9.62% : 0.18% : +0.0008% : private: void __cdecl JitHashTable<__int64, struct JitLargePrimitiveKeyFuncs<__int64>, unsigned int, class CompAllocator, class JitHashTableBehavior>::Grow(void)
802452 : +0.32% : 0.17% : +0.0007% : public: unsigned int __cdecl ValueNumStore::VNForMapSelectWork(enum ValueNumKind, enum var_types, unsigned int, unsigned int, int *, bool *, class SmallValueNumSet &)
721893 : +0.50% : 0.15% : +0.0006% : public: void __cdecl LinearScan::resolveEdge(struct BasicBlock *, struct BasicBlock *, enum LinearScan::ResolveType, unsigned __int64 *const &, unsigned __int64)
706519 : +0.17% : 0.15% : +0.0006% : private: struct ValueNumStore::Chunk * __cdecl ValueNumStore::GetAllocChunk(enum var_types, enum ValueNumStore::ChunkExtraAttribs)
545372 : +1.87% : 0.11% : +0.0005% : public: __cdecl ValueNumStore::Chunk::Chunk(class CompAllocator, unsigned int *, enum var_types, enum ValueNumStore::ChunkExtraAttribs)
-485709 : -0.11% : 0.10% : -0.0004% : public: void __cdecl Compiler::fgComputeLifeLIR(unsigned __int64 *&, struct BasicBlock *, unsigned __int64 *const &)
-490672 : -0.12% : 0.10% : -0.0004% : GenTreeVisitor<`Compiler::lvaMarkLocalVars'::`2'::MarkLocalVarsVisitor>::WalkTree
-508593 : -0.10% : 0.11% : -0.0004% : public: struct GenTree * __cdecl Compiler::gtFoldExpr(struct GenTree *)
-528047 : -0.15% : 0.11% : -0.0005% : public: void __cdecl Compiler::fgComputeLife(unsigned __int64 *&, struct GenTree *, struct GenTree *, unsigned __int64 *const &, bool *)
-542747 : -0.10% : 0.11% : -0.0005% : public: void __cdecl Compiler::fgPerNodeLocalVarLiveness(struct GenTree *)
-563375 : -0.08% : 0.12% : -0.0005% : private: enum Compiler::fgWalkResult __cdecl Rationalizer::RewriteNode(struct GenTree **, class ArrayStack<struct GenTree *> &)
-568110 : -0.08% : 0.12% : -0.0005% : public: struct GenTree * __cdecl Compiler::optAssertionProp(unsigned __int64 *const &, struct GenTree *, struct Statement *, struct BasicBlock *)
-593651 : -0.16% : 0.12% : -0.0005% : protected: void __cdecl Compiler::lvaMarkLclRefs(struct GenTree *, struct BasicBlock *, struct Statement *, bool)
-606352 : -0.44% : 0.13% : -0.0005% : public: void __cdecl Compiler::fgValueNumberStore(struct GenTree *)
-651530 : -0.03% : 0.14% : -0.0006% : public: void __cdecl LinearScan::allocateRegisters(void)
-654030 : -0.08% : 0.14% : -0.0006% : GenTreeVisitor<`Compiler::fgSetTreeSeq'::`2'::SetTreeSeqVisitor>::WalkTree
-663374 : -0.95% : 0.14% : -0.0006% : private: unsigned int __cdecl SsaBuilder::RenamePushDef(struct GenTree *, struct BasicBlock *, unsigned int, bool)
-663801 : -0.19% : 0.14% : -0.0006% : public: struct GenTree * __cdecl Compiler::optVNBasedFoldConstExpr(struct BasicBlock *, struct GenTree *, struct GenTree *)
-678870 : -0.12% : 0.14% : -0.0006% : protected: static enum Compiler::fgWalkResult __cdecl Compiler::optVNAssertionPropCurStmtVisitor(struct GenTree **, struct Compiler::fgWalkData *)
-701651 : -0.11% : 0.15% : -0.0006% : private: void __cdecl Compiler::fgMarkUseDef(struct GenTreeLclVarCommon *)
-726730 : -0.23% : 0.15% : -0.0006% : public: void __cdecl LclVarDsc::incRefCnts(double, class Compiler *, enum RefCountState, bool)
-729229 : -0.10% : 0.15% : -0.0006% : public: void __cdecl Compiler::optAssertionGen(struct GenTree *)
-837951 : -0.11% : 0.17% : -0.0007% : public: void __cdecl Compiler::lvaComputeRefCounts(bool, bool)
-838520 : -0.11% : 0.17% : -0.0007% : public: struct GenTree * __cdecl Compiler::optAssertionProp_LclVar(unsigned __int64 *const &, struct GenTreeLclVarCommon *, struct Statement *)
-861274 : -0.12% : 0.18% : -0.0008% : public: void __cdecl Compiler::fgValueNumberTree(struct GenTree *)
-953749 : -0.35% : 0.20% : -0.0008% : public: virtual enum PhaseStatus __cdecl LinearScan::doLinearScan(void)
-1062947 : -0.47% : 0.22% : -0.0009% : jitstd::`anonymous namespace'::insertion_sort<unsigned int *,LclVarDsc_BlendedCode_Less>
-1112593 : -0.11% : 0.23% : -0.0010% : public: struct GenTree * __cdecl Compiler::fgMorphTree(struct GenTree *, struct Compiler::MorphAddrContext *)
-1665862 : -0.10% : 0.35% : -0.0015% : public: unsigned int __cdecl Compiler::gtSetEvalOrder(struct GenTree *)
-1818435 : -0.12% : 0.38% : -0.0016% : private: struct GenTree * __cdecl Compiler::fgMorphSmpOp(struct GenTree *, struct Compiler::MorphAddrContext *, bool *)
-2073391 : -0.53% : 0.43% : -0.0018% : jitstd::`anonymous namespace'::quick_sort<unsigned int *,LclVarDsc_BlendedCode_Less>
-6607294 : -0.67% : 1.37% : -0.0058% : public: bool __cdecl Compiler::optCopyProp(struct BasicBlock *, struct Statement *, struct GenTreeLclVarCommon *, unsigned int, class JitHashTable<unsigned int, struct JitSmallPrimitiveKeyFuncs<unsigned int>, class ArrayStack<class Compiler::CopyPropSsaDef> *, class CompAllocator, class JitHashTableBehavior> *)
-102865416 : -85.74% : 21.36% : -0.0906% : private: enum PhaseStatus __cdecl Compiler::fgMarkAddressExposedLocals(void)

Nothing obvious to optimize pops out at me here, so this is probably as good as we're going to get.
```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;

public class MyBench
{
    [Benchmark]
    public Matrix4x4 AddBenchmark() => Matrix4x4.Add(Matrix4x4.Identity, Matrix4x4.Identity);
}
```
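For completeness, a minimal entry point to run this benchmark (standard BenchmarkDotNet usage; this `Program` class is my addition and relies on the using directives shown above, it is not part of the original snippet):

```csharp
public class Program
{
    // BenchmarkRunner.Run discovers the [Benchmark] methods on MyBench and runs them.
    public static void Main(string[] args) => BenchmarkRunner.Run<MyBench>();
}
```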
Benchmark results on Intel
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Azure Pipelines successfully started running 3 pipeline(s).
```cpp
if (varDsc->lvIsParam || m_pCompiler->info.compInitMem || varDsc->lvMustInit ||
    (varTypeIsGC(varDsc) && !varDsc->lvHasExplicitInit) ||
    VarSetOps::IsMember(m_pCompiler, m_pCompiler->fgFirstBB->bbLiveIn, varDsc->lvVarIndex))
```
I was hitting an assertion in SSA where `optRemoveRedundantZeroInits` was removing a zero init based on the knowledge that it will be inited in the prolog, while the logic in SSA and VN disagreed about this fact (and thus didn't add any initial SSA number/VN). The problem seems to be that this condition of `fgVarNeedsExplicitZeroInit` is not reflected by SSA or VN in any way:
runtime/src/coreclr/jit/compiler.hpp, lines 4201 to 4226 (at 21f356a):
```cpp
if ((varDsc->lvType == TYP_STRUCT) && varDsc->HasGCPtr())
{
    ClassLayout* layout = varDsc->GetLayout();
    if (layout->GetSlotCount() == layout->GetGCPtrCount())
    {
        return false;
    }

    // Below conditions guarantee block initialization, which will initialize
    // all struct fields. If the logic for block initialization in CodeGen::genCheckUseBlockInit()
    // changes, these conditions need to be updated.
#ifdef TARGET_64BIT
#if defined(TARGET_AMD64)
    // We can clear using aligned SIMD so the threshold is lower,
    // and clears in order which is better for auto-prefetching
    if (roundUp(varDsc->lvSize(), TARGET_POINTER_SIZE) / sizeof(int) > 4)
#else  // !defined(TARGET_AMD64)
    if (roundUp(varDsc->lvSize(), TARGET_POINTER_SIZE) / sizeof(int) > 8)
#endif
#else
    if (roundUp(varDsc->lvSize(), TARGET_POINTER_SIZE) / sizeof(int) > 4)
#endif
    {
        return false;
    }
}
```
The zero init that `optRemoveRedundantZeroInits` removes looks like:

```
*************** In optRemoveRedundantZeroInits()
removing useless STMT00018 ( INL04 @ 0x000[E-] ... ??? ) <- INL03 @ ??? <- INLRT @ 0x013[E-]
N002 ( 5, 6) [000067] DA---------  ▌  STORE_LCL_FLD ref (P) V03 tmp3 [+0]
                                   ▌    ref field V03._source (fldOffset=0x0) -> V12 tmp12
N001 ( 1, 1) [000057] -----------  └──▌  CNS_INT ref null
from BB01
```
However, SSA/VN do not add an initial value for `V12`, and we hit an SSA assert on a subsequent use of `V12`.

The fix here uses the knowledge that all `TYP_REF` locals without explicit inits are guaranteed to be initialized in the prolog, so they have an initial value.
BTW, there is also a work item here to ensure that local morph folds struct fields in post order as well, to avoid DNER in a case like the above. The store ought to be folded into a store to `V12` directly. I will look into that as a follow-up.
The libraries-jitstress failures look to be known ones, or failures in MinOpts that this PR doesn't affect. I'll double check the Fuzzlyn one tomorrow, but it does not look related either (seems to be some intermittent crash).

cc @dotnet/jit-contrib

PTAL @AndyAyersMS

Diffs. Between 0.15% and 0.3% TP cost, mostly tending towards the lower side.

Some slight MinOpts cost from the checks around whether we should be maintaining assertions -- it seems to be more impactful on Linux for some reason. On Windows the MinOpts cost looks small. I could get rid of this with some templates if we think that's beneficial. Given the small Windows cost I guess it's not really intrinsic, but rather a choice in codegen strategy from Clang.

Code size regressions generally seem to come from cases where we end up with some more initialization in the prolog, or where we miss containment opportunities due to locals no longer always being on the stack. Perf score seems to heavily favor the improvements (the regressions are rare and, when they happen, are smaller than the corresponding improvements).

There's follow-up work here to avoid DNER in a bunch of cases; currently, if we propagate the address of a struct field we won't always fold it directly to access the struct field, resulting in DNER. Not sure how prevalent the problem is, but I noticed it in some cases when looking into some of the improvements/regressions.
LGTM
```cpp
{
    if (m_assertions.Height() >= 64)
    {
        return;
```
Do you have any data on how often we drop assertions here? Might be worth noting this in the dump or something similar.
Looks like these contexts:
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\coreclr_tests.run.windows.x64.checked.mch # 66062 : IL size 617
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\coreclr_tests.run.windows.x64.checked.mch # 66094 : IL size 617
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\coreclr_tests.run.windows.x64.checked.mch # 30938 : IL size 1066
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\benchmarks.run.windows.x64.checked.mch # 9469 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests.run.windows.x64.Release.mch # 396680 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests.run.windows.x64.Release.mch # 399807 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168705 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168329 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168914 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 167285 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 168991 : IL size 2145
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\benchmarks.run.windows.x64.checked.mch # 7002 : IL size 2445
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 167037 : IL size 2445
[10:52:21] C:\dev\dotnet\spmi\mch\227e46fa-1be3-4770-b613-4a239e7c28aa.windows.x64\libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch # 169019 : IL size 2853
I'll add some JITDUMP. Also, I realized this code ends up adding to the map even if we ran out of assertion space. I'll fix that as well.
Can you check if this enables promotion of fields of stack-allocated objects in #11192?
Looks like it does, at least with the example there.

```csharp
public int PointClassTest()
{
    var p1 = new PointClass(4, 5);
    var p2 = new PointClass(3, 7);
    var result = AddClass(p1, p2);
    return result.X + result.Y;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
PointClass AddClass(PointClass x, PointClass y)
{
    return new PointClass(x.X + y.X, x.Y + y.Y);
}

record class PointClass(int X, int Y);
```

Before:

```asm
; Assembly listing for method C:PointClassTest():int:this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 13 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def <C>
;* V01 loc0 [V01,T04] ( 0, 0 ) long -> zero-ref class-hnd exact <C+PointClass>
;* V02 loc1 [V02,T05] ( 0, 0 ) long -> zero-ref class-hnd exact <C+PointClass>
;* V03 loc2 [V03,T06] ( 0, 0 ) long -> zero-ref class-hnd exact <C+PointClass>
;# V04 OutArgs [V04 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V06 tmp2 [V06 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V07 tmp3 [V07 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V08 tmp4 [V08 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
; V09 tmp5 [V09,T01] ( 2, 4 ) int -> rcx "impAppendStmt"
; V10 tmp6 [V10,T02] ( 2, 4 ) int -> rdx "impAppendStmt"
;* V11 tmp7 [V11 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <C+PointClass>
; V12 tmp8 [V12,T03] ( 2, 4 ) int -> rdx "Inlining Arg"
; V13 tmp9 [V13 ] ( 6, 6 ) struct (16) [rsp+0x28] do-not-enreg[XSF] addr-exposed "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
; V14 tmp10 [V14 ] ( 6, 6 ) struct (16) [rsp+0x18] do-not-enreg[XSF] addr-exposed "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
; V15 tmp11 [V15 ] ( 6, 6 ) struct (16) [rsp+0x08] do-not-enreg[XSF] addr-exposed "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
; V16 cse0 [V16,T00] ( 4, 4 ) long -> rax "CSE #01: aggressive"
;
; Lcl frame size = 56
G_M39843_IG01: ;; offset=0x0000
sub rsp, 56
;; size=4 bbWeight=1 PerfScore 0.25
G_M39843_IG02: ;; offset=0x0004
vxorps xmm0, xmm0, xmm0
vmovups xmmword ptr [rsp+0x28], xmm0
mov rax, 0x7FFB59900F30 ; C+PointClass
mov qword ptr [rsp+0x28], rax
mov dword ptr [rsp+0x30], 4
mov dword ptr [rsp+0x34], 5
vmovups xmmword ptr [rsp+0x18], xmm0
mov qword ptr [rsp+0x18], rax
mov dword ptr [rsp+0x20], 3
mov dword ptr [rsp+0x24], 7
mov ecx, dword ptr [rsp+0x30]
add ecx, dword ptr [rsp+0x20]
mov edx, dword ptr [rsp+0x34]
vmovups xmmword ptr [rsp+0x08], xmm0
mov qword ptr [rsp+0x08], rax
add edx, dword ptr [rsp+0x24]
mov dword ptr [rsp+0x10], ecx
mov dword ptr [rsp+0x14], edx
mov eax, dword ptr [rsp+0x10]
add eax, dword ptr [rsp+0x14]
;; size=111 bbWeight=1 PerfScore 21.58
G_M39843_IG03: ;; offset=0x0073
add rsp, 56
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 120, prolog size 4, PerfScore 23.08, instruction count 23, allocated bytes for code 120 (MethodHash=bb85645c) for method C:PointClassTest():int:this (FullOpts)
; ============================================================
```

This PR with

```asm
; Assembly listing for method C:PointClassTest():int:this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 13 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def <C>
;* V01 loc0 [V01 ] ( 0, 0 ) long -> zero-ref class-hnd exact <C+PointClass>
;* V02 loc1 [V02 ] ( 0, 0 ) long -> zero-ref class-hnd exact <C+PointClass>
;* V03 loc2 [V03 ] ( 0, 0 ) long -> zero-ref class-hnd exact <C+PointClass>
;# V04 OutArgs [V04 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V06 tmp2 [V06 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V07 tmp3 [V07 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V08 tmp4 [V08 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V09 tmp5 [V09 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V10 tmp6 [V10 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V11 tmp7 [V11 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp" <C+PointClass>
;* V12 tmp8 [V12 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
;* V13 tmp9 [V13 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;* V14 tmp10 [V14 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;* V15 tmp11 [V15 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "MorphAllocObjNodeIntoStackAlloc temp" <C+PointClass>
;* V16 tmp12 [V16 ] ( 0, 0 ) long -> zero-ref "V13.[000..008)"
;* V17 tmp13 [V17 ] ( 0, 0 ) int -> zero-ref "V13.[008..012)"
;* V18 tmp14 [V18 ] ( 0, 0 ) int -> zero-ref "V13.[012..016)"
;* V19 tmp15 [V19 ] ( 0, 0 ) long -> zero-ref "V14.[000..008)"
;* V20 tmp16 [V20 ] ( 0, 0 ) int -> zero-ref "V14.[008..012)"
;* V21 tmp17 [V21 ] ( 0, 0 ) int -> zero-ref "V14.[012..016)"
;* V22 tmp18 [V22 ] ( 0, 0 ) long -> zero-ref "V15.[000..008)"
;* V23 tmp19 [V23 ] ( 0, 0 ) int -> zero-ref "V15.[008..012)"
;* V24 tmp20 [V24 ] ( 0, 0 ) int -> zero-ref "V15.[012..016)"
;
; Lcl frame size = 0
G_M39843_IG01: ;; offset=0x0000
;; size=0 bbWeight=1 PerfScore 0.00
G_M39843_IG02: ;; offset=0x0000
mov eax, 19
;; size=5 bbWeight=1 PerfScore 0.25
G_M39843_IG03: ;; offset=0x0005
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 6, prolog size 0, PerfScore 1.25, instruction count 2, allocated bytes for code 6 (MethodHash=bb85645c) for method C:PointClassTest():int:this (FullOpts)
; ============================================================
```
Great to hear that!
Part of the problem with stack allocation of objects is figuring out how to avoid making all write barriers more costly. If we were to enable object stack allocation today, it would turn write barriers that are currently unchecked into checked ones. More work is needed to avoid this.
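To make the write-barrier point concrete, here is a rough illustration of my own (the names are hypothetical, not from the PR): today the JIT knows a reference-typed field store targets an object on the GC heap, so it can use an unchecked barrier; if the destination object could instead have been stack allocated, the checked form, which first tests whether the target address is in the GC heap, would be required.

```csharp
class Node
{
    public object Value;
}

static void Store(Node n, object o)
{
    // Storing a reference into a reference-typed field emits a GC write barrier.
    // If 'n' could have been stack allocated, an unchecked barrier would no longer
    // be safe here and a checked barrier would be needed instead.
    n.Value = o;
}
```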
Potential regressions:
This changes local morph to run in RPO when optimizations are enabled. It adds infrastructure to track and propagate LCL_ADDR values assigned to locals during local morph. This allows us to avoid address exposure in cases where the destination local does not actually end up escaping in any way.

Propagating the addresses works much like local assertion prop in morph does. Proving that the locals that were stored to do not escape afterwards is done with a simplistic approach: we check globally that no reads of the locals exist, and if so, we replace the `LCL_ADDR` stored to them by a constant 0. We leave it up to liveness to clean up the stores themselves. If we were able to remove any `LCL_ADDR` in this way, we then run an additional pass over the locals of the IR to compute the final set of exposed locals. This could be more sophisticated, but in practice this handles the reported cases just fine.

Fix #87072
Fix #102273
Fix #102518

This is still not sufficient to handle #69254. To handle that we would need more support around tracking the values of struct fields, and handling of promoted fields. This PR currently does not handle promoted fields at all; we use `lvHasLdAddrOp` as a conservative approximation of address exposure on the destination locals, and promoted structs almost always have this set. If we were to handle promoted fields we would need some other way to determine that a destination holding a local address couldn't be exposed.
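As a rough sketch of the kind of shape described here (my own illustration under the assumption that this resembles the #69254 case, and requiring an unsafe compilation context), a local address stored into a struct field that is then copied would need the assertion to survive the copy, and the field would need to be proven non-escaping:

```csharp
unsafe struct Wrapper
{
    public int* Ptr;
}

static unsafe int UseWrapper()
{
    int x = 10;
    Wrapper w = new Wrapper { Ptr = &x }; // local address stored into a struct field
    Wrapper copy = w;                     // an assertion about w.Ptr must transfer to copy.Ptr
    return *copy.Ptr;                     // and copy.Ptr must be proven not to escape
}
```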