-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Document describing upcoming object stack allocation work. #20251
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,199 @@ | ||
# Object Stack Allocation | ||
|
||
This document describes work to enable object stack allocation in .NET Core. | ||
|
||
## Motivation | ||
|
||
In .NET instances of reference types are allocated on the garbage-collected heap. | ||
Such allocations have performance overhead at garbage collection time. The allocator also has to ensure that the memory is fully zero-initialized. | ||
If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation | ||
may be moved to the stack. The benefits of this optimization: | ||
|
||
* The pressure on the garbage collector is reduced because the GC heap becomes smaller. The garbage collector doesn't have to be involved in allocating or deallocating these objects. | ||
* Object field accesses may become cheaper if the compiler is able to do scalar replacement of the fields of the stack-allocated object | ||
(i.e., if the fields can be promoted). | ||
* Some field zero-initializations may be elided by the compiler. | ||
|
||
Object stack allocation is implemented in | ||
various Java runtimes. This optimization is more important for Java since it doesn't have value types. | ||
|
||
## GitHub issues | ||
|
||
[roslyn #2104](https://github.com/dotnet/roslyn/issues/2104) Compiler should optimize "alloc temporary small object" to "alloc on stack" | ||
|
||
[coreclr #1784](https://github.com/dotnet/coreclr/issues/1784) CLR/JIT should optimize "alloc temporary small object" to "alloc on stack" automatically | ||
|
||
## Escape Analysis | ||
|
||
An object is said to escape a method if it can be accessed after the method's execution has finished. | ||
An object allocation can be moved to the stack safely only if the object doesn't escape the allocating method. | ||
|
||
Several escape algorithms have been implemented in different Java implementations. Of the 3 algorithms listed in [references](References), | ||
[[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) | ||
is the most precise and most expensive (it is based on connection graphs) and was used in the context of a static Java compiler, | ||
[[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf) | ||
is the least precise and cheapest (it doesn't track references through assignments of fields) and was used in MSR's Marmot implementation. | ||
[[2]](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf) | ||
is between | ||
[[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) and | ||
[[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf) | ||
both in analysis precision and cost. It was used in Java HotSpot. | ||
|
||
Effectiveness of object stack allocation depends in large part on whether escape analysis is done inter-procedurally. | ||
With intra-procedural analysis only, the compiler has to assume that arguments escape at all non-inlined call sites, | ||
which blocks many stack allocations. In particular, assuming that 'this' argument always escapes hurts the optimization. | ||
[[4]](http://www.ssw.uni-linz.ac.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf) describes an approach that | ||
handle objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You might also mention that some approaches are able to handle objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths -- ffor instance Partial Escape Analysis and Scalar Replacement for Java. So long as those escaping paths are indeed rare this can pay off. For instance, in the local delegate case there is an exception path that makes it look like the delegate can escape. In my prototype I managed to modify the importer to prove this code wasn't reachable, but in general we may see things like this.... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated the doc to mention this approach and added the paper to References. |
||
There are several choices for where escape analysis can be performed: | ||
|
||
### Analysis in the jit | ||
**Pros:** | ||
* The jit can analyze callee's code (subject to some restrictions, e.g., when running under profiler) | ||
since there are no versioning considerations at jit time. | ||
* The optimization will apply to any msil code regardless of the language compiler or msil post-processing tools. | ||
* The jit already has IR that's suitable for escape analysis. | ||
|
||
**Cons:** | ||
* The jit analyzes methods top-down, i.e., callers before callees (when inlining), which doesn't fit well with the stack allocation optimization. | ||
* Full interprocedural analysis is expensive for the jit, even at high tiering levels. Background on-demand/full interprocedural analysis might be feasible | ||
if we have the ability to memoize method properties with (in)validation. | ||
|
||
Possible approaches to interprocedural analysis in the jit: | ||
* We can run escape analysis concurrently with inlining and analyze callee's parameters for escaping while inspecting | ||
inline candidates. The results of such analysis can be cached. | ||
* We can adjust inlining heuristics to give more weight to candidates whose parameters have references to potentially | ||
stack-allocated objects. Inlining such methods may result in additional benefits if the jit can promote fields of the | ||
stack-allocated objects. | ||
* For higher-tier jit the order of method processing may be closer to bottom-up, i.e., callees before callers. That may | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How difficult would this approach be? Compiling methods bottom-up would help more optimizations, for example, if the callee does not use certain volatile registers, the callers would not need to save these registers on the stack (i.e., SIMD code on Unix/Linux). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The JIT is fundamentally top-down in its compilation model, as it compiles on-demand, and doesn't know the downstream call chain until either it is excecuted or inlined. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The idea here is that callees may be promoted to higher tier before callers. For example, if both A calls C and B calls C then it's possible that C will reach the count needed for promotion to the next tier before both A and B. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This situation looks really tricky, which may need more scenarios and data. AFAIK, JVM implementations usually get helps from the bytecode-level analysis. Perhaps, we can do something similar, e.g., collecting callee info by assembly loading or bytecode verification. |
||
help with running stack allocation optimization. | ||
|
||
### Analysis in ngen/crossgen | ||
**Pros:** | ||
* ngen/crossgen can afford to spend more time for escape analysis. | ||
* The jit already has IR that's suitable for escape analysis. | ||
|
||
**Cons:** | ||
* crossgen in Ready-To-Run mode is not running on generic methods that cross assembly boundaries. | ||
* crossgen in Ready-To-Run mode is not allowed to analyze code of methods from other assemblies. | ||
* newobj in Ready-To-Run mode is more abstract and introducing a hard dependence on ref class size and layout may interfere | ||
with version reseliency. | ||
|
||
### Analysis in ILLink | ||
**Pros:** | ||
* ILLInk can afford to spend more time for escape analysis. | ||
* For self-contained apps, ILLink has access to all of application's code and can do full interprocedural analysis. | ||
* ILLink is already a part of System.Private.CoreLib and CoreFX build toolchain so the assemblies built there can benefit | ||
from this. | ||
|
||
**Cons:** | ||
* The implementation will only benefit customers that have ILLink in their toolchain | ||
* ILLink operated on a view of metadata and raw msil instructions, it currently doesn't have a call graph or IR representation | ||
suitable for escape analysis. | ||
|
||
The results of escape analysis in the linker may be communicated to the jit by injecting an intrinsic call right before or after | ||
newobj for the object that was determined to be non-escaping. Note that assemblies may lose verifiability with this approach. | ||
An alternative is to annotate parameters with escape information so that the annotations can be verified by the jit with | ||
local analysis. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I've mentioned elsewhere in passing, the jit cannot generally rely on any AOT derived interprocedural information as ground truth. While that information may be true of the IL scanned by AOT, at runtime, because of profilers and the like, the jit may see different IL initially or new IL may arrive after jitting. Without the ability to revoke an arbitrary running method, the only current safe way to incorporate interprocedural information is via inlining. Inlines are tracked by the runtime and when a method body is updated, all the existing methods that are impacted are set for rejitting. Existing instances that are active continue to run and consistently use the old versions; new instances invoked after the IL update see only the new versions. So any AOT derived interprocedural information can at best be used as a strong hint to the jit and those facts must be re-verified by actually inlining. Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr). Given that method body updates are dynamically rare this hinting might be sufficient to expose the perf opportunities, but it means coupling this information into the inliner. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can the AOT derived analysis include list of methods that were used to produce the result? Then we can make this list logically inlined into the main method and the rest will work the same way as if the methods were physically inlined. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Imagine we have a long-running A that calls B every so often. We allow information about B to influence A's codegen. Then when B is modified we either need to fix up that running instance of A or prevent the old A from calling the new B. It is not enough to force any new call to A to be rejitted. Say for instance A passes B a struct implicitly by-ref and the AOT version of B doesn't modify the struct. So we take advantage of this and the initial version of A doesn't copy the struct each time A calls B. Then if the new B modifies the struct, the old A can't safely call the new B. So we either need to immediately revoke the old A or take pains to make sure the old A will still invoke the old B. We don't have these problems when B is inlined, as old As always "invoke" old Bs, and when B is updated and both A and B get rejitted, new As always invoke new Bs. And any place we don't inline, we also don't bake in any dependency on the callee. So even if A both inlines B at some sites and calls B at others we're ok. I haven't thought much about how to realistically support a system where code is versioned and we somehow keep straight which versions can safely call which other versions (note as above that this behavior is call site dependent, eg an old A may be able to safely call the new B at some sites but must call the old B at others...). Maybe it is viable? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I have missed this part. It would certainly be non-trivial to get this right. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, if profiling is always on in the first version we'll have to inline all methods the stack allocated object can be passed to. Unfortunately, that will complicate analysis of the perf implications when both inlining changes and stack allocation will be performed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Noted this in the document. |
||
If the methods whose info was used for interprocedural escape analysis are allowed to change after the analysis, the jit either needs | ||
to inline those methods or there should be a mechanism to immediately revoke methods with stack allocated objects that relied on | ||
that analysis. | ||
|
||
## Other restrictions on stack allocations | ||
|
||
* Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about objects that have a finalizer, but whose finalization is always suppressed? Though I don't know what is the likelihood of determining whether finalization is indeed suppressed, since it might require inlining of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or could the finalizer just be called synchronously similiar to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think at least for the initial version it's ok to be conservative here and not stack allocate objects with finalizers without worrying about whether they are suppressed. |
||
* Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is | ||
allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure how complex this would actually be. To me the question is whether this occurs frequently - if so, it would be very useful to do this analysis even if it doesn't escape, since the object could be reused across iterations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to clarify, the cases we would have to detect and disqualify are things like this:
This has to output
so we can't reuse the same stack slot for the allocation at each iteration. |
||
* Conditional object allocations (i.e., allocations that don't dominate the exit) need to be restricted to avoid growing the stack | ||
unnecessarily. A possible approach is turning such allocations to dynamic stack allocations. | ||
* There should be a limit on the maximum size of stack allocated objects. | ||
* There may be restrictions for objects with weak GC fields (this needs to be investigated). | ||
|
||
## GC considerations | ||
|
||
The jit is responsible for reporting references to heap-allocated objects to the GC. With stack-allocated objects present in a method | ||
a reference at a particular GC-safe point may be in one of 3 states: | ||
* The reference always points to a heap object: the reference should be reported as TYPE_GC_REF. | ||
* The reference always points to a stack object: the reference should not be reported to the GC. | ||
* The reference may point to a heap object or to a stack object depending on control flow: the reference should be reported as TYPE_GC_BYREF. | ||
|
||
All GC fields of stack-allocated objects have to be reported to the GC, the same as for fields of stack-allocated value classes. | ||
|
||
When a field of an object is modified the jit may need to issue a write barrier: | ||
* The reference always points to a heap object: normal write barrier should be used | ||
* The reference always points to a stack object: no write barrier is needed | ||
* The reference may point to a heap object or to a stack object depending on control flow: checked write barrier should be used | ||
|
||
## Existing Prototypes | ||
|
||
@echesakovMSFT implemented a [prototype](https://github.com/echesakovMSFT/coreclr/tree/StackAllocation) in 2016. | ||
|
||
The goal of the prototype was to have the optimization working end-to-end with a number of simplifications: | ||
* A simple intra-procedural escape analysis based on | ||
[[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) | ||
but without field edges in the connection graph. | ||
* All call arguments are assumed to be escaping. | ||
* Only simple objects are stack allocated, arrays of constant size are not analyzed. | ||
* Only objects that are allocated unconditionally in the method are moved to the stack. An improvement here would | ||
be allocating other objects dynamically on the stack. | ||
* If at least one object in a method is stack allocated, all objects are conservatively reported as as TYPE_GC_BYREF | ||
and a checked write barrier is used in the method. | ||
* All objects allocated on the stack also have a pre-header allocated. Pre-header is used for synchronization | ||
and hashing so we could eliminate it if we proved the object wasn't used for synchronization and hashing. | ||
|
||
We ran the prototype on System.Private.CoreLib via crossgen in 2016 and | ||
objects at 21 allocation sites were moved to the stack. | ||
|
||
We also ran an experiment where we changed the algorithm to optimistically assume that no arguments escape. | ||
The goal was to get an upper bound on the number of potential stack allocations. | ||
|
||
* 424 methods out of 6150 had allocations moved to the stack. | ||
* 586 allocation sites out of 7977 were moved to the stack. | ||
* One finding was that most sites were exception object allocations (5345), which almost never happen dynamically. | ||
* Excluding exception allocations we had 586 allocation sites out of 2632 that could be moved to the stack. | ||
So the upper bound from this experiment is 22.2%. | ||
|
||
@AndyAyersMS recently resurrected @echesakovMSFT work and used it to [prototype stack allocation of a simple delegate that's | ||
directly invoked](https://github.com/dotnet/coreclr/compare/master...AndyAyersMS:NonNullPlusStackAlloc). It exposed a number of things that need to be | ||
done in the jit to generate better code for stack-allocated objects. The details are in comments of | ||
[coreclr #1784](https://github.com/dotnet/coreclr/issues/1784). | ||
|
||
We did some analysis of Roslyn csc self-build to see where this optimization may be beneficial. One hot place was found in [GreenNode.WriteTo](https://github.com/dotnet/roslyn/blob/fab7134296816fc80019c60b0f5bef7400cf23ea/src/Compilers/Core/Portable/Syntax/GreenNode.cs#L647). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How hard is to fix this up to save this allocation? It looks pretty straightforward. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, implementing a simple struct version of a stack and using it there will remove these allocations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be interesting to look at which cases out of the ones you have listed are hard or impossible to fix by a simple local change. I think they would be the ones to focus on. Maybe the delegates? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Delegates is one case. Another common case is fixed-length array passed to, e.g., Console.WriteLine. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There's also a proposal to avoid that array allocation from the language side in dotnet/csharplang#1757 by allowing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just so i understand, in the WriteTo case, the proposed optimization would remove the allocation of the Stack object itself, but the underlying array it uses internally would still be heap allocated right? There's no concept of the "entire stack" structure somehow being able to fit on the real execution stack as long as possible, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the proposed optimization would remove the allocation of the Stack object itself. "Inlining" object fields into enclosing objects is beyond the scope of this. |
||
This object allocation accounts for 8.17% of all object allocations in this scenario. The number is not as impressive as a percentage | ||
of all allocated bytes: 0.67% (6.24 Mb out of 920.1 Mb) but it's just a single static allocation. | ||
Below is the portion of the call graph the escape analysis will have to consider when proving this allocation is not escaping. | ||
Green arrows correspond to the call sites that are inlined and red arrows correspond to the call sites that are not inlined. | ||
|
||
![Call Graph](../images/GreenNode_WriteTo_CallGraph.png) | ||
|
||
## Roadmap | ||
|
||
We will implement the optimization in the jit first and get as many cases as possible that don't require deep interprocedural analysis, | ||
e.g., the delegate cases from the prototype mentioned above. We will also try to take advantage of the more suitable order of method | ||
processing in higher-tier jit. | ||
|
||
The jit work includes removing the restrictions in @echesakovMSFT prototype, making escape analysis more sophisticated, making | ||
changes for producing better code for stack-allocated objects (some of which @AndyAyersMS discovered while working on his prototype), | ||
and updating inlining heuristics to help with object stack allocation. | ||
|
||
To get the maximum benefit from the optimization we will likely have to augment the jit analysis with more information. The information | ||
may come from manual annotations or from a tool analysis. ILLink or the upcoming [CPAOT](https://github.com/dotnet/corert/tree/r2r) | ||
may be appropriate places for ahead-of-time escape analysis. Self-contained applications will benefit the most from ILLink analysis | ||
but framework assemblies can also be analyzed and annotated even though cross-assembly calls will have to be processed conservatively. | ||
|
||
In this context an algorithm similar to [[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) can be used | ||
to get the most accurate escape results. | ||
|
||
The cost of adding some infrastructure to ILLink (call graph, proper IR, etc.) will be amortized if we do other IL-to-IL optimizations in the future. | ||
Also, we may be able to reuse the infrastructure from other projects, i.e., [ILSpy](https://github.com/icsharpcode/ILSpy/blob/da2f0d0b9143fb082a5529f78267fa36e8bf16f9/ICSharpCode.Decompiler/IL/ILReader.cs). | ||
|
||
## References | ||
|
||
[[1] Jong-Deok Choi at al. Stack Allocation and Synchronization Optimizations for Java Using Escape Analysis.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) | ||
|
||
[[2] Thomas Kotzmann and Hanspeter Moessenbroeck. Escape Analysis in the Context of Dynamic Compilation and Deoptimization](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf) | ||
|
||
[[3] David Gay and Bjarne Steensgaard. Fast Escape Analysis and Stack Allocation for Object-Based Programs](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf) | ||
|
||
[[4] Lukas Stadler at al. Partial Escape Analysis and Scalar Replacement for Java](http://www.ssw.uni-linz.ac.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing a "." after the
[2]
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.