[arm/arm64] Leaf frames, saving LR, and return address hijacking #35274
Comments
Does the simple method in your example really need to establish fp?
Yes, the current JIT heuristics around GC suspension really only guarantee that the GC suspension has a chance of succeeding. They do not guarantee that the GC suspension happens in a timely manner. It is not hard to construct real-world-looking code that hits issues like this. It is something we should be working towards fixing, but I am not sure we would want to fix this by generating fully interruptible code for all loops. It needs data about the tradeoffs.
Sounds fine to me. It needs data about the tradeoffs.
E.g., an interesting data point would be: if we stopped generating the frame for small functions (i.e. smaller code) and started generating fully interruptible code for all loops (i.e. bigger GCInfo), would the CoreLib native image be smaller or bigger overall?
It seems like it doesn't need to: I think we establish it for ETW on Windows, and for stack walking (maybe just gdb debugging) on Linux. I'm not sure if it's required for correctness.
I think fully-versus-partially interruptible is decided on a per-function basis; we don't do it just for particular loops or code regions. The question is whether the GC info supports it on a region basis. Assuming it is supported (or could be engineered), then gathering that data would be interesting. Question: as I understand it, GC suspension is, very roughly:
How many "tries" (or loops) normally happen before we succeed? (And how costly is it?) Would changing our fully interruptible / call / loop strategy make GC suspension faster and more reliable?
Just to share one data point: in the R2R-generated code of the framework libraries for x64, 23% of methods have no prolog/epilog. These methods could be optimized for arm64 as well.
I believe that the GC Info encoding (the non-x86 version at least) does support it on a region basis. Look for
You can gather the data even with the setting applied to the whole method: it will give us an upper bound.
Here is a more accurate version:
The typical case is that we go through one iteration of the loop, and by the time we get to the second iteration all threads have hit the return address hijack or some other GC safe spot. The typical methods are small, so they hit the return address hijack quickly. Fully interruptible code is important for long-running loops. It is important for predictable performance: the 98+% latencies, etc.
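The retry behavior described above can be sketched as a toy model. Everything here (the names, the `Thread` struct, the self-unhijack shortcut) is illustrative only; the runtime's actual suspension logic is considerably more involved:

```cpp
#include <vector>

// Toy model of the GC suspension retry loop. A thread is "safe" when it is
// stopped at a fully interruptible point or a GC-safe call site.
struct Thread {
    bool atSafePoint;
    bool hijacked;    // return address patched to point at a hijack stub
};

// Stand-in for patching the saved return address (a stack slot when lr was
// spilled in the prolog, or lr itself for a frameless leaf method).
static void HijackReturnAddress(Thread& t) { t.hijacked = true; }

// Returns true once every thread is at a GC safe point; false models
// starvation (e.g. a long-running, only partially interruptible loop).
bool SuspendForGC(std::vector<Thread>& threads, int maxAttempts = 100) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        bool allStopped = true;
        for (Thread& t : threads) {
            if (t.atSafePoint) continue;
            allStopped = false;
            if (!t.hijacked) HijackReturnAddress(t);
        }
        if (allStopped) return true; // typical case: one or two iterations
        // Model of letting threads run into their hijacks before rechecking:
        for (Thread& t : threads) {
            if (t.hijacked) { t.atSafePoint = true; t.hijacked = false; }
        }
    }
    return false;
}
```

In this toy, a hijacked thread always reaches its hijack before the next check, which is why it terminates on the second iteration, mirroring the "typical case" described above.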
One interesting thought by @filipnavara is that partially interruptible leaf methods may not need to support external unhijacking: once hijacked in such a method, a thread will inevitably unhijack itself, either by hitting the hijack or by exiting the frame exceptionally (e.g. dereferencing NULL). There is never a need to move a hijack lower if it is already on a leaf call, and that could be a useful property to consider.
@EgorBo and I made a quick prototype earlier this week. It implements the NativeAOT side of the LR-based hijacking and some trivial support in the JIT to recognize leaf methods and emit them as frameless. Notably, it does NOT include the CoreCLR bits or any logic to handle R2R versioning. It was only tested on osx-arm64. The purpose of the prototype was two-fold: first, to show that the hijacking logic is relatively easy to implement (~100 lines of code); second, to measure the size impact of the change on a typical executable. The hijacking logic was tested on two use cases:
Moreover, all NativeAOT smoke tests were run several times with the change and passed with no errors.

Now, for the size data. I only checked two trivial cases: nativeaot/SmokeTests/UnitTests and an empty .NET MAUI app. In both cases I measured the size change of the object file in the ILC output, and the number of frameless entries in the compact unwinding table. The latter measurement is an imprecise approximation, since it doesn't count methods that have no frame but do have the stack pointer adjusted, and there are at least a few of those present.

For the empty .NET MAUI app, the object file shrank by 27,896 bytes; at least 2,167 methods out of 33,785 are now frameless (6%). For UnitTests, the object file shrank by 16,496 bytes; at least 1,389 methods out of 10,511 are now frameless (13%).

The size savings would be slightly bigger on Linux/Windows due to the larger impact on the size of the unwinding data. On macOS and other Mach-O platforms we end up generating compact unwinding data, which should not amount to any size difference in this case.

I didn't measure any performance data. I expect that in most cases the methods that would end up as frameless leaf methods would be inlined. There are at least some
The prototype looks very convincing!
We are likely to end up bumping the min R2R version for 10.0 anyway for a few other reasons (e.g. doing #104336 in 10.0 would also require an R2R min version bump, and maybe some other changes too). NativeAOT is more approachable for experimentation in this area, since hijacking is done in a more straightforward way. We will work towards making CoreCLR similar, but so far that work has been incremental and there are still differences that make similar experimenting with CoreCLR a bit more unwieldy.
Another example that often works as a good proxy for a "mid-size average app" is
I can help with the CoreCLR counterpart. The codemanager changes are fairly small, though, so porting them could be relatively straightforward.
I don't have the size data, but at least 9,860 methods out of 82,797 become frameless with the prototype in that test. I have a different branch where I improved the compact unwind encoding and got measurements for the frameless methods with an SP offset. In the MAUI sample it was a single-digit number (7 IIRC), so it's not really statistically significant.
I gave it an hour or two a few days ago. I was a bit discouraged that some of the hijacking code was scattered all over the place. I don't expect it to be hard to implement, but it could certainly benefit from some cleanup. Any help is definitely welcome. I've made a couple of prototypes in the last few days to address some long-standing issues and optimization opportunities. I'm currently gauging interest to see what sticks and what is worth pursuing in the .NET 10 release cycle.
During the original development of the arm32 product, it was decided that the lr register would always be stored to the stack in the prolog, to support return address hijacking for GC suspension. There is a comment to this effect in the JIT (in CodeGen::genPushCalleeSavedRegisters()). This decision was maintained when arm64 support was added.
Should this decision be reconsidered?
For arm64, empty methods have this minimum code size:
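A minimal sketch of what such an empty method looks like on arm64, assuming the standard fp/lr frame is always established (illustrative, reconstructed rather than copied from the issue):

```asm
stp fp, lr, [sp, #-0x10]!   // prolog: save fp/lr, allocate 16 bytes
mov fp, sp                  // establish the frame pointer
ldp fp, lr, [sp], #0x10     // epilog: restore fp/lr, free the frame
ret                         // 4 instructions, 16 bytes total
```

Without the frame requirement, an empty method could be a single `ret`, with the return address staying in lr.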
Question: if function A were a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (that is fully interruptible?), then the expensive loop is partially interruptible due to the call. But there will be only one instruction in the call that is a GC safe point (or maybe two?). Isn't GC starvation likely in this scenario?

Even simple leaf methods require this prolog/epilog, e.g.:
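As an illustration of the leaf-method case, consider a trivial add method; the code below is a hedged reconstruction of typical RyuJIT arm64 output, not copied from the issue:

```asm
stp fp, lr, [sp, #-0x10]!   // prolog kept only for hijacking/stack walking
mov fp, sp
add w0, w0, w1              // the one instruction that does the work
ldp fp, lr, [sp], #0x10     // epilog
ret
```

If frameless leaf methods were allowed, this could shrink to just `add w0, w0, w1; ret`.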
One argument to keep this is that simple leaf methods are likely to be inlined into their callers and therefore this prolog/epilog overhead isn't encountered in real programs.
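For concreteness, the starvation scenario raised above can be sketched as follows. This is illustrative C++ (the real scenario would be managed code); A, Trivial, and all details are hypothetical:

```cpp
#include <cstdint>

static void Trivial() {
    // In partially interruptible code, the return from this call is the
    // only place per iteration where a return-address hijack can land.
}

int64_t A(const int64_t* data, int64_t n) {
    int64_t acc = 1;
    for (int64_t i = 0; i < n; ++i) {
        // Stand-in for ~1000 expensive divides with no GC safe point
        // anywhere between them.
        acc += data[i] / (acc | 1);
        Trivial(); // sole GC safe point in the loop body
    }
    return acc;
}
```

If the loop body is expensive relative to the call, a suspending GC can spend a long time waiting for each thread to reach that single safe point.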
The overhead measurement (<10k in the above comment for mscorlib => System.Private.CoreLib) could be recomputed for arm64 and the current situation, and could include measurements of other libraries as well.
There might also be implications of not saving and establishing fp for debugging, stack walking, etc.

Comments? @jkotas @AndyAyersMS @kunalspathak
category:question
theme:prolog-epilog
skill-level:expert
cost:large