Support 32 byte alignment of code on xarch #2249
Conversation
Do you have any data that show this actually helps, or is this written as a recommendation in some of the Intel docs?
I have some locally gathered data and there it helps quite a bit. We have also seen this in various coreclr issues, like dotnet/coreclr#22920. Intel docs suggest that 64 byte alignment might be preferable. I haven't tried this.
My thought was to add support for 32 byte alignment and then do a series of perf lab runs over time to prove or disprove the "improved stability" hypothesis.
src/coreclr/src/zap/zapinfo.cpp (Outdated)

```diff
@@ -1104,7 +1108,11 @@ void ZapInfo::allocMem(
     if (roDataSize > 0)
     {
-        if (flag & CORJIT_ALLOCMEM_FLG_RODATA_16BYTE_ALIGN)
+        if (flag & CORJIT_ALLOCMEM_FLG_RODATA_32BYTE_ALIGN)
```
I would not bother changing anything for zapper. We are trying to churn the old crossgen and there are a number of other changes needed to make it actually work (e.g. in NewAlignedBlob).

I understand why this helps micro-benchmark stability. I am worried about regressions in real world workloads. Real world workloads never fit into the instruction cache and are hitting instruction cache misses constantly. In the worst case, this will make the code 2x bigger and thus only half of the code will fit into the caches. What is the worst case regression we can expect from this?
I would be fine with having this under a switch that is turned on for micro-benchmark runs or investigations, similar to the existing JitAlignLoops.
Off by default seems sensible; we can give BDN the ability to turn this on and see if it fixes stability issues. But, for the sake of argument, if we wanted this on by default:
Doing this for methods with loops by default sounds reasonable to me.
When we did a similar exercise for x86 years back, we found that it is most profitable to align methods that are smaller than 16 bytes. It was not what one would expect intuitively. You can still find the code that does that in the JIT if you look for it.
For reference, the AMD Family 17h microarchitecture (Zen) optimization manuals (https://developer.amd.com/resources/developer-guides-manuals/) also have notes on 16- vs 32- vs 64-byte code/branch/loop alignment and packing.
Here's some local data, where I modified the jit to insert nop padding just after the prolog in a method with a simple 8 byte loop. Roughly speaking, perf suffers once the loop body crosses a 32 byte boundary. This is from an i7-8650U (Kaby Lake); I see similar results on an i9-9900T (Coffee Lake) and on an i7-6700 (Skylake). I can't quite explain why there are only 6 bad alignments instead of 7 (it seems like 25 should be slow too), but this gives you an idea of what can happen.

With current master, each method is 16 byte aligned or 32 byte aligned somewhat randomly. Over time (as other method code sizes change, etc.), perf of the loop can fluctuate even though codegen for the method with the loop is identical. We see this in the perf lab runs. And even within a single run, perf of two otherwise identical methods can differ.
I saw exactly the same effects when analyzing Windows and Linux code for the same benchmarks. If the body of a tight loop crosses a 32-byte boundary, there is a significant penalty. This is the effect of the µop cache. Chapter 9.3 here has a good description of what's happening: https://www.agner.org/optimize/microarchitecture.pdf
Note SPMI currently ignores the alignment flag passed to allocMem.

Failures look unrelated. I'm going to hold off on a merge for a few days since the GUID bump will invalidate SPMI collections and we're just getting back to leveraging those.
We're ready to take JIT-EE interface changes now (some internal processes dependent on the JIT-EE interface GUID are now stable). @AndyAyersMS Can you fix the conflicts and then merge this? |
Update jit and runtime to allow jit to ask for code to be 32 byte aligned. Request 32 byte alignment for Tier1 methods on x86/x64. Add minimal crossgen support; one can imagine requesting or choosing 32 byte alignment for crossgenned code, but that is left as future work. This should provide some measure of performance stability, in particular for microbenchmarks or other code where performance depends crucially on a few branches. It may or may not improve performance. If/when there are regressions we can contemplate updating the jit to add intra-method padding to address alignment sensitive code layout (e.g. dotnet/coreclr#11607). This will require a jit GUID update in addition to the changes here.
Force-pushed from 7a0cee3 to daf588a.
Rebased to get around merge conflict on jit GUID.

Failures are infra. Merging.
In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, methods having loops bypass tiering, and hence this condition was never executed for them. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.