Enable EVEX support by default #83648
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details: This does some cleanup to enable EVEX support by default and to ensure that relevant CI jobs exist to test the non-AVX512 paths.
@@ -304,7 +304,6 @@ CONFIG_INTEGER(EnableMultiRegLocals, W("EnableMultiRegLocals"), 1) // Enable the
 #if defined(DEBUG)
 CONFIG_INTEGER(JitStressEvexEncoding, W("JitStressEvexEncoding"), 0) // Enable EVEX encoding for SIMD instructions when
                                                                      // AVX-512VL is available.
-CONFIG_INTEGER(JitForceEVEXEncoding, W("JitForceEVEXEncoding"), 0)   // Force EVEX encoding for SIMD instructions.
Force isn't needed because we enable all ISAs in the AltJit by default, so we can "force" EVEX simply by running the AltJit.
This also ends up being safer since it doesn't require us to have AVX512-capable machines, as AltJit code won't be run by default.
Also noting that if we want to disable EVEX support, it's as simple as changing the following one line:
- CONFIG_INTEGER(EnableAVX512F, W("EnableAVX512F"), 1)
+ CONFIG_INTEGER(EnableAVX512F, W("EnableAVX512F"), 0)
This will have AVX512F be off by default but allow users to opt in by setting it to 1. So it's trivial to disable this if necessary due to any potential bugs or edge cases discovered after merge.
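For illustration, here is how the pieces fit together end to end (this just stitches together snippets already shown in this PR; the "default off" state is hypothetical, matching the one-line change above):

```cpp
// jitconfigvalues.h: hypothetical state after flipping the default shown above.
CONFIG_INTEGER(EnableAVX512F, W("EnableAVX512F"), 0)

// Elsewhere in the JIT, the knob gates whether the ISA is reported as supported:
if (JitConfig.EnableAVX512F() != 0)
{
    instructionSetFlags.AddInstructionSet(InstructionSet_AVX512F);
}
```

A user would then opt back in at run time with the usual config environment variable, DOTNET_EnableAVX512F=1.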
@@ -103,7 +103,7 @@ const char* CodeGen::genInsDisplayName(emitter::instrDesc* id)

    const emitter* emit = GetEmitter();

    if (emit->IsVexOrEvexEncodedInstruction(ins))
Encodable is a bit more accurate (than Encoded) since we're doing a "can it be encoded" check rather than a "must it be encoded" check.
bool emitter::IsVexOrEvexEncodableInstruction(instruction ins) const
{
-   return IsVexEncodedInstruction(ins) || IsEvexEncodedInstruction(ins);
+   if (!UseVEXEncoding())
+   {
+       return false;
+   }
+
+   insFlags flags = CodeGenInterface::instInfo[ins];
+   return (flags & (Encoding_VEX | Encoding_EVEX)) != 0;
}
This is a little bit faster than checking them independently. We don't need to check UseEVEXEncoding() since we'll end up filtering by TakesEvexPrefix(id), which in turn duplicates the IsEvexEncodableInstruction check.

We can't check just VEX since we also need to return true for EVEX-only instructions.
src/coreclr/jit/emitxarch.cpp
        return false;
    }

-   if (!emitComp->DoJitStressEvexEncoding())
+   if (HasHighSIMDReg(id) || (id->idOpSize() == OPSZ64) || HasKMaskRegisterDest(ins))
The IsEvexEncodableInstruction check excludes the kmask instructions, since they are VEX-only. This simplifies the checks we do here since it means we can just check for kmask usage in general.
    }

-   return IsVexEncodedInstruction(ins);
+   return IsVexEncodableInstruction(ins) && (ins != INS_vzeroupper);
I think we can probably remove the ins != INS_vzeroupper check, but it might require a couple of additional tweaks to get working, so I'm leaving that to a future PR.
    {
-       if (TakesEvexPrefix(id))
+       if (TakesEvexPrefix(id) && codeEvexMigrationCheck(code)) // TODO-XArch-AVX512: Remove codeEvexMigrationCheck().
We might actually be able to remove the codeEvexMigrationCheck(code) at this point, but I'm likewise leaving that to a future PR.
if (JitConfig.EnableAVX512F() != 0)
{
    instructionSetFlags.AddInstructionSet(InstructionSet_AVX512F);
}
This and the change below are what allow AVX512 to light up in the AltJit.
// x86-64-v4 feature level supports AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
// These have been shipped together historically and at the time of this writing
// there exists no hardware which doesn't support the entire feature set. To simplify
// the overall JIT implementation, we currently require the entire set of ISAs to be
// supported and disable AVX512 support otherwise.
Like the comment says, we want to support AVX512 only if all of AVX512F, AVX512BW, AVX512CD, AVX512DQ, and AVX512VL are supported. These ISAs form the x86-64-v4 baseline, and no hardware has ever shipped without all of them.

Some examples of things we'd otherwise have to consider: legacy-encoded xorps is SSE and VEX-encoded vxorps is AVX. However, EVEX-encoded xorps is AVX512DQ. Likewise, the EVEX support for XMM/YMM based xorps is then AVX512DQ + AVX512VL (AVX512DQ_VL).

Supporting this "properly" would require us to add some fairly complex checks to the importer and likely to LSRA to handle the differences and ensure that a suitable fallback is generated. We could write all the logic to support that, but without such hardware existing it would be "needless" overhead and negatively impact throughput. So it's much easier to just have the JIT disable AVX512 entirely if any of the "core" ISAs are unsupported.
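As a rough sketch of that all-or-nothing gating (illustrative only; the RemoveInstructionSet helper and the surrounding variable names are assumptions, not necessarily the exact code in this change):

```cpp
// Sketch: collapse AVX-512 support entirely unless the full x86-64-v4 AVX-512 set is present.
bool hasAllBaselineAvx512 =
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512F) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512BW) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512CD) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512DQ) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512VL);

if (!hasAllBaselineAvx512)
{
    // Treat the machine as having no AVX-512 at all so the rest of the JIT never
    // has to reason about partial AVX-512 support (hypothetical helper name).
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512F);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512BW);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512CD);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512DQ);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512VL);
}
```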
@dotnet/jit-contrib It is also worth noting, in addition to the analysis given above, that the TP "regression" is only for hardware with AVX-512 support (which all of our CI machines currently have). The TP for hardware without AVX-512 will be effectively what it was prior to this PR. That is, this PR isn't really changing the code we execute when AVX-512 is disabled, as all the paths already existed and were just being filtered on whether or not the JIT said it supported AVX-512.

For AVX-512 enabled hardware, with this PR we're implicitly using AVX-512 already (as evidenced by the -40k diff to disassembly). This applies to scalar floating-point code as well as 128-bit and 256-bit SIMD code. This codegen will only improve further as we add more AVX-512 support and relevant optimizations, and further still as we add explicit optimizations to the libraries.

For context save/restore, Windows already handles this efficiently via its own APIs, and the GC has had AVX-512 support for a couple of years now, so the native side is already paying any "additional cost". For Unix, there are some things we could improve, but that entails more in-depth changes to use the native mechanics rather than trying to abstract things over the general Win32 API surface area.
The diffs show especially significant TP regressions for MinOpts -- as much as +4.26% on the MinOpts parts of libraries.pmi.windows.x64.checked.mch. I wouldn't think that would be explained by the LSRA effect of the additional registers. Can you disable the AVX-512 code using DOTNET_EnableAVX512F=0 and collect TP numbers for that configuration?
Yes, and this is functionally what happens implicitly for hardware without AVX512 support.

This doesn't "quite" work as-is, though, since SPMI doesn't pick up the JIT-side knobs by default. In order to get the below numbers I temporarily changed the check to the following so SPMI would work:

if (instructionSetFlags.HasInstructionSet(InstructionSet_AVX512BW_VL) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512CD_VL) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512DQ_VL) &&
+   JitConfig.EnableAVX512F())

I can submit a follow-up/separate PR to enable the AltJit to also disable ISAs that are supported by the hardware and to ensure the JIT-side knobs work by default with SPMI.

The remaining perf difference comes from the newer compiler (~0.03-0.05%) and various changes in emitxarch, for example changing from the old IsVexOrEvexEncodedInstruction checks to the new flags-based Encodable checks.

DOTNET_EnableAVX512F=1 (this PR, AVX512-enabled hardware)

Warning: Different compilers used for base and diff JITs. Results may be misleading.

Overall (+0.25% to +2.01%)
MinOpts (+0.77% to +4.25%)
FullOpts (+0.25% to +1.63%)

DOTNET_EnableAVX512F=0 (this PR, non-AVX512 hardware)

Warning: Different compilers used for base and diff JITs. Results may be misleading.

Overall (+0.21% to +0.41%)
MinOpts (+0.52% to +0.79%)
FullOpts (+0.18% to +0.25%)
I believe you're underestimating the impact the additional registers have on LSRA.

This also tracks with the regression on x86 being much less (up to 1.5% rather than up to 4.26%), since x86 is only going from 16 registers (8 integer + 8 floating) to 24 registers (8 integer + 8 floating + 8 mask), i.e. 50% more registers. The 8 places in LSRA that iterate over the full register set each now execute proportionally more instructions.
However, as the profile shows, these aren't "hot paths" for the code, so while we are executing more instructions, they don't actually cause any real change to wall clock time. You have to factor in not only how many instructions execute, but where they execute and what they actually cost.

As a corollary example, see #83479. That was an extremely simple change, yet it showed up as a measurable difference in the SPMI TP metric.

None of the increases to CG2 wall clock time have been due to the AVX512 support added so far. I don't expect that to change with this PR either. However, if it does, we have some identified real hot spots from the profiles above that show us where we can most easily win that back.
The -0.28% was in libraries.crossgen2 MinOpts. There are 15 of those contexts, so it is not really a representative sample (indeed, we wouldn't expect many prejitted MinOpts contexts). The overall impact of that change on crossgen2 collections was much smaller, and certainly within the variance of crossgen2 throughput (which is large -- several percent).

I agree that actual profilers can give better data for the "macro" changes, but I think doing so on an SPMI run would have much less variance. Have we spent time with @SingleAccretion's PIN tool on this change yet? Can you post the detailed breakdown from that?
The PIN tool uses Intel PIN, which is exactly what VTune uses under the covers. AMD uProf uses a similar set of model-specific registers for tracking precise instruction counts as well. It is not going to give any additional information beyond what the hardware-specific profilers are reporting, which I already shared above.

If someone else wants to spend the extra cycles to build a local copy of the tool, collect some metrics, etc., then please feel free. I have several other work items I need to focus on and have already given significant data showing that this is likely not going to be problematic in practice. In the off chance that it is measurably problematic (which will be quickly caught in the weekly perf triage), then we have a one-line change that will allow us to disable AVX512 support and investigate this more in depth at that time.
Surprised to see any MinOpts contexts in crossgen at all; I guess those are methods explicitly marked to skip optimization.
The general point of the example was that we have made a plethora of changes in the past few months that have impacted the SPMI-tracked TP metric, both in terms of "regressions" and in terms of "improvements", for both MinOpts and optimized code. Despite this, and despite many of them being "significant" changes to the TP percentage, there has been no "real" change to the TP performance of crossgen2. The only real regression was caused by the introduction of a non-trivial amount of new managed code, and the only real improvement was caused by code simplification in part of the JIT's front end on a PR that did have some measured TP impact.

Instructions also range widely in cost, so raw instruction counts don't translate directly into wall clock time.
Our crossgen2 benchmarks don't measure TP for tier0, do they? The problem the team was worried about is that Tier0 regressions are way more important than Tier1/CG/NAOT, because every Tier0 compilation by definition means that application execution is stopped in a stub, waiting for the JIT to finish. So the more time we spend in Tier0 jitting, the slower the startup. It's way less important for non-MinOpts, where we promote methods asynchronously. Although I agree with you that the PIN tool doesn't directly map to actual overhead.

@kunalspathak, can we kick off a TE run from this PR via the bot to see how "startup time / time to first response" changes? (Ideally with ReadyToRun=0, if possible.)
They do not by default, but many of the tracked TP improvements/regressions have applied equally or even more to optimized code than to MinOpts. I've also explicitly measured crossgen of debug corelib using the release JIT, which does do MinOpts for everything.

We have extremely good CPI in LSRA, and while it is overall one of the hotter (in terms of total actual cycles spent executing) pieces of code in the JIT currently, an increase in instructions there is much less consequential than changes to paths with worse CPI. It also means that improving some of the other hot spots the profiles call out would give a bigger win. We similarly spend a significant amount of time in a few other routines that the profiles identify.

All of these are relatively low-hanging fruit that would provide real, measurable improvement to the JIT and are a significantly better use of our time.
/benchmark json aspnet-citrine-win runtime
Benchmark started for json on aspnet-citrine-win with runtime. Logs: link
I don't think that is possible via the bot, short of doing a manual run.
Not sure why the results are not automatically posted, but here is what I get from the link.
So I presume, then, that we can't draw any conclusions from the aspnet perf data?
Not without more runs, due to the existing noisiness. This PR in the results above is actually showing better time to first request by 9ms and better overall latency up through P99, but also 2ms slower build time and 4ms slower start time. This is all within the existing standard deviation, so we can at least speculate that there isn't a "substantial" improvement or regression.
I generated a PIN diff (using the https://github.com/SingleAccretion/Dotnet-Runtime.Dev#analyze-pin-trace-diffps1---diff-the-traces-produced-by-the-pin-tool script) with a baseline of this PR with AVX512F disabled and this PR as the diff, using the libraries.pmi.windows.x64.checked.mch collection. These are all of the regressions (there were also per-function improvements), almost all in LSRA, as expected due to the additional registers:
A PIN per-function diff between this PR and the baseline:
I feel like we've analyzed this enough and there is data here that can drive future TP improvement activities.
Right, this is basically what I covered above:
This should get fixed with PGO data, but if it still isn't outlined explicitly after the next PGO update, we can refactor it a bit so the core check is inlined and the more complex bit of logic is not.
I did some wall clock measurements of ASP.NET tier0 contexts. The results were the following, in milliseconds, taken from SPMI's output:

base = {25776.552800, 26043.440300, 25901.721800, 25787.304400,
        26002.092800, 25677.519400, 25641.274600, 25619.626000,
        25898.820300, 26292.012000};
diff = {26312.449900, 26448.550100, 26414.532700, 26370.734300,
        26243.087700, 26335.908700, 26299.542700, 26181.570600,
        26385.097500, 26297.585800};

The 95% confidence interval of this is a regression in wall clock time of between 1.2% and 2.4%. This includes time spent in the SPMI part of the JIT-EE interface, so the real regression is a bit higher (though probably not much; IIRC we spend around 10-15% of SPMI replay in SPMI itself).

Just some more data to help prioritize follow-up work. Maybe open an issue to track recouping some of this, with an appropriate milestone?
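For reference, here is a self-contained sketch of the interval math implied above (the use of a Welch two-sample interval and the hardcoded t critical value are assumptions about how the quoted numbers were derived, not the actual tooling used):

```cpp
// Computes the mean regression and an approximate 95% confidence interval for the
// wall clock numbers above, expressed as a percentage of the base mean.
#include <cmath>
#include <cstdio>
#include <vector>

static double mean(const std::vector<double>& v)
{
    double sum = 0.0;
    for (double x : v)
        sum += x;
    return sum / v.size();
}

static double sampleVariance(const std::vector<double>& v, double m)
{
    double ss = 0.0;
    for (double x : v)
        ss += (x - m) * (x - m);
    return ss / (v.size() - 1);
}

int main()
{
    std::vector<double> base = {25776.5528, 26043.4403, 25901.7218, 25787.3044, 26002.0928,
                                25677.5194, 25641.2746, 25619.6260, 25898.8203, 26292.0120};
    std::vector<double> diff = {26312.4499, 26448.5501, 26414.5327, 26370.7343, 26243.0877,
                                26335.9087, 26299.5427, 26181.5706, 26385.0975, 26297.5858};

    double mb    = mean(base);
    double md    = mean(diff);
    double se    = std::sqrt(sampleVariance(base, mb) / base.size() +
                             sampleVariance(diff, md) / diff.size());
    double tCrit = 2.18; // approximate two-sided 95% t value for ~12 Welch degrees of freedom (assumption)
    double delta = md - mb;

    printf("regression: %.2f%% (95%% CI %.2f%% .. %.2f%%)\n", 100.0 * delta / mb,
           100.0 * (delta - tCrit * se) / mb, 100.0 * (delta + tCrit * se) / mb);
    return 0;
}
```

With the numbers above this prints roughly a 1.8% mean regression with an interval of about 1.2% to 2.4%, matching the quoted range.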