Enable EVEX support by default #83648
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details: This does some cleanup to enable EVEX support by default and to ensure that relevant CI jobs exist to test the non-AVX512 paths.
@@ -304,7 +304,6 @@ CONFIG_INTEGER(EnableMultiRegLocals, W("EnableMultiRegLocals"), 1) // Enable the
 #if defined(DEBUG)
 CONFIG_INTEGER(JitStressEvexEncoding, W("JitStressEvexEncoding"), 0) // Enable EVEX encoding for SIMD instructions when
                                                                      // AVX-512VL is available.
-CONFIG_INTEGER(JitForceEVEXEncoding, W("JitForceEVEXEncoding"), 0)   // Force EVEX encoding for SIMD instructions.
Force isn't needed because we enable all ISAs in the AltJit by default, so we can "force" EVEX simply by running the AltJit.
This also ends up being safer since it doesn't require us to have AVX512-capable machines, as AltJit code won't be run by default.
Also noting that if we want to disable EVEX support, it's as simple as changing the following one line:
- CONFIG_INTEGER(EnableAVX512F, W("EnableAVX512F"), 1)
+ CONFIG_INTEGER(EnableAVX512F, W("EnableAVX512F"), 0)
This will have AVX512F be off by default but allow users to opt in by setting it to 1. So it's trivial to disable this if necessary due to any potential bugs or edge cases discovered after merge.
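For illustration, here is how the pieces fit together end to end (this just stitches together snippets already shown in this PR; the "default off" state is hypothetical, matching the one-line change above):

```cpp
// jitconfigvalues.h: hypothetical state after flipping the default shown above.
CONFIG_INTEGER(EnableAVX512F, W("EnableAVX512F"), 0)

// Elsewhere in the JIT, the knob gates whether the ISA is reported as supported:
if (JitConfig.EnableAVX512F() != 0)
{
    instructionSetFlags.AddInstructionSet(InstructionSet_AVX512F);
}
```

A user would then opt back in at run time with the usual config environment variable, DOTNET_EnableAVX512F=1.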
@@ -103,7 +103,7 @@ const char* CodeGen::genInsDisplayName(emitter::instrDesc* id)

    const emitter* emit = GetEmitter();

    if (emit->IsVexOrEvexEncodedInstruction(ins))
Encodable is a bit more accurate (than Encoded) since we're doing a "can it be encoded" check rather than a "must it be encoded" check.
bool emitter::IsVexOrEvexEncodableInstruction(instruction ins) const
{
-   return IsVexEncodedInstruction(ins) || IsEvexEncodedInstruction(ins);
+   if (!UseVEXEncoding())
+   {
+       return false;
+   }
+
+   insFlags flags = CodeGenInterface::instInfo[ins];
+   return (flags & (Encoding_VEX | Encoding_EVEX)) != 0;
}
This is a little bit faster than checking them independently. We don't need to check UseEVEXEncoding() since we'll end up filtering by TakesEvexPrefix(id), which in turn duplicates the IsEvexEncodableInstruction check.

We can't check just VEX since we also need to return true for EVEX-only instructions.
src/coreclr/jit/emitxarch.cpp
        return false;
    }

-   if (!emitComp->DoJitStressEvexEncoding())
+   if (HasHighSIMDReg(id) || (id->idOpSize() == OPSZ64) || HasKMaskRegisterDest(ins))
The IsEvexEncodableInstruction check excludes the kmask instructions, since they are VEX-only. This simplifies the checks we do here since it means we can just check for kmask usage in general.
    }

-   return IsVexEncodedInstruction(ins);
+   return IsVexEncodableInstruction(ins) && (ins != INS_vzeroupper);
I think we can probably remove the ins != INS_vzeroupper check, but it might require a couple of additional tweaks to get working, so I'm leaving that to a future PR.
    {
-       if (TakesEvexPrefix(id))
+       if (TakesEvexPrefix(id) && codeEvexMigrationCheck(code)) // TODO-XArch-AVX512: Remove codeEvexMigrationCheck().
We might actually be able to remove the codeEvexMigrationCheck(code) at this point, but I'm likewise leaving that to a future PR.
if (JitConfig.EnableAVX512F() != 0)
{
    instructionSetFlags.AddInstructionSet(InstructionSet_AVX512F);
}
This and the change below are what allow AVX512 to light up in the AltJit.
// x86-64-v4 feature level supports AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
// These have been shipped together historically and at the time of this writing
// there exists no hardware which doesn't support the entire feature set. To simplify
// the overall JIT implementation, we currently require the entire set of ISAs to be
// supported and disable AVX512 support otherwise.
Like the comment says, we want to support AVX512 only if all of AVX512F, AVX512BW, AVX512CD, AVX512DQ, and AVX512VL are supported. These ISAs form the x86-64-v4 baseline, and no hardware has ever shipped without all of them.

Some examples of things we'd otherwise have to consider: legacy-encoded xorps is SSE and VEX-encoded vxorps is AVX. However, EVEX-encoded xorps is AVX512DQ. Likewise, the EVEX support for XMM/YMM based xorps is then AVX512DQ + AVX512VL (AVX512DQ_VL).

Supporting this "properly" would require us to add some fairly complex checks to the importer and likely to LSRA to handle the differences and ensure that a suitable fallback is generated. We could write all the logic to support that, but without such hardware existing it would be "needless" overhead and negatively impact throughput. So it's much easier to just have the JIT disable AVX512 entirely if any of the "core" ISAs are unsupported.
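As a rough sketch of that all-or-nothing gating (illustrative only; the RemoveInstructionSet helper and the surrounding variable names are assumptions, not necessarily the exact code in this change):

```cpp
// Sketch: collapse AVX-512 support entirely unless the full x86-64-v4 AVX-512 set is present.
bool hasAllBaselineAvx512 =
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512F) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512BW) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512CD) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512DQ) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512VL);

if (!hasAllBaselineAvx512)
{
    // Treat the machine as having no AVX-512 at all so the rest of the JIT never
    // has to reason about partial AVX-512 support (hypothetical helper name).
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512F);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512BW);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512CD);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512DQ);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_AVX512VL);
}
```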
@dotnet/jit-contrib It is also worth noting, in addition to the analysis given above, that the TP "regression" is only for hardware with AVX-512 support (which all of our CI machines currently have). The TP for hardware without AVX-512 will be effectively what it was prior to this PR. That is, this PR isn't really changing the code we execute when AVX-512 is disabled, as all the paths already existed and were just being filtered on whether or not the JIT said it supported AVX-512.

For AVX-512 enabled hardware, with this PR we're implicitly using AVX-512 already (as evidenced by the -40k diff to disassembly). This applies to scalar floating-point code as well as 128-bit and 256-bit SIMD code. This codegen will only improve further as we add more AVX-512 support and relevant optimizations, and further still as we add explicit optimizations to the libraries.

For context save/restore, Windows already handles this efficiently via its own APIs, and the GC has had AVX-512 support for a couple of years now, so the native side is already paying any "additional cost". For Unix, there are some things we could improve, but that entails more in-depth changes to use the native mechanics rather than trying to abstract things over the general Win32 API surface area.
The diffs show especially significant TP regressions for MinOpts -- as much as +4.26% on the MinOpts parts of libraries.pmi.windows.x64.checked.mch. I wouldn't think that would be explained by the LSRA effect of the additional registers. Can you disable the AVX-512 code using DOTNET_EnableAVX512F=0 and collect TP numbers for that configuration?
Yes, and this is functionally what happens implicitly for hardware without AVX512 support.

This doesn't "quite" work as-is, though, since SPMI doesn't pick up the JIT-side knobs by default. In order to get the below numbers I temporarily changed the check to the following so SPMI would work:

if (instructionSetFlags.HasInstructionSet(InstructionSet_AVX512BW_VL) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512CD_VL) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512DQ_VL) &&
+   JitConfig.EnableAVX512F())

I can submit a follow-up/separate PR to enable the AltJit to also disable ISAs that are supported by the hardware and to ensure the JIT-side knobs work by default with SPMI.

The remaining perf difference comes from the newer compiler (~0.03-0.05%) and various changes in emitxarch, for example changing from the old IsVexOrEvexEncodedInstruction checks to the new flags-based Encodable checks.

DOTNET_EnableAVX512F=1 (this PR, AVX512-enabled hardware)

Warning: Different compilers used for base and diff JITs. Results may be misleading.

Overall (+0.25% to +2.01%)
MinOpts (+0.77% to +4.25%)
FullOpts (+0.25% to +1.63%)

DOTNET_EnableAVX512F=0 (this PR, non-AVX512 hardware)

Warning: Different compilers used for base and diff JITs. Results may be misleading.

Overall (+0.21% to +0.41%)
MinOpts (+0.52% to +0.79%)
FullOpts (+0.18% to +0.25%)
I believe you're underestimating the impact the additional registers have on LSRA.

This also tracks with the regression on x86 being much less (up to 1.5% rather than up to 4.26%), since x86 is only going from 16 registers (8 integer + 8 floating) to 24 registers (8 integer + 8 floating + 8 mask), i.e. 50% more registers. The 8 places in LSRA that iterate over the full register set each now execute proportionally more instructions.
However, as the profile shows, these aren't "hot paths" for the code, so while we are executing more instructions, they don't actually cause any real change to wall clock time. You have to factor in not only how many instructions execute, but where they execute and what they actually cost.

As a corollary example, see #83479. That was an extremely simple change, yet it showed up as a measurable difference in the SPMI TP metric.

None of the increases to CG2 wall clock time have been due to the AVX512 support added so far. I don't expect that to change with this PR either. However, if it does, we have some identified real hot spots from the profiles above that show us where we can most easily win that back.
The -0.28% was in libraries.crossgen2 MinOpts. There are 15 of those contexts, so it is not really a representative sample (indeed, we wouldn't expect many prejitted MinOpts contexts). The overall impact of that change on crossgen2 collections was much smaller, and certainly within the variance of crossgen2 throughput (which is large -- several percent).

I agree that actual profilers can give better data for the "macro" changes, but I think doing so on an SPMI run would have much less variance. Have we spent time with @SingleAccretion's PIN tool on this change yet? Can you post the detailed breakdown from that?
The PIN tool uses Intel PIN, which is exactly what VTune uses under the covers. AMD uProf uses a similar set of model-specific registers for tracking precise instruction counts as well. It is not going to give any additional information beyond what the hardware-specific profilers are reporting, which I already shared above.

If someone else wants to spend the extra cycles to build a local copy of the tool, collect some metrics, etc., then please feel free. I have several other work items I need to focus on and have already given significant data showing that this is likely not going to be problematic in practice. In the off chance that it is measurably problematic (which will be quickly caught in the weekly perf triage), then we have a one-line change that will allow us to disable AVX512 support and investigate this more in depth at that time.
Surprised to see any MinOpts contexts in crossgen at all; I guess those are methods explicitly marked to skip optimization.
The general point of the example was that we have made a plethora of changes in the past few months that have impacted the SPMI-tracked TP metric, both in terms of "regressions" and in terms of "improvements", for both MinOpts and optimized code. Despite this, and despite many of them being "significant" changes to the TP percentage, there has been no "real" change to the TP performance of crossgen2. The only real regression was caused by the introduction of a non-trivial amount of new managed code, and the only real improvement was caused by code simplification in part of the JIT's front end on a PR that did have some measured TP impact.

Instructions also range widely in cost, so raw instruction counts don't translate directly into wall clock time.
Our crossgen2 benchmarks don't measure TP for tier0, do they? The problem the team was worried about is that Tier0 regressions are way more important than Tier1/CG/NAOT, because every Tier0 compilation by definition means that application execution is stopped in a stub, waiting for the JIT to finish. So the more time we spend in Tier0 jitting, the slower the startup. It's way less important for non-MinOpts, where we promote methods asynchronously. Although I agree with you that the PIN tool doesn't directly map to actual overhead.

@kunalspathak, can we kick off a TE run from this PR via the bot to see how "startup time / time to first response" changes? (Ideally with ReadyToRun=0, if possible.)
They do not by default, but many of the tracked TP improvements/regressions have applied equally or even more to optimized code than to MinOpts. I've also explicitly measured crossgen of debug corelib using the release JIT, which does do MinOpts for everything.

We have extremely good CPI in LSRA, and while it is overall one of the hotter (in terms of total actual cycles spent executing) pieces of code in the JIT currently, an increase in instructions there is much less consequential than changes to paths with worse CPI. It also means that improving some of the other hot spots the profiles call out would give a bigger win. We similarly spend a significant amount of time in a few other routines that the profiles identify.

All of these are relatively low-hanging fruit that would provide real, measurable improvement to the JIT and are a significantly better use of our time.
/benchmark json aspnet-citrine-win runtime
Benchmark started for json on aspnet-citrine-win with runtime. Logs: link
I don't think that is possible via the bot, short of doing a manual run.
Not sure why the results are not automatically posted, but here is what I get from the link.
So I presume, then, that we can't draw any conclusions from the aspnet perf data?
Not without more runs, due to the existing noisiness. This PR in the results above is actually showing better time to first request by 9ms and better overall latency up through P99, but also 2ms slower build time and 4ms slower start time. This is all within the existing standard deviation, so we can at least speculate that there isn't a "substantial" improvement or regression.
I generated a PIN diff (using the https://github.com/SingleAccretion/Dotnet-Runtime.Dev#analyze-pin-trace-diffps1---diff-the-traces-produced-by-the-pin-tool script) with a baseline of this PR with AVX512F disabled and this PR as the diff, using the libraries.pmi.windows.x64.checked.mch collection. These are all of the regressions (there were also per-function improvements), almost all in LSRA, as expected due to the additional registers:
A PIN per-function diff between this PR and the baseline:
I feel like we've analyzed this enough and there is data here that can drive future TP improvement activities.
Right, this is basically what I covered above:
This should get fixed with PGO data, but if it still isn't outlined explicitly after the next PGO update, we can refactor it a bit so the core check is inlined and the more complex bit of logic is not.
I did some wall clock measurements of ASP.NET tier0 contexts. The results were the following, in milliseconds, taken from SPMI's output:

base = {25776.552800, 26043.440300, 25901.721800, 25787.304400,
        26002.092800, 25677.519400, 25641.274600, 25619.626000,
        25898.820300, 26292.012000};
diff = {26312.449900, 26448.550100, 26414.532700, 26370.734300,
        26243.087700, 26335.908700, 26299.542700, 26181.570600,
        26385.097500, 26297.585800};

The 95% confidence interval of this is a regression in wall clock time of between 1.2% and 2.4%. This includes time spent in the SPMI part of the JIT-EE interface, so the real regression is a bit higher (though probably not much; IIRC we spend around 10-15% of SPMI replay in SPMI itself).

Just some more data to help prioritize follow-up work. Maybe open an issue to track recouping some of this, with an appropriate milestone?
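For reference, here is a self-contained sketch of the interval math implied above (the use of a Welch two-sample interval and the hardcoded t critical value are assumptions about how the quoted numbers were derived, not the actual tooling used):

```cpp
// Computes the mean regression and an approximate 95% confidence interval for the
// wall clock numbers above, expressed as a percentage of the base mean.
#include <cmath>
#include <cstdio>
#include <vector>

static double mean(const std::vector<double>& v)
{
    double sum = 0.0;
    for (double x : v)
        sum += x;
    return sum / v.size();
}

static double sampleVariance(const std::vector<double>& v, double m)
{
    double ss = 0.0;
    for (double x : v)
        ss += (x - m) * (x - m);
    return ss / (v.size() - 1);
}

int main()
{
    std::vector<double> base = {25776.5528, 26043.4403, 25901.7218, 25787.3044, 26002.0928,
                                25677.5194, 25641.2746, 25619.6260, 25898.8203, 26292.0120};
    std::vector<double> diff = {26312.4499, 26448.5501, 26414.5327, 26370.7343, 26243.0877,
                                26335.9087, 26299.5427, 26181.5706, 26385.0975, 26297.5858};

    double mb    = mean(base);
    double md    = mean(diff);
    double se    = std::sqrt(sampleVariance(base, mb) / base.size() +
                             sampleVariance(diff, md) / diff.size());
    double tCrit = 2.18; // approximate two-sided 95% t value for ~12 Welch degrees of freedom (assumption)
    double delta = md - mb;

    printf("regression: %.2f%% (95%% CI %.2f%% .. %.2f%%)\n", 100.0 * delta / mb,
           100.0 * (delta - tCrit * se) / mb, 100.0 * (delta + tCrit * se) / mb);
    return 0;
}
```

With the numbers above this prints roughly a 1.8% mean regression with an interval of about 1.2% to 2.4%, matching the quoted range.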